Chapter 13: Comprehensive Bias Detection and Mitigation

Learning Objectives

By the end of this chapter, readers will be able to:

Define and distinguish between multiple fairness metrics appropriate for healthcare contexts, understanding when each is applicable and the tradeoffs between them including demographic parity, equalized odds, calibration within groups, and predictive value parity.
Implement comprehensive bias detection frameworks that systematically evaluate model performance across patient subgroups defined by protected attributes, social determinants of health, and care setting characteristics while accounting for intersectional identities.
Detect proxy discrimination where models rely on features correlated with protected attributes to make decisions, even when protected attributes themselves are not directly used as model inputs, through feature importance analysis and causal mediation techniques.
Analyze feature attribution patterns across demographic subgroups to identify when models are relying on different features or weighing the same features differently for different patient populations, which may indicate problematic learned associations.
Audit trained models for unexpected or harmful associations between patient characteristics and predicted outcomes that reflect historical discrimination rather than legitimate medical relationships, using techniques including model probing, counterfactual analysis, and systematic bias testing.
Design validation studies with adequate statistical power to detect meaningful disparities in model performance across subgroups, avoiding both false negatives that miss real bias and false positives from multiple testing without appropriate correction.
Integrate bias detection into production model development pipelines through automated testing frameworks that surface fairness concerns before deployment, with clear escalation paths when bias is detected.

13.1 Introduction: Why Bias Detection Requires Moving Beyond Aggregate Metrics

Healthcare AI systems have demonstrated remarkable performance when evaluated using standard aggregate metrics like overall accuracy, area under the receiver operating characteristic curve, or mean squared error. Yet multiple high-profile examples have revealed how models achieving strong average performance can exhibit severe disparities across patient subgroups, systematically failing the very populations most in need of equitable care. The algorithm widely used for referral to intensive care management programs systematically underestimated illness severity for Black patients compared to white patients with equivalent actual health needs, leading to Black patients requiring substantially higher predicted costs before being selected for additional care. Pulse oximetry algorithms used to measure blood oxygen saturation, calibrated predominantly on patients with lighter skin tones, produced less accurate readings for patients with darker skin, potentially delaying recognition of hypoxemia and appropriate clinical interventions. Commercial risk prediction models for hospital readmission exhibited worse calibration for patients from neighborhoods with high poverty rates, producing unreliable probability estimates that could mislead clinical decision-making about post-discharge support needs.

These failures share common characteristics that aggregate performance metrics fail to capture. Models may achieve strong average performance by optimizing primarily for majority populations while accepting poor performance for underrepresented groups. Calibration may hold on average while varying substantially across subgroups, meaning predicted probabilities are reliable for some patients but misleading for others. Feature importance and contribution patterns may differ across demographic groups in ways that reflect problematic learned associations rather than legitimate medical relationships. Models may rely on proxy variables that encode social determinants of health or structural barriers to care rather than actual clinical need, systematically disadvantaging patients facing these barriers.

Detecting these disparities requires moving beyond aggregate evaluation to detailed examination of model behavior across patient subgroups defined by protected attributes including race, ethnicity, sex, age, language, insurance status, and geographic location, social determinants of health including neighborhood poverty, housing instability, food insecurity, and access to transportation, care setting characteristics including safety-net versus private hospitals, academic medical centers versus community health centers, and rural versus urban locations, and intersectional identities recognizing that patients may face compounded disadvantages based on combinations of demographic characteristics and social circumstances.

This chapter develops comprehensive approaches for bias detection and measurement in healthcare AI, beginning with theoretical foundations of fairness metrics and their applicability to clinical contexts, then progressing through practical implementations of bias testing frameworks, techniques for detecting proxy discrimination and analyzing feature attribution patterns, methods for auditing models for harmful associations, approaches for designing adequately powered validation studies, and integration of bias detection into production development pipelines. Throughout, we emphasize that bias detection is necessary but not sufficient for achieving fairness. Identifying disparities is only the first step toward addressing them through improved data collection, modified modeling approaches, or constrained optimization that explicitly prioritizes fairness alongside accuracy.

The stakes are particularly high in healthcare applications affecting underserved populations. When models fail to perform equitably, they can perpetuate and even amplify existing disparities in access to care, quality of treatment, and health outcomes. Patients who already face substantial barriers to healthcare access may be further disadvantaged by biased algorithms that misestimate their clinical needs, misdirect limited resources away from those most in need, or encode discriminatory patterns from historical data that reflect systemic racism and structural inequities rather than medical reality. Rigorous bias detection is therefore not merely a technical exercise but an ethical imperative in the pursuit of health equity.

13.2 Fairness Metrics and Their Healthcare Applications

Formalizing fairness requires translating intuitive notions of equitable treatment into mathematical definitions that can be measured and optimized. However, fairness is not a single concept but rather a family of related but distinct requirements that often conflict with each other, requiring careful consideration of which fairness criteria are most appropriate for specific healthcare applications. This section develops the major fairness metrics used in machine learning and analyzes their applicability, limitations, and tradeoffs in clinical contexts.

13.2.1 Theoretical Foundations of Fairness Metrics

We begin by establishing notation and core concepts. Consider a predictive model that takes patient features $X$ and produces predictions $\hat{Y}$ for some outcome of interest $Y$. Patients are characterized by protected attributes $A$ that might include race, ethnicity, sex, age, or other demographic characteristics we wish to ensure receive fair treatment. In healthcare, we often care about fairness with respect to multiple protected attributes simultaneously and must consider intersectional identities where patients belong to multiple potentially disadvantaged groups.

The fundamental tension in fairness metrics arises from different intuitions about what constitutes fair treatment. One perspective emphasizes treating similar individuals similarly regardless of protected attributes, which leads naturally to fairness through unawareness, requiring that protected attributes not directly influence predictions. However, fairness through unawareness is widely recognized as insufficient because models can rely on features correlated with protected attributes to effectively use information about group membership even when the protected attributes themselves are excluded from model inputs. Furthermore, in healthcare contexts, demographic characteristics may have legitimate clinical relevance for example, sex affects disease presentation and medication metabolism through biological mechanisms that should inform clinical decision-making, making complete unawareness both infeasible and potentially harmful.

An alternative perspective emphasizes equalizing outcomes or treatment across groups defined by protected attributes, leading to statistical parity or demographic parity metrics that require equal rates of positive predictions across groups. Yet another perspective focuses on equalizing error rates across groups, leading to equalized odds or equal opportunity metrics that require similar false positive and false negative rates. Finally, calibration-based fairness emphasizes that predicted probabilities should be reliable within each group, meaning that among patients predicted to have a given probability of an outcome, the actual outcome rate should match the prediction regardless of group membership.

These different fairness notions formalize distinct ethical principles about what constitutes fair treatment, and they often conflict with each other in ways that require explicit choices about priorities. Understanding these metrics, their relationships, and their implications for healthcare applications is essential for developing models that advance rather than undermine health equity.

Demographic parity, also called statistical parity or independence, requires that the prediction $\hat{Y}$ be independent of the protected attribute $A$. Formally: $P(\hat{Y}=1\lvert A=a) = P(\hat{Y}=1 \rvert A=a')$ for all groups $a, a'$. This means that positive predictions should occur at equal rates across groups regardless of their membership in protected classes. In a healthcare context, demographic parity would require that a model predicting referral to a care management program selects patients from different racial groups at equal rates.

The appeal of demographic parity is its simplicity and its alignment with anti-discrimination law in many jurisdictions, which prohibits disparate impact where policies with facially neutral rules have substantially different effects on protected groups. However, demographic parity has substantial limitations in healthcare applications. If the actual prevalence of a condition or the actual need for an intervention differs across groups due to factors including biological differences, social determinants of health affecting disease risk, or differential exposure to health-harming environmental conditions, then demographic parity would require either treating dissimilar patients similarly or allowing the model to be less accurate for some groups in order to equalize selection rates.

Consider a model predicting which patients should receive intensive blood pressure monitoring. If structural factors including food insecurity, housing instability, chronic stress from discrimination, and limited access to preventive care lead to genuinely higher rates of poorly controlled hypertension in certain populations, demographic parity would prevent the model from appropriately directing more monitoring resources to these higher-need populations. This creates a fundamental tension where demographic parity can conflict with the goal of directing healthcare resources toward those with greatest medical need.

Despite these limitations, demographic parity remains relevant in healthcare contexts where there is strong reason to believe that outcome prevalence should not differ across groups but observed differences reflect measurement bias or structural barriers rather than true differences in clinical need. The algorithm that systematically underestimated illness severity for Black patients provided a compelling example where demographic parity was appropriate. The algorithm used healthcare costs as a proxy for illness severity, and demographic parity in predicted costs would have revealed that Black and white patients with equivalent actual health needs had systematically different predicted costs due to Black patients receiving less intensive care for the same conditions, leading to lower costs that reflected discrimination in care delivery rather than differences in health needs.

Equalized odds, also called separation, requires that the prediction $\hat{Y}$ be independent of the protected attribute $A$ conditional on the true outcome $Y$. Formally: $P(\hat{Y}=1\lvert A=a, Y=y) = P(\hat{Y}=1 \rvert A=a', Y=y)$ for all groups $a, a'$ and outcomes $y$. This means that the true positive rate (sensitivity) and false positive rate (1 minus specificity) must be equal across groups. Patients with the same true outcome should have equal probability of being predicted to have that outcome regardless of their group membership.

In healthcare applications, equalized odds has strong appeal because it ensures that the model’s errors are distributed equally across groups. A diagnostic model satisfying equalized odds would have equal sensitivity for detecting disease in different racial groups, ensuring that all patients with the condition have equal probability of being correctly diagnosed regardless of race. Similarly, equal false positive rates ensure that healthy patients from different groups face equal risk of being incorrectly diagnosed and subjected to unnecessary follow-up testing or treatment.

However, equalized odds can be challenging to achieve in practice, particularly when groups differ in their distribution of features or when the optimal decision boundaries differ across groups due to heterogeneous treatment effects or risk-benefit tradeoffs. Recent theoretical work has established that perfect equalized odds is generally impossible when groups have different feature distributions, and attempting to enforce it can require substantial accuracy sacrifices that may ultimately harm all patients. Furthermore, equal error rates do not guarantee equal outcomes. A model with equal sensitivity across groups but substantial differences in disease prevalence would lead to very different numbers of patients correctly diagnosed in each group, raising questions about whether equalized odds alone ensures fairness.

Equal opportunity is a relaxation of equalized odds that requires only equal true positive rates across groups, allowing false positive rates to differ. Formally: $P(\hat{Y}=1\lvert A=a, Y=1) = P(\hat{Y}=1 \rvert A=a', Y=1)$ for all groups $a, a'$. This metric focuses specifically on ensuring that patients who actually have the condition or need the intervention receive equal probability of being correctly identified regardless of group membership.

For healthcare applications, equal opportunity is particularly appealing when false negatives are substantially more harmful than false positives. A model predicting need for cancer screening that satisfies equal opportunity would have equal sensitivity for identifying patients who actually have cancer across racial groups, ensuring that all patients with cancer have equal probability of being detected and treated regardless of race. If false positives result in additional screening that carries low risk and moderate cost, while false negatives lead to delayed cancer diagnosis with potentially fatal consequences, we may accept differential false positive rates to ensure equal true positive rates across groups.

Calibration within groups requires that predicted probabilities be reliable within each group defined by protected attributes. Formally, a model is calibrated within group $a$ if $P(Y=1\mid\hat{P}=p, A=a) = p$ for all predicted probabilities $p$. This means that among patients in group $a$ who are assigned a predicted probability $p$, the actual outcome rate should equal $p$.

Calibration has particular importance in clinical decision-making because predicted probabilities directly inform treatment decisions through expected utility calculations that balance benefits and harms of interventions. When a model predicts a 30 percent probability that a patient will experience an adverse event, clinicians use this probability estimate to decide whether the expected benefits of a preventive intervention outweigh its risks and costs. If the model is poorly calibrated for certain patient subgroups, clinicians making treatment decisions for these patients are working with misleading probability estimates that can lead to both overtreatment of low-risk patients and undertreatment of high-risk patients.

Research has documented numerous examples of models exhibiting good overall calibration but poor calibration within demographic subgroups. Risk prediction models may be well-calibrated on average because overestimation for some groups and underestimation for others cancel out in aggregate evaluation, while producing unreliable probability estimates for any specific group. This form of miscalibration can systematically disadvantage certain populations by misdirecting clinical resources based on inaccurate risk estimates.

However, calibration alone is insufficient to ensure fairness. A model that always predicts base rate probabilities for each group (predicting 10 percent risk for all members of a group with 10 percent outcome prevalence) achieves perfect calibration within groups but provides no individual-level discrimination and would be clinically useless. Furthermore, calibration can be satisfied while substantial disparities exist in other fairness metrics. Recent theoretical work has proven that calibration, separation (equalized odds), and independence (demographic parity) cannot all be satisfied simultaneously except in degenerate cases, requiring explicit prioritization among these competing fairness criteria.

Predictive value parity focuses on positive predictive value and negative predictive value being equal across groups. Formally: $P(Y=1\lvert \hat{Y}=1, A=a) = P(Y=1 \rvert\hat{Y}=1, A=a')$ and $P(Y=0\lvert \hat{Y}=0, A=a) = P(Y=0 \rvert\hat{Y}=0, A=a')$ for all groups $a, a'$. This means that among patients predicted to be positive, the actual positive rate should be equal across groups, and similarly for negative predictions.

Predictive value parity has intuitive appeal because it ensures that a positive prediction means the same thing regardless of patient characteristics. A model predicting likelihood of hospital readmission that satisfies predictive value parity would have equal positive predictive value across racial groups, meaning that among patients predicted to be high risk for readmission, the actual readmission rate is the same whether the patient is Black, white, Hispanic, or Asian. This ensures that interventions targeted based on model predictions would serve populations with equal need across different racial groups.

However, predictive value parity is strongly influenced by outcome prevalence, which often differs across populations due to social determinants of health. When outcome prevalence differs across groups, achieving equal predictive values requires either accepting different sensitivity or specificity across groups or sacrificing overall accuracy. This creates tensions similar to those faced with demographic parity when actual need differs across populations.

13.2.2 Impossibility Results and Necessary Tradeoffs

Theoretical analysis has established fundamental impossibility results demonstrating that multiple fairness criteria cannot be simultaneously satisfied except under restrictive conditions. The most consequential result proves that calibration within groups, separation (equalized odds), and independence (demographic parity) cannot all hold simultaneously unless the outcome is independent of both features and group membership (a degenerate case) or groups have identical feature and outcome distributions (eliminating any meaningful fairness concerns).

These impossibility results have profound implications for healthcare AI development. They establish that perfect fairness according to all reasonable definitions is mathematically impossible, requiring explicit choices about which fairness criteria to prioritize for specific applications. These choices involve value judgments about what constitutes fair treatment in healthcare contexts and cannot be resolved through technical methods alone.

Consider a model predicting risk of cardiovascular disease used to determine eligibility for preventive statin therapy. Suppose cardiovascular disease risk genuinely differs between men and women at given ages due to both biological factors and differential exposure to risk factors. A model that is well-calibrated within sex groups will necessarily violate demographic parity, predicting higher risk for men than women at similar ages if men actually have higher risk. A model enforcing demographic parity to predict equal rates of high risk across sexes would be miscalibrated for at least one sex, producing unreliable probability estimates that could lead to both overtreatment and undertreatment.

The impossibility results do not mean that fairness is unachievable but rather that we must make explicit choices about priorities. For many healthcare applications, calibration within groups is essential because clinical decisions rely on accurate probability estimates. Models that violate calibration produce misleading risk estimates that can lead to harmful treatment decisions regardless of whether other fairness criteria are satisfied. Equalized odds ensures that errors are distributed equally across groups, which has strong ethical justification when we cannot perfectly predict outcomes and want error burdens to be borne equally. Demographic parity remains important in contexts where outcome rates should be equal across groups or where observed differences reflect measurement bias or discrimination rather than true differences in need.

Different healthcare applications may reasonably prioritize different fairness criteria based on the clinical context, the nature of the decision being supported, the relative harms of different types of errors, and empirical evidence about whether observed outcome differences across groups reflect true differences in clinical need versus artifacts of discrimination or measurement bias. The key is making these prioritization decisions explicitly and transparently rather than optimizing for overall accuracy while ignoring fairness concerns.

13.2.3 Production Implementation: Comprehensive Fairness Metrics

We now implement a production-grade framework for computing multiple fairness metrics across patient subgroups, supporting both binary classification and probability prediction tasks with extensive documentation and error handling.

import numpy as np
import pandas as pd
from typing import Dict, List, Optional, Tuple, Union
from dataclasses import dataclass
from collections import defaultdict
import warnings
from scipy import stats
from sklearn.metrics import (
    confusion_matrix,
    roc_auc_score,
    brier_score_loss,
    calibration_curve
)
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

@dataclass
class FairnessMetrics:
    """Container for fairness metric results."""

    group: str
    demographic_parity_diff: float
    equalized_odds_diff: float
    equal_opportunity_diff: float
    predictive_parity_diff: float
    calibration_error: float
    auc: float
    brier_score: float

    # Confusion matrix elements
    true_positives: int
    true_negatives: int
    false_positives: int
    false_negatives: int

    # Rates
    positive_rate: float
    true_positive_rate: float  # sensitivity/recall
    true_negative_rate: float  # specificity
    false_positive_rate: float
    false_negative_rate: float
    positive_predictive_value: float  # precision
    negative_predictive_value: float

    # Sample sizes
    n_total: int
    n_positive: int
    n_negative: int

    # Calibration details
    calibration_bins: Optional[np.ndarray] = None
    calibration_fractions: Optional[np.ndarray] = None
    calibration_bin_means: Optional[np.ndarray] = None

class FairnessEvaluator:
    """
    Comprehensive fairness evaluation framework for healthcare models.

    Computes multiple fairness metrics across demographic subgroups
    including demographic parity, equalized odds, calibration, and
    predictive parity. Provides detailed analysis of where and how
    models exhibit bias.

    This implementation is designed for production use with:
    - Extensive input validation and error handling
    - Clear warnings about statistical power and interpretation
    - Detailed logging of computation steps
    - Support for both binary predictions and probability scores
    - Stratified analysis across intersectional identities
    """

    def __init__(
        self,
        min_group_size: int = 50,
        alpha: float = 0.05,
        calibration_bins: int = 10,
        bootstrap_iterations: int = 1000,
        random_state: int = 42
    ):
        """
        Initialize fairness evaluator.

        Parameters
        ----------
        min_group_size : int, default=50
            Minimum sample size for a subgroup to be evaluated.
            Groups smaller than this trigger warnings about limited
            statistical power.
        alpha : float, default=0.05
            Significance level for statistical tests
        calibration_bins : int, default=10
            Number of bins for calibration curve estimation
        bootstrap_iterations : int, default=1000
            Number of bootstrap samples for confidence intervals
        random_state : int, default=42
            Random seed for reproducibility
        """
        self.min_group_size = min_group_size
        self.alpha = alpha
        self.calibration_bins = calibration_bins
        self.bootstrap_iterations = bootstrap_iterations
        self.random_state = random_state

        self.results: Dict[str, FairnessMetrics] = {}
        self.warnings: List[str] = []

    def evaluate(
        self,
        y_true: np.ndarray,
        y_pred: np.ndarray,
        y_prob: Optional[np.ndarray] = None,
        protected_attributes: Optional[pd.DataFrame] = None,
        group_names: Optional[List[str]] = None
    ) -> Dict[str, FairnessMetrics]:
        """
        Compute comprehensive fairness metrics across demographic groups.

        Parameters
        ----------
        y_true : np.ndarray
            True binary labels (0 or 1)
        y_pred : np.ndarray
            Predicted binary labels (0 or 1)
        y_prob : np.ndarray, optional
            Predicted probabilities for positive class.
            Required for calibration analysis.
        protected_attributes : pd.DataFrame, optional
            DataFrame where each column is a protected attribute
            (e.g., 'race', 'sex', 'age_group') and rows correspond
            to samples. If None, computes overall metrics only.
        group_names : List[str], optional
            Names of groups to analyze. If None, analyzes all
            unique values in protected_attributes.

        Returns
        -------
        Dict[str, FairnessMetrics]
            Dictionary mapping group names to fairness metrics
        """
        # Input validation
        self._validate_inputs(y_true, y_pred, y_prob, protected_attributes)

        # Compute overall metrics as baseline
        overall_metrics = self._compute_metrics(
            y_true, y_pred, y_prob, "Overall"
        )
        self.results["Overall"] = overall_metrics

        # If no protected attributes provided, return overall only
        if protected_attributes is None:
            logger.warning(
                "No protected attributes provided. "
                "Computing overall metrics only."
            )
            return self.results

        # Compute metrics for each demographic group
        for col in protected_attributes.columns:
            unique_groups = protected_attributes[col].unique()

            for group_val in unique_groups:
                group_mask = protected_attributes[col] == group_val
                group_label = f"{col}={group_val}"

                # Check minimum sample size
                n_group = group_mask.sum()
                if n_group < self.min_group_size:
                    warning_msg = (
                        f"Group '{group_label}' has only {n_group} samples, "
                        f"which is below minimum of {self.min_group_size}. "
                        "Results may have limited statistical power."
                    )
                    self.warnings.append(warning_msg)
                    logger.warning(warning_msg)

                # Compute metrics for this group
                group_metrics = self._compute_metrics(
                    y_true[group_mask],
                    y_pred[group_mask],
                    y_prob[group_mask] if y_prob is not None else None,
                    group_label
                )
                self.results[group_label] = group_metrics

        # Compute fairness disparities
        self._compute_fairness_disparities()

        logger.info(
            f"Completed fairness evaluation for {len(self.results)} groups"
        )

        return self.results

    def _validate_inputs(
        self,
        y_true: np.ndarray,
        y_pred: np.ndarray,
        y_prob: Optional[np.ndarray],
        protected_attributes: Optional[pd.DataFrame]
    ) -> None:
        """Validate input arrays and raise informative errors."""

        # Check y_true and y_pred same length
        if len(y_true) != len(y_pred):
            raise ValueError(
                f"y_true and y_pred must have same length. "
                f"Got {len(y_true)} and {len(y_pred)}"
            )

        # Check binary values
        if not np.all(np.isin(y_true, [0, 1])):
            raise ValueError("y_true must contain only 0 and 1")
        if not np.all(np.isin(y_pred, [0, 1])):
            raise ValueError("y_pred must contain only 0 and 1")

        # Check y_prob if provided
        if y_prob is not None:
            if len(y_prob) != len(y_true):
                raise ValueError(
                    f"y_prob must have same length as y_true. "
                    f"Got {len(y_prob)} and {len(y_true)}"
                )
            if not np.all((y_prob >= 0) & (y_prob <= 1)):
                raise ValueError("y_prob must contain probabilities in [0, 1]")

        # Check protected_attributes if provided
        if protected_attributes is not None:
            if len(protected_attributes) != len(y_true):
                raise ValueError(
                    f"protected_attributes must have same length as y_true. "
                    f"Got {len(protected_attributes)} and {len(y_true)}"
                )

    def _compute_metrics(
        self,
        y_true: np.ndarray,
        y_pred: np.ndarray,
        y_prob: Optional[np.ndarray],
        group_label: str
    ) -> FairnessMetrics:
        """
        Compute all metrics for a single group.

        Parameters
        ----------
        y_true : np.ndarray
            True labels for this group
        y_pred : np.ndarray
            Predicted labels for this group
        y_prob : np.ndarray, optional
            Predicted probabilities for this group
        group_label : str
            Name of this group for logging

        Returns
        -------
        FairnessMetrics
            Computed metrics for this group
        """
        # Compute confusion matrix
        tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

        n_total = len(y_true)
        n_positive = (y_true == 1).sum()
        n_negative = (y_true == 0).sum()

        # Compute rates with division by zero handling
        def safe_divide(numerator, denominator, default=0.0):
            return numerator / denominator if denominator > 0 else default

        positive_rate = (y_pred == 1).sum() / n_total
        true_positive_rate = safe_divide(tp, tp + fn)
        true_negative_rate = safe_divide(tn, tn + fp)
        false_positive_rate = safe_divide(fp, fp + tn)
        false_negative_rate = safe_divide(fn, fn + tp)
        positive_predictive_value = safe_divide(tp, tp + fp)
        negative_predictive_value = safe_divide(tn, tn + fn)

        # Compute AUC and Brier score if probabilities provided
        if y_prob is not None and len(np.unique(y_true)) > 1:
            try:
                auc = roc_auc_score(y_true, y_prob)
                brier = brier_score_loss(y_true, y_prob)
            except ValueError as e:
                logger.warning(
                    f"Could not compute AUC/Brier for {group_label}: {e}"
                )
                auc = np.nan
                brier = np.nan
        else:
            auc = np.nan
            brier = np.nan

        # Compute calibration if probabilities provided
        calibration_error = np.nan
        calibration_fracs = None
        calibration_bin_means = None
        calibration_bins_array = None

        if y_prob is not None:
            try:
                fraction_of_positives, mean_predicted_value = calibration_curve(
                    y_true, y_prob, n_bins=self.calibration_bins, strategy='uniform'
                )

                # Expected calibration error (ECE)
                # Weight each bin by proportion of samples
                bin_indices = np.digitize(
                    y_prob,
                    bins=np.linspace(0, 1, self.calibration_bins + 1)[1:-1]
                )
                bin_counts = np.bincount(
                    bin_indices, minlength=self.calibration_bins
                )
                bin_weights = bin_counts / len(y_prob)

                calibration_error = np.sum(
                    bin_weights[:len(fraction_of_positives)] *
                    np.abs(fraction_of_positives - mean_predicted_value)
                )

                calibration_fracs = fraction_of_positives
                calibration_bin_means = mean_predicted_value
                calibration_bins_array = bin_counts

            except (ValueError, IndexError) as e:
                logger.warning(
                    f"Could not compute calibration for {group_label}: {e}"
                )

        return FairnessMetrics(
            group=group_label,
            demographic_parity_diff=np.nan,  # Computed later
            equalized_odds_diff=np.nan,  # Computed later
            equal_opportunity_diff=np.nan,  # Computed later
            predictive_parity_diff=np.nan,  # Computed later
            calibration_error=calibration_error,
            auc=auc,
            brier_score=brier,
            true_positives=int(tp),
            true_negatives=int(tn),
            false_positives=int(fp),
            false_negatives=int(fn),
            positive_rate=positive_rate,
            true_positive_rate=true_positive_rate,
            true_negative_rate=true_negative_rate,
            false_positive_rate=false_positive_rate,
            false_negative_rate=false_negative_rate,
            positive_predictive_value=positive_predictive_value,
            negative_predictive_value=negative_predictive_value,
            n_total=n_total,
            n_positive=int(n_positive),
            n_negative=int(n_negative),
            calibration_bins=calibration_bins_array,
            calibration_fractions=calibration_fracs,
            calibration_bin_means=calibration_bin_means
        )

    def _compute_fairness_disparities(self) -> None:
        """
        Compute fairness disparity metrics by comparing each group to overall.

        Updates FairnessMetrics objects in self.results with disparity measures.
        """
        if "Overall" not in self.results:
            logger.warning("No overall metrics found, cannot compute disparities")
            return

        overall = self.results["Overall"]

        for group_label, metrics in self.results.items():
            if group_label == "Overall":
                continue

            # Demographic parity: difference in positive prediction rates
            metrics.demographic_parity_diff = (
                metrics.positive_rate - overall.positive_rate
            )

            # Equalized odds: max difference in TPR and FPR
            tpr_diff = abs(
                metrics.true_positive_rate - overall.true_positive_rate
            )
            fpr_diff = abs(
                metrics.false_positive_rate - overall.false_positive_rate
            )
            metrics.equalized_odds_diff = max(tpr_diff, fpr_diff)

            # Equal opportunity: difference in TPR only
            metrics.equal_opportunity_diff = abs(
                metrics.true_positive_rate - overall.true_positive_rate
            )

            # Predictive parity: difference in PPV
            metrics.predictive_parity_diff = abs(
                metrics.positive_predictive_value -
                overall.positive_predictive_value
            )

    def get_summary_report(self, max_disparity_threshold: float = 0.05) -> str:
        """
        Generate human-readable summary report of fairness evaluation.

        Parameters
        ----------
        max_disparity_threshold : float, default=0.05
            Maximum acceptable disparity before flagging as concerning.
            Common thresholds are 0.05 (5%) or 0.10 (10%).

        Returns
        -------
        str
            Formatted summary report
        """
        lines = ["=" * 80]
        lines.append("FAIRNESS EVALUATION SUMMARY REPORT")
        lines.append("=" * 80)
        lines.append("")

        # Overall metrics
        if "Overall" in self.results:
            overall = self.results["Overall"]
            lines.append("OVERALL PERFORMANCE")
            lines.append(f"  Total samples: {overall.n_total}")
            lines.append(f"  Positive rate: {overall.positive_rate:.3f}")
            lines.append(f"  AUC: {overall.auc:.3f}")
            lines.append(f"  Brier score: {overall.brier_score:.3f}")
            lines.append(f"  Calibration error: {overall.calibration_error:.3f}")
            lines.append(f"  Sensitivity (TPR): {overall.true_positive_rate:.3f}")
            lines.append(f"  Specificity (TNR): {overall.true_negative_rate:.3f}")
            lines.append(f"  PPV: {overall.positive_predictive_value:.3f}")
            lines.append("")

        # Group-wise metrics
        lines.append("GROUP-WISE FAIRNESS METRICS")
        lines.append("-" * 80)

        flagged_groups = []

        for group_label, metrics in self.results.items():
            if group_label == "Overall":
                continue

            lines.append(f"\n{group_label}")
            lines.append(f"  Sample size: {metrics.n_total}")
            lines.append(f"  Demographic parity disparity: "
                        f"{metrics.demographic_parity_diff:+.3f}")
            lines.append(f"  Equalized odds disparity: "
                        f"{metrics.equalized_odds_diff:.3f}")
            lines.append(f"  Equal opportunity disparity: "
                        f"{metrics.equal_opportunity_diff:.3f}")
            lines.append(f"  Predictive parity disparity: "
                        f"{metrics.predictive_parity_diff:.3f}")
            lines.append(f"  Calibration error: {metrics.calibration_error:.3f}")

            # Flag groups exceeding threshold
            max_disparity = max([
                abs(metrics.demographic_parity_diff),
                metrics.equalized_odds_diff,
                metrics.equal_opportunity_diff,
                metrics.predictive_parity_diff
            ])

            if max_disparity > max_disparity_threshold:
                flagged_groups.append((group_label, max_disparity))
                lines.append(f"  ⚠️  FLAGGED: Disparity exceeds threshold")

        # Summary of flagged groups
        if flagged_groups:
            lines.append("\n" + "=" * 80)
            lines.append("GROUPS EXCEEDING FAIRNESS THRESHOLD")
            lines.append("=" * 80)
            for group, disparity in sorted(
                flagged_groups, key=lambda x: x[1], reverse=True
            ):
                lines.append(f"  {group}: max disparity = {disparity:.3f}")
        else:
            lines.append("\n" + "=" * 80)
            lines.append("✓ All groups within fairness threshold")
            lines.append("=" * 80)

        # Warnings
        if self.warnings:
            lines.append("\n" + "=" * 80)
            lines.append("WARNINGS")
            lines.append("=" * 80)
            for warning in self.warnings:
                lines.append(f"  • {warning}")

        return "\n".join(lines)

    def plot_fairness_metrics(
        self,
        metric_name: str = "equalized_odds_diff",
        figsize: Tuple[int, int] = (12, 6)
    ):
        """
        Create visualization of fairness metrics across groups.

        Parameters
        ----------
        metric_name : str
            Name of metric to plot. Options:
            - 'demographic_parity_diff'
            - 'equalized_odds_diff'
            - 'equal_opportunity_diff'
            - 'predictive_parity_diff'
            - 'calibration_error'
        figsize : Tuple[int, int]
            Figure size
        """
        try:
            import matplotlib.pyplot as plt
        except ImportError:
            logger.error("matplotlib required for plotting")
            return

        groups = []
        values = []

        for group_label, metrics in self.results.items():
            if group_label == "Overall":
                continue
            groups.append(group_label)
            values.append(getattr(metrics, metric_name))

        fig, ax = plt.subplots(figsize=figsize)
        ax.barh(groups, values)
        ax.axvline(x=0, color='black', linestyle='--', linewidth=0.5)
        ax.axvline(x=0.05, color='red', linestyle=':', linewidth=1, alpha=0.5)
        ax.axvline(x=-0.05, color='red', linestyle=':', linewidth=1, alpha=0.5)
        ax.set_xlabel(metric_name.replace('_', ' ').title())
        ax.set_title(f'Fairness Metric: {metric_name.replace("_", " ").title()}')
        ax.grid(axis='x', alpha=0.3)

        plt.tight_layout()
        return fig

This fairness evaluation framework provides comprehensive bias detection across multiple metrics with extensive documentation and warnings about statistical power. The implementation computes demographic parity, equalized odds, equal opportunity, predictive parity, and calibration metrics for each demographic subgroup while maintaining full transparency about sample sizes and statistical limitations. The framework generates human-readable reports that surface concerning disparities while providing sufficient detail for informed interpretation.

The key design choices reflect lessons from deploying fairness evaluation in production healthcare systems. First, explicit minimum sample size requirements with warnings ensure that practitioners understand when results may have limited statistical power rather than silently reporting potentially misleading metrics for small subgroups. Second, computing multiple fairness metrics acknowledges that different applications may prioritize different fairness criteria and that comprehensive evaluation requires examining model behavior from multiple perspectives. Third, comparison to overall performance provides intuitive benchmarks for interpreting group-specific metrics while acknowledging that overall performance may itself be problematic if it optimizes primarily for majority populations. Fourth, detailed confusion matrix elements and rates enable practitioners to understand not just whether disparities exist but their clinical implications in terms of sensitivity, specificity, and predictive values that directly inform healthcare decisions.

13.3 Detecting Proxy Discrimination and Feature Analysis

Even when protected attributes are excluded from model inputs, models can exhibit bias through proxy discrimination where they rely on features correlated with protected attributes to make predictions. This section develops methods for detecting proxy discrimination through feature importance analysis, causal mediation techniques, and systematic probing of model associations between patient characteristics and predictions.

13.3.1 Understanding Proxy Discrimination in Healthcare Data

Healthcare data contains numerous features that correlate with race, ethnicity, socioeconomic status, and other protected attributes while also having legitimate clinical associations with outcomes. These correlations create pathways for proxy discrimination even when models do not directly use protected attributes. The challenge is distinguishing between problematic proxy discrimination that perpetuates bias and legitimate medical associations that should inform clinical predictions.

Consider a model predicting risk of kidney disease that uses estimated glomerular filtration rate (eGFR) calculated from serum creatinine. Historically, eGFR equations included race-based adjustment terms that systematically inflated eGFR estimates for Black patients, delaying diagnosis and treatment of kidney disease in this population. These race-based adjustments encoded the false belief that Black patients naturally have higher muscle mass and thus higher baseline creatinine levels, ignoring that observed differences reflected environmental factors, nutrition, and socioeconomic circumstances rather than innate biological differences. When models use eGFR as a feature without accounting for these race-based adjustments, they indirectly encode racial bias through a clinical measurement that appears objective but contains embedded discrimination.

Healthcare utilization patterns provide another common source of proxy discrimination. Models predicting healthcare needs often use prior utilization including number of emergency department visits, hospitalizations, or outpatient appointments as features based on the reasonable assumption that patients with greater prior utilization have higher clinical needs. However, healthcare utilization reflects not only clinical need but also access barriers, insurance coverage, cultural factors affecting healthcare-seeking behavior, experiences of discrimination in healthcare settings, and transportation availability. Patients from underserved communities may have lower utilization despite equal or greater clinical need due to structural barriers rather than better health. Using utilization features as proxies for need can systematically underestimate severity for underserved populations.

Neighborhood characteristics including ZIP code, census tract poverty rates, or area deprivation indices are increasingly included in healthcare prediction models to account for social determinants of health. While these features can improve prediction accuracy by capturing important social context affecting health, they also create pathways for proxy discrimination. Neighborhood characteristics are highly correlated with race and ethnicity due to residential segregation resulting from historical redlining, ongoing discrimination in housing and lending, and economic inequality. Models using neighborhood features may inadvertently encode racial disparities in resource allocation, environmental exposures, or healthcare access rather than capturing clinically meaningful risk factors that should inform care.

Detecting proxy discrimination requires analyzing which features the model relies on for predictions, whether feature importance differs across demographic groups, and whether removing potentially problematic features or modifying how they are used reduces disparate impacts without substantially harming overall accuracy. This analysis must be grounded in clinical and social context that distinguishes between legitimate medical associations and problematic proxies for protected attributes.

13.3.2 Production Implementation: Feature Importance and Contribution Analysis

We implement a comprehensive framework for analyzing feature importance and contributions across demographic subgroups, using multiple complementary techniques to provide robust detection of proxy discrimination patterns.

import numpy as np
import pandas as pd
from typing import Dict, List, Optional, Tuple, Callable
from sklearn.inspection import permutation_importance
from sklearn.base import BaseEstimator
import shap
from dataclasses import dataclass
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

@dataclass
class FeatureImportanceResults:
    """Container for feature importance analysis results."""

    feature_name: str
    overall_importance: float
    group_importances: Dict[str, float]
    max_disparity: float
    disparity_ratio: float
    potentially_problematic: bool
    notes: List[str]

class ProxyDiscriminationDetector:
    """
    Framework for detecting proxy discrimination through feature analysis.

    Analyzes whether models rely on features differently across demographic
    groups, which may indicate proxy discrimination even when protected
    attributes are not directly used as model inputs.

    Implements multiple complementary approaches:
    - Permutation importance stratified by demographics
    - SHAP value analysis across groups
    - Contribution distribution comparisons
    - Correlation analysis between features and protected attributes
    """

    def __init__(
        self,
        model: BaseEstimator,
        feature_names: List[str],
        disparity_threshold: float = 0.2,
        n_permutations: int = 10,
        random_state: int = 42
    ):
        """
        Initialize proxy discrimination detector.

        Parameters
        ----------
        model : BaseEstimator
            Trained model to analyze
        feature_names : List[str]
            Names of features used by model
        disparity_threshold : float, default=0.2
            Threshold for flagging features with disparate importance.
            Features with importance ratio > 1 + threshold or < 1 - threshold
            across groups are flagged.
        n_permutations : int, default=10
            Number of permutations for importance estimation
        random_state : int, default=42
            Random seed for reproducibility
        """
        self.model = model
        self.feature_names = feature_names
        self.disparity_threshold = disparity_threshold
        self.n_permutations = n_permutations
        self.random_state = random_state

        self.results: Dict[str, FeatureImportanceResults] = {}
        self.warnings: List[str] = []

        # Check if model supports predict_proba for binary classification
        if hasattr(model, 'predict_proba'):
            self.scoring = 'neg_log_loss'
        else:
            self.scoring = 'accuracy'

    def analyze_feature_importance(
        self,
        X: pd.DataFrame,
        y: np.ndarray,
        protected_attributes: pd.DataFrame,
        potentially_problematic_features: Optional[List[str]] = None
    ) -> Dict[str, FeatureImportanceResults]:
        """
        Analyze feature importance overall and stratified by demographics.

        Parameters
        ----------
        X : pd.DataFrame
            Feature matrix
        y : np.ndarray
            Target labels
        protected_attributes : pd.DataFrame
            Protected attribute labels for stratification
        potentially_problematic_features : List[str], optional
            List of features that may serve as proxies (e.g., ZIP code,
            prior healthcare utilization). These receive additional scrutiny.

        Returns
        -------
        Dict[str, FeatureImportanceResults]
            Results for each feature
        """
        logger.info("Computing overall feature importance...")

        # Compute overall importance using permutation
        overall_importance = permutation_importance(
            self.model,
            X,
            y,
            n_repeats=self.n_permutations,
            random_state=self.random_state,
            scoring=self.scoring
        )

        # Store overall importance
        overall_importance_dict = {
            feature: overall_importance.importances_mean[i]
            for i, feature in enumerate(self.feature_names)
        }

        # Compute importance stratified by each protected attribute
        logger.info("Computing stratified feature importance...")

        for col in protected_attributes.columns:
            unique_groups = protected_attributes[col].unique()

            for group_val in unique_groups:
                group_mask = protected_attributes[col] == group_val
                group_label = f"{col}={group_val}"

                # Skip groups that are too small
                if group_mask.sum() < 50:
                    warning = (
                        f"Skipping group {group_label} with only "
                        f"{group_mask.sum()} samples"
                    )
                    self.warnings.append(warning)
                    logger.warning(warning)
                    continue

                # Compute importance for this group
                group_importance = permutation_importance(
                    self.model,
                    X[group_mask],
                    y[group_mask],
                    n_repeats=self.n_permutations,
                    random_state=self.random_state,
                    scoring=self.scoring
                )

                # Store group-specific importance
                for i, feature in enumerate(self.feature_names):
                    if feature not in self.results:
                        self.results[feature] = FeatureImportanceResults(
                            feature_name=feature,
                            overall_importance=overall_importance_dict[feature],
                            group_importances={},
                            max_disparity=0.0,
                            disparity_ratio=1.0,
                            potentially_problematic=False,
                            notes=[]
                        )

                    self.results[feature].group_importances[group_label] = (
                        group_importance.importances_mean[i]
                    )

        # Analyze disparities
        logger.info("Analyzing importance disparities...")
        self._analyze_disparities(potentially_problematic_features)

        return self.results

    def _analyze_disparities(
        self,
        potentially_problematic_features: Optional[List[str]] = None
    ) -> None:
        """
        Analyze importance disparities and flag concerning patterns.

        Parameters
        ----------
        potentially_problematic_features : List[str], optional
            Features to give special scrutiny
        """
        for feature_name, result in self.results.items():
            if len(result.group_importances) < 2:
                continue

            # Compute disparity metrics
            importances = list(result.group_importances.values())
            max_importance = max(importances)
            min_importance = min(importances)

            # Avoid division by zero
            if min_importance > 0:
                result.disparity_ratio = max_importance / min_importance
            else:
                result.disparity_ratio = np.inf

            result.max_disparity = max_importance - min_importance

            # Flag if disparity exceeds threshold
            if result.disparity_ratio > (1 + self.disparity_threshold):
                result.potentially_problematic = True
                result.notes.append(
                    f"Importance varies by {result.disparity_ratio:.2f}x "
                    "across demographic groups"
                )

            # Additional scrutiny for pre-specified problematic features
            if (potentially_problematic_features and
                feature_name in potentially_problematic_features):

                if result.disparity_ratio > 1.1:
                    result.potentially_problematic = True
                    result.notes.append(
                        "Feature flagged as potentially problematic proxy "
                        "and shows importance variation across groups"
                    )

                # Check correlation with protected attributes
                result.notes.append(
                    "Feature identified as potentially serving as proxy "
                    "for protected attributes - requires clinical review"
                )

    def analyze_shap_values(
        self,
        X: pd.DataFrame,
        protected_attributes: pd.DataFrame,
        background_samples: int = 100
    ) -> Dict[str, Dict[str, np.ndarray]]:
        """
        Compute SHAP values stratified by demographic groups.

        SHAP values provide individual-level feature attributions that
        explain how each feature contributed to each prediction. Analyzing
        SHAP value distributions across demographics can reveal if the
        model relies on features differently for different patient groups.

        Parameters
        ----------
        X : pd.DataFrame
            Feature matrix
        protected_attributes : pd.DataFrame
            Protected attributes for stratification
        background_samples : int, default=100
            Number of background samples for SHAP estimation

        Returns
        -------
        Dict[str, Dict[str, np.ndarray]]
            SHAP values organized by {group_label: {feature: values}}
        """
        logger.info("Computing SHAP values...")

        # Create SHAP explainer
        # For tree models, use TreeExplainer; otherwise use KernelExplainer
        try:
            if hasattr(self.model, 'tree_'):
                explainer = shap.TreeExplainer(self.model)
            else:
                # Sample background for kernel explainer
                background = shap.sample(X, background_samples)
                explainer = shap.KernelExplainer(
                    self.model.predict,
                    background
                )
        except Exception as e:
            logger.error(f"Could not create SHAP explainer: {e}")
            return {}

        shap_by_group = {}

        # Compute SHAP values for each demographic group
        for col in protected_attributes.columns:
            unique_groups = protected_attributes[col].unique()

            for group_val in unique_groups:
                group_mask = protected_attributes[col] == group_val
                group_label = f"{col}={group_val}"

                if group_mask.sum() < 50:
                    continue

                try:
                    # Compute SHAP values for this group
                    shap_values = explainer.shap_values(X[group_mask])

                    # Organize by feature
                    shap_by_group[group_label] = {
                        feature: shap_values[:, i]
                        for i, feature in enumerate(self.feature_names)
                    }

                except Exception as e:
                    logger.warning(
                        f"Could not compute SHAP values for {group_label}: {e}"
                    )

        return shap_by_group

    def detect_feature_correlations(
        self,
        X: pd.DataFrame,
        protected_attributes: pd.DataFrame,
        correlation_threshold: float = 0.3
    ) -> pd.DataFrame:
        """
        Detect correlations between features and protected attributes.

        High correlations indicate features that may serve as proxies
        for protected attributes even when the protected attributes
        themselves are not used as model inputs.

        Parameters
        ----------
        X : pd.DataFrame
            Feature matrix
        protected_attributes : pd.DataFrame
            Protected attributes
        correlation_threshold : float, default=0.3
            Threshold for flagging high correlations

        Returns
        -------
        pd.DataFrame
            Correlation matrix with flagged high correlations
        """
        # Combine features and protected attributes
        combined = pd.concat([X, protected_attributes], axis=1)

        # Compute correlations
        correlations = combined.corr()

        # Extract correlations between features and protected attributes
        feature_protected_corr = correlations.loc[
            self.feature_names,
            protected_attributes.columns
        ]

        # Flag high correlations
        high_corr = feature_protected_corr.abs() > correlation_threshold

        if high_corr.any().any():
            logger.warning(
                f"Found {high_corr.sum().sum()} feature-attribute pairs "
                f"with correlation > {correlation_threshold}"
            )

            for feature in self.feature_names:
                for attr in protected_attributes.columns:
                    if high_corr.loc[feature, attr]:
                        corr_value = feature_protected_corr.loc[feature, attr]
                        self.warnings.append(
                            f"Feature '{feature}' has correlation "
                            f"{corr_value:.3f} with '{attr}' - "
                            "may serve as proxy"
                        )

        return feature_protected_corr

    def generate_report(self) -> str:
        """Generate comprehensive proxy discrimination analysis report."""

        lines = ["=" * 80]
        lines.append("PROXY DISCRIMINATION ANALYSIS REPORT")
        lines.append("=" * 80)
        lines.append("")

        # Summary statistics
        n_features = len(self.results)
        n_flagged = sum(
            1 for r in self.results.values() if r.potentially_problematic
        )

        lines.append(f"Total features analyzed: {n_features}")
        lines.append(f"Features flagged as potentially problematic: {n_flagged}")
        lines.append("")

        # Detailed results for flagged features
        if n_flagged > 0:
            lines.append("FLAGGED FEATURES")
            lines.append("-" * 80)

            for feature_name, result in sorted(
                self.results.items(),
                key=lambda x: x[1].disparity_ratio,
                reverse=True
            ):
                if not result.potentially_problematic:
                    continue

                lines.append(f"\n{feature_name}")
                lines.append(f"  Overall importance: {result.overall_importance:.4f}")
                lines.append(f"  Disparity ratio: {result.disparity_ratio:.2f}x")
                lines.append(f"  Max disparity: {result.max_disparity:.4f}")
                lines.append("  Group-specific importances:")

                for group, importance in result.group_importances.items():
                    lines.append(f"    {group}: {importance:.4f}")

                if result.notes:
                    lines.append("  Notes:")
                    for note in result.notes:
                        lines.append(f"    - {note}")

        # Warnings
        if self.warnings:
            lines.append("\n" + "=" * 80)
            lines.append("WARNINGS AND RECOMMENDATIONS")
            lines.append("=" * 80)
            for warning in self.warnings:
                lines.append(f"  • {warning}")

        # Interpretation guidance
        lines.append("\n" + "=" * 80)
        lines.append("INTERPRETATION GUIDANCE")
        lines.append("=" * 80)
        lines.append(
            "Features with disparate importance across demographic groups may "
            "indicate proxy discrimination, where the model relies on different "
            "information for different patient populations. This warrants clinical "
            "review to determine if:"
        )
        lines.append("  1. The feature captures legitimate medical differences")
        lines.append("  2. The feature serves as a proxy for protected attributes")
        lines.append("  3. The feature reflects structural barriers to care")
        lines.append("")
        lines.append(
            "Features highly correlated with protected attributes should be "
            "examined for whether they introduce bias even if they also have "
            "clinical validity. Consider whether alternative features or "
            "preprocessing approaches could reduce proxy discrimination while "
            "maintaining clinical utility."
        )

        return "\n".join(lines)

This proxy discrimination detection framework provides multiple complementary perspectives on whether models rely on different features for different demographic groups. Permutation importance reveals which features matter most for predictions overall and within each subgroup. SHAP value analysis provides individual-level explanations that can be aggregated to understand typical contribution patterns. Correlation analysis identifies features that may serve as proxies for protected attributes based on their statistical associations rather than their use by the model.

The framework flags features exhibiting disparate importance across groups for further review while acknowledging that not all importance differences indicate problematic bias. Clinical expertise is essential for interpreting these results because some features may legitimately have different importance across populations due to biological differences, varying disease presentation patterns, or differential treatment effectiveness. The goal is surfacing concerning patterns for clinical review rather than automatically labeling features as problematic, which requires contextual judgment that cannot be fully automated.

13.4 Auditing Models for Harmful Associations

Beyond detecting disparities in aggregate metrics and feature importance patterns, comprehensive bias detection requires systematically probing models for specific harmful associations that reflect historical discrimination rather than legitimate medical relationships. This section develops approaches for auditing trained models using techniques including counterfactual analysis, systematic bias testing, and model probing experiments.

13.4.1 Counterfactual Analysis for Bias Detection

Counterfactual analysis examines how model predictions would change if specific patient attributes were different while holding all other features constant. For bias detection, we focus on counterfactuals that change protected attributes or features suspected to serve as proxies, asking whether predictions change in ways that suggest the model has learned discriminatory associations.

Consider a model predicting hospital readmission risk. Counterfactual analysis might ask how the predicted risk changes if we modify a patient’s ZIP code from a high-poverty area to a wealthy suburb while keeping all clinical features constant. If predicted risk decreases substantially with this counterfactual change despite identical clinical presentation, the model may be encoding associations between neighborhood characteristics and outcomes that reflect differential quality of outpatient care, medication adherence due to cost barriers, or food insecurity rather than inherent clinical differences. Similarly, changing a patient’s race from Black to white in a counterfactual while holding clinical features constant should not substantially alter risk predictions if the model is not encoding racial bias, but significant changes suggest the model has learned associations that may reflect historical discrimination in care delivery.

Counterfactual analysis for bias detection faces methodological challenges. The fundamental problem is that many features correlate with race and other protected attributes, making it unclear which features to hold constant when creating counterfactuals and which to allow to change. If we change a patient’s race while keeping income, education, neighborhood, and all other socioeconomic features constant, we create counterfactuals that represent rare or even impossible combinations, potentially producing unreliable model predictions. If we change race and allow correlated features to change as well, we cannot clearly attribute prediction differences to the protected attribute versus correlated features.

Furthermore, some counterfactual interventions may be clinically nonsensical if biological factors genuinely differ across groups. Changing sex while keeping all other features constant including pregnancy status, hormone levels, and anatomical characteristics produces counterfactuals that violate causal structure. Interpretation requires distinguishing between legitimate biological factors that should inform predictions and problematic learned associations that encode discrimination.

Despite these challenges, careful counterfactual analysis provides valuable evidence about model behavior that complements aggregate fairness metrics. The key is constructing counterfactuals that are both plausible given the data distribution and informative about specific bias concerns, then interpreting results with appropriate caution about limitations.

13.4.2 Production Implementation: Model Auditing Framework

We implement a comprehensive model auditing framework that combines counterfactual analysis, systematic bias testing with hand-crafted test cases, and statistical testing to detect harmful associations.

import numpy as np
import pandas as pd
from typing import Dict, List, Optional, Tuple, Callable, Any
from dataclasses import dataclass
from scipy import stats
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

@dataclass
class AuditResult:
    """Result from a single bias audit test."""

    test_name: str
    test_description: str
    passed: bool
    test_statistic: Optional[float]
    p_value: Optional[float]
    effect_size: Optional[float]
    details: Dict[str, Any]
    interpretation: str

class ModelAuditor:
    """
    Comprehensive framework for auditing models for harmful associations.

    Implements multiple complementary approaches:
    - Counterfactual analysis varying protected attributes
    - Systematic testing with hand-crafted test cases
    - Statistical tests for unexpected associations
    - Clinical scenario probing
    """

    def __init__(
        self,
        model: Any,
        feature_names: List[str],
        protected_attributes: List[str],
        alpha: float = 0.05
    ):
        """
        Initialize model auditor.

        Parameters
        ----------
        model : Any
            Trained model to audit. Must implement predict or predict_proba.
        feature_names : List[str]
            Names of input features
        protected_attributes : List[str]
            Names of protected attributes to test
        alpha : float, default=0.05
            Significance level for statistical tests
        """
        self.model = model
        self.feature_names = feature_names
        self.protected_attributes = protected_attributes
        self.alpha = alpha

        self.audit_results: List[AuditResult] = []

        # Check prediction method
        if hasattr(model, 'predict_proba'):
            self.predict_fn = lambda X: model.predict_proba(X)[:, 1]
        elif hasattr(model, 'predict'):
            self.predict_fn = model.predict
        else:
            raise ValueError("Model must implement predict or predict_proba")

    def audit_counterfactual_fairness(
        self,
        X: pd.DataFrame,
        protected_attr_col: str,
        reference_value: Any,
        alternative_value: Any,
        fixed_features: Optional[List[str]] = None,
        sample_size: int = 1000
    ) -> AuditResult:
        """
        Test counterfactual fairness by modifying protected attribute.

        Creates counterfactual pairs where protected attribute is changed
        while other features are held constant, then tests whether predictions
        change more than expected by chance.

        Parameters
        ----------
        X : pd.DataFrame
            Feature matrix
        protected_attr_col : str
            Name of protected attribute column to modify
        reference_value : Any
            Reference value for protected attribute (e.g., 'White')
        alternative_value : Any
            Alternative value for protected attribute (e.g., 'Black')
        fixed_features : List[str], optional
            Features to hold constant when creating counterfactuals.
            If None, holds all features except protected attribute constant.
        sample_size : int, default=1000
            Number of samples to use for analysis

        Returns
        -------
        AuditResult
            Results of counterfactual fairness test
        """
        test_name = "Counterfactual Fairness"
        test_description = (
            f"Testing if predictions change when {protected_attr_col} "
            f"changes from {reference_value} to {alternative_value}"
        )

        logger.info(f"Running test: {test_name}")

        # Sample observations with reference value
        reference_mask = X[protected_attr_col] == reference_value
        if reference_mask.sum() < sample_size:
            sample_size = reference_mask.sum()
            logger.warning(
                f"Only {sample_size} samples with {protected_attr_col}="
                f"{reference_value}, using all available"
            )

        reference_sample = X[reference_mask].sample(
            n=sample_size,
            random_state=42
        ).copy()

        # Get original predictions
        original_preds = self.predict_fn(reference_sample)

        # Create counterfactuals by changing protected attribute
        counterfactual_sample = reference_sample.copy()
        counterfactual_sample[protected_attr_col] = alternative_value

        # Get counterfactual predictions
        counterfactual_preds = self.predict_fn(counterfactual_sample)

        # Compute prediction differences
        pred_diffs = counterfactual_preds - original_preds

        # Test if mean difference is significantly different from zero
        t_stat, p_value = stats.ttest_1samp(pred_diffs, 0)
        mean_diff = np.mean(pred_diffs)

        # Compute effect size (Cohen's d)
        effect_size = mean_diff / np.std(pred_diffs) if np.std(pred_diffs) > 0 else 0

        # Determine if test passed
        passed = p_value > self.alpha or abs(effect_size) < 0.2

        # Interpretation
        if passed:
            interpretation = (
                f"Model predictions do not significantly differ when "
                f"{protected_attr_col} changes from {reference_value} to "
                f"{alternative_value} (p={p_value:.4f}, d={effect_size:.3f}). "
                "This suggests the model does not strongly encode bias through "
                "this protected attribute."
            )
        else:
            direction = "increase" if mean_diff > 0 else "decrease"
            interpretation = (
                f"Model predictions {direction} by an average of "
                f"{abs(mean_diff):.4f} when {protected_attr_col} changes from "
                f"{reference_value} to {alternative_value} "
                f"(p={p_value:.4f}, d={effect_size:.3f}). This suggests the "
                "model may have learned associations with the protected "
                "attribute that warrant clinical review."
            )

        return AuditResult(
            test_name=test_name,
            test_description=test_description,
            passed=passed,
            test_statistic=t_stat,
            p_value=p_value,
            effect_size=effect_size,
            details={
                'mean_diff': mean_diff,
                'std_diff': np.std(pred_diffs),
                'median_diff': np.median(pred_diffs),
                'max_diff': np.max(np.abs(pred_diffs)),
                'pct_increased': (pred_diffs > 0).mean() * 100,
                'pct_decreased': (pred_diffs < 0).mean() * 100,
                'sample_size': sample_size
            },
            interpretation=interpretation
        )

    def audit_scenario_pairs(
        self,
        scenario_pairs: List[Tuple[pd.DataFrame, pd.DataFrame, str]],
        expected_relationship: str = "equal"
    ) -> List[AuditResult]:
        """
        Test model behavior on hand-crafted scenario pairs.

        Scenario pairs consist of two cases that should produce similar
        predictions (if expected_relationship='equal') or predictions in
        a specific direction (if expected_relationship='greater' or 'less').

        Parameters
        ----------
        scenario_pairs : List[Tuple[pd.DataFrame, pd.DataFrame, str]]
            List of (scenario_a, scenario_b, description) tuples
        expected_relationship : str, default='equal'
            Expected relationship between predictions:
            - 'equal': predictions should be similar
            - 'greater': scenario_a should predict higher than scenario_b
            - 'less': scenario_a should predict lower than scenario_b

        Returns
        -------
        List[AuditResult]
            Results for each scenario pair
        """
        results = []

        for i, (scenario_a, scenario_b, description) in enumerate(scenario_pairs):
            test_name = f"Scenario Pair {i+1}"
            test_description = description

            logger.info(f"Testing: {description}")

            # Get predictions
            pred_a = self.predict_fn(scenario_a)
            pred_b = self.predict_fn(scenario_b)

            # Ensure arrays
            if np.isscalar(pred_a):
                pred_a = np.array([pred_a])
            if np.isscalar(pred_b):
                pred_b = np.array([pred_b])

            diff = pred_a - pred_b

            # Determine if test passed based on expected relationship
            if expected_relationship == 'equal':
                # Predictions should be similar (within 0.05)
                passed = np.all(np.abs(diff) < 0.05)
                interpretation = (
                    f"Predictions {'match expected equality' if passed else 'differ unexpectedly'}: "
                    f"scenario_a={pred_a[0]:.3f}, scenario_b={pred_b[0]:.3f}, "
                    f"diff={diff[0]:.3f}"
                )
            elif expected_relationship == 'greater':
                passed = np.all(diff > 0)
                interpretation = (
                    f"Predictions {'match expected order' if passed else 'violate expected order'}: "
                    f"scenario_a={pred_a[0]:.3f} {'>' if passed else '≤'} "
                    f"scenario_b={pred_b[0]:.3f}"
                )
            elif expected_relationship == 'less':
                passed = np.all(diff < 0)
                interpretation = (
                    f"Predictions {'match expected order' if passed else 'violate expected order'}: "
                    f"scenario_a={pred_a[0]:.3f} {'<' if passed else '≥'} "
                    f"scenario_b={pred_b[0]:.3f}"
                )
            else:
                raise ValueError(
                    f"Unknown expected_relationship: {expected_relationship}"
                )

            results.append(AuditResult(
                test_name=test_name,
                test_description=test_description,
                passed=passed,
                test_statistic=None,
                p_value=None,
                effect_size=float(np.mean(diff)),
                details={
                    'pred_a': float(pred_a[0]),
                    'pred_b': float(pred_b[0]),
                    'diff': float(diff[0])
                },
                interpretation=interpretation
            ))

        return results

    def audit_unexpected_associations(
        self,
        X: pd.DataFrame,
        y: np.ndarray,
        test_features: List[str],
        control_features: List[str]
    ) -> List[AuditResult]:
        """
        Test for unexpected associations between predictions and features.

        Uses partial correlation analysis to test whether test_features are
        associated with predictions after controlling for control_features.
        Strong associations may indicate problematic proxy discrimination.

        Parameters
        ----------
        X : pd.DataFrame
            Feature matrix
        y : np.ndarray
            True outcomes
        test_features : List[str]
            Features to test for unexpected associations
        control_features : List[str]
            Features to control for in partial correlation analysis

        Returns
        -------
        List[AuditResult]
            Results for each tested feature
        """
        results = []

        # Get model predictions
        predictions = self.predict_fn(X)

        for test_feature in test_features:
            test_name = f"Association Test: {test_feature}"
            test_description = (
                f"Testing if {test_feature} is associated with predictions "
                f"after controlling for {', '.join(control_features)}"
            )

            logger.info(f"Testing: {test_description}")

            # Compute partial correlation
            # This is a simplified version; production systems should use
            # more sophisticated partial correlation methods

            # Fit linear model with control features to predict test feature
            from sklearn.linear_model import LinearRegression

            control_X = X[control_features].values
            test_y = X[test_feature].values

            # Residualize test feature
            model_feature = LinearRegression()
            model_feature.fit(control_X, test_y)
            test_residuals = test_y - model_feature.predict(control_X)

            # Residualize predictions
            model_pred = LinearRegression()
            model_pred.fit(control_X, predictions)
            pred_residuals = predictions - model_pred.predict(control_X)

            # Compute correlation between residuals
            corr, p_value = stats.pearsonr(test_residuals, pred_residuals)

            # Test passes if correlation is not significant or very small
            passed = p_value > self.alpha or abs(corr) < 0.1

            if passed:
                interpretation = (
                    f"{test_feature} is not significantly associated with "
                    f"predictions after controlling for clinical features "
                    f"(r={corr:.3f}, p={p_value:.4f})"
                )
            else:
                interpretation = (
                    f"{test_feature} is significantly associated with "
                    f"predictions after controlling for clinical features "
                    f"(r={corr:.3f}, p={p_value:.4f}). This suggests the "
                    "model may be using this feature as a proxy in ways that "
                    "warrant clinical review."
                )

            results.append(AuditResult(
                test_name=test_name,
                test_description=test_description,
                passed=passed,
                test_statistic=corr,
                p_value=p_value,
                effect_size=corr,
                details={
                    'partial_correlation': corr,
                    'n_samples': len(X)
                },
                interpretation=interpretation
            ))

        return results

    def generate_audit_report(self) -> str:
        """Generate comprehensive audit report."""

        lines = ["=" * 80]
        lines.append("MODEL BIAS AUDIT REPORT")
        lines.append("=" * 80)
        lines.append("")

        # Summary
        total_tests = len(self.audit_results)
        passed_tests = sum(1 for r in self.audit_results if r.passed)
        failed_tests = total_tests - passed_tests

        lines.append("SUMMARY")
        lines.append(f"  Total tests conducted: {total_tests}")
        lines.append(f"  Tests passed: {passed_tests}")
        lines.append(f"  Tests failed: {failed_tests}")

        if failed_tests > 0:
            lines.append("\n  ⚠️  Model failed some bias audits")
        else:
            lines.append("\n  ✓ Model passed all bias audits")

        # Detailed results
        lines.append("\n" + "=" * 80)
        lines.append("DETAILED RESULTS")
        lines.append("=" * 80)

        for result in self.audit_results:
            lines.append(f"\n{result.test_name}")
            lines.append(f"  {result.test_description}")
            lines.append(f"  Status: {'PASSED' if result.passed else 'FAILED'}")

            if result.p_value is not None:
                lines.append(f"  P-value: {result.p_value:.4f}")
            if result.effect_size is not None:
                lines.append(f"  Effect size: {result.effect_size:.3f}")

            lines.append(f"\n  Interpretation:")
            lines.append(f"  {result.interpretation}")

            if result.details:
                lines.append("\n  Additional details:")
                for key, value in result.details.items():
                    if isinstance(value, float):
                        lines.append(f"    {key}: {value:.4f}")
                    else:
                        lines.append(f"    {key}: {value}")

        # Recommendations
        lines.append("\n" + "=" * 80)
        lines.append("RECOMMENDATIONS")
        lines.append("=" * 80)

        if failed_tests == 0:
            lines.append(
                "Model passed all bias audits. Continue monitoring in production "
                "and conduct periodic re-audits to ensure fairness properties are "
                "maintained as data distributions evolve."
            )
        else:
            lines.append(
                "Model failed some bias audits. Recommended actions:"
            )
            lines.append(
                "  1. Clinical review of failed tests to determine if associations "
                "are medically justified"
            )
            lines.append(
                "  2. Consider removing or transforming problematic features"
            )
            lines.append(
                "  3. Implement fairness constraints during model training"
            )
            lines.append(
                "  4. Conduct additional validation in deployment setting"
            )
            lines.append(
                "  5. Implement enhanced monitoring for identified biases"
            )

        return "\n".join(lines)

This model auditing framework enables systematic bias testing through multiple complementary approaches. Counterfactual analysis tests whether protected attribute changes produce prediction differences that suggest encoded bias. Scenario-based testing uses clinically informed test cases to probe specific fairness concerns. Association testing identifies features that predict model outputs in ways not explained by clinical factors, potentially indicating proxy discrimination. Together these approaches provide comprehensive bias detection that goes beyond aggregate metrics to uncover specific harmful associations.

13.5 Designing Adequately Powered Validation Studies

Detecting bias requires validation studies with sufficient statistical power to identify meaningful disparities across demographic subgroups. Standard approaches to sample size calculation focus on overall model performance, often leaving validation studies dramatically underpowered for detecting fairness violations. This section develops methods for planning validation studies that can reliably detect bias while avoiding excessive false positives from multiple testing.

Standard approaches to validation study design focus on estimating overall model performance metrics like AUC or calibration with sufficient precision. A common rule of thumb requires at least ten events per predictor variable for logistic regression models, or validation cohorts of several hundred patients for reasonable confidence intervals on AUC. However, these sample size guidelines entirely ignore the need to assess fairness across demographic subgroups, implicitly assuming that validation cohorts representative of target populations will automatically surface fairness concerns if they exist.

This assumption is dangerously optimistic. Detecting meaningful disparities in model performance across subgroups requires adequate sample sizes within each subgroup, not just overall. If a demographic group comprises only five percent of a validation cohort, achieving adequate power to detect performance differences for this group requires overall validation samples twenty times larger than guidelines based solely on overall performance precision. Many validation studies inadvertently proceed with sample sizes that provide excellent power for estimating overall performance but entirely inadequate power for detecting disparities affecting minority populations most vulnerable to algorithmic bias.

Consider a concrete example. Suppose we want to validate a diagnostic model with AUC of approximately 0.8 overall, and we want to detect if AUC differs by at least 0.05 for a specific demographic subgroup with 90 percent power at alpha = 0.05. Standard power calculations for comparing two AUC values suggest we need approximately 400 positive cases and 400 negative cases in each group we want to compare, totaling 1600 patients if groups are equal size. However, if the subgroup of concern comprises only 10 percent of the population, we need a total validation cohort of 8000 patients to achieve this same power for detecting disparities, far larger than typical validation studies.

The situation becomes even more challenging when we want to test multiple fairness metrics across multiple demographic groups and intersectional identities. Testing 10 subgroups on 3 fairness metrics involves 30 hypothesis tests, requiring corrections for multiple testing that further increase required sample sizes. Without explicit power calculations accounting for fairness evaluation, validation studies may appear adequately sized for overall performance assessment while having minimal probability of detecting significant disparities even if substantial bias exists.

Furthermore, disparities that are statistically detectable may not be clinically meaningful, or conversely, clinically meaningful disparities may exist despite non-significant statistical tests. Defining the minimum disparity magnitude that warrants concern requires clinical judgment about the downstream consequences of model errors. A 5 percent difference in sensitivity might be negligible for a screening test with low stakes but highly consequential for diagnosis of life-threatening conditions where missed cases lead to substantial morbidity and mortality. Validation studies must be designed to detect disparities of magnitudes that matter for the specific clinical application, not arbitrary statistical thresholds.

We now develop practical approaches for validation study design that account for fairness evaluation, including sample size calculations stratified by demographics, adjustment for multiple testing, and frameworks for defining clinically meaningful disparity thresholds.

Power calculations for fairness metrics depend on the specific metric, expected performance levels, and disparity magnitude we want to detect. For sensitivity differences between two groups, we can use standard two-proportion z-tests. The required sample size per group for detecting a difference $\delta$ in sensitivity with power $1-\beta$ at significance level $\alpha$ is approximately:

$$ n = \frac{(z_{1-\alpha/2} + z_{1-\beta})^2 (\bar{p}(1-\bar{p}))}{\delta^2} $$

where $\bar{p}$ is the average sensitivity across groups and $z$ denotes standard normal quantiles. For AUC comparisons, power calculations are more complex but can be approximated using methods developed by DeLong and colleagues that account for AUC variance and correlation between curves estimated on the same test set.

For calibration metrics including expected calibration error, power calculations must account for calibration bin structure and error distributions. Simulation-based power analysis provides flexible approaches that accommodate complex calibration metrics and validation study designs while properly accounting for clustering, missing data, and other practical complications.

Comprehensive validation for fairness across multiple subgroups and metrics requires corrections for multiple testing to control family-wise error rates or false discovery rates. Without such corrections, we expect to find spurious significant disparities purely by chance when testing many hypotheses. Bonferroni corrections provide conservative control of family-wise error rates by testing each hypothesis at level $\alpha/m$ where $m$ is the number of tests, but may be overly conservative when tests are not independent. False discovery rate controlling procedures including Benjamini-Hochberg provide less conservative approaches that may be more appropriate when we expect some true disparities to exist and want to balance discovery against false positives.

An alternative to multiple testing corrections is hierarchical testing strategies that first test for any disparity across groups using omnibus tests, then conduct pairwise comparisons only if the omnibus test is significant. This approach focuses power on detecting whether disparities exist rather than precisely localizing them to specific subgroups, which may be more appropriate when the primary goal is determining whether a model is suitable for deployment rather than understanding exact patterns of bias.

13.6 Integrating Bias Detection into Development Pipelines

Identifying bias after model development is complete makes remediation expensive and disruptive, potentially requiring substantial rework of modeling approaches, data collection, or even reconsideration of whether a model-based solution is appropriate for the application. Integrating bias detection throughout the model development lifecycle enables early identification of fairness concerns when they are most tractable to address, shifts organizational culture toward proactive fairness consideration, and provides documented evidence of fairness testing for regulatory and ethical review.

Modern machine learning development increasingly adopts software engineering practices including continuous integration, automated testing, and version control for models and data. Bias detection should be integrated into these workflows as automated fairness tests that run whenever models are retrained, preventing deployment of models that violate specified fairness requirements. This integration requires defining clear fairness requirements for specific applications, implementing automated tests that evaluate these requirements, establishing escalation procedures when tests fail, and maintaining comprehensive documentation of fairness evaluation throughout development.

The specific fairness requirements depend on the clinical application, stakeholder input, and regulatory context. For a diagnostic model, requirements might specify maximum acceptable disparities in sensitivity and specificity across demographic groups, minimum AUC in each subgroup, and maximum calibration error within groups. For a risk prediction model informing resource allocation, requirements might emphasize calibration within groups and predictive parity to ensure predictions are reliable and mean the same thing across populations. For a model supporting individual treatment decisions, requirements might focus on equal opportunity to ensure patients who would benefit from treatment have equal probability of being identified regardless of demographics.

Once requirements are defined, automated tests verify them on held-out validation data stratified to ensure adequate representation of key demographic subgroups. Tests should fail with clear error messages indicating which requirements were violated and for which groups, facilitating rapid diagnosis of fairness issues. Version control systems track fairness metrics over time, enabling detection of fairness degradation when models are retrained on new data or when architectural changes are made. Continuous monitoring in production supplements development-time testing by detecting fairness issues that emerge after deployment due to distributional shift or changing clinical practices.

When automated fairness tests fail, clear escalation procedures determine appropriate responses. Minor fairness violations might trigger warnings and require sign-off from senior developers acknowledging the issue and documenting justification for proceeding. Moderate violations might require model revision and re-evaluation before deployment. Severe violations should block deployment entirely and trigger comprehensive review of data, modeling approach, and whether the application is appropriate for machine learning.

Comprehensive documentation provides transparency for stakeholders and creates an auditable record for regulatory review. Documentation should include fairness requirements and their justification, test results from development and validation, investigation of fairness violations and remediation efforts, final fairness metrics achieved before deployment, and ongoing monitoring plans for production systems. This documentation serves not only compliance purposes but also facilitates organizational learning about bias detection and mitigation strategies across projects.

13.7 Conclusion and Key Takeaways

This chapter has developed comprehensive approaches for detecting bias in healthcare AI systems through multiple complementary techniques operating at different levels of analysis. The fundamental insight is that aggregate performance metrics are insufficient for ensuring fairness because models can achieve strong average performance while exhibiting severe disparities across patient subgroups. Rigorous bias detection requires explicit evaluation of fairness metrics stratified by protected attributes and social determinants, analysis of whether models rely differently on features across demographic groups, systematic auditing for specific harmful associations, and validation studies designed with adequate power to detect meaningful disparities.

The fairness metrics framework establishes that multiple valid but conflicting definitions of fairness exist, requiring explicit choices about priorities for specific healthcare applications. Demographic parity, equalized odds, calibration within groups, and predictive parity each formalize distinct ethical intuitions about fair treatment, and impossibility results prove they cannot all be satisfied simultaneously. Healthcare applications must prioritize fairness criteria based on clinical context, with calibration often essential because clinical decisions rely on probability estimates, and equalized odds important for distributing error burdens equally across populations.

Proxy discrimination detection reveals how models can encode bias through features correlated with protected attributes even when those attributes are not directly used as inputs. Feature importance analysis stratified by demographics, SHAP value distributions, and correlation analysis identify features that may serve as problematic proxies, though interpreting results requires clinical expertise to distinguish legitimate medical associations from encoded discrimination.

Model auditing through counterfactual analysis, scenario-based testing, and association tests provides additional perspectives on bias that complement aggregate metrics and feature analysis. Counterfactuals examine whether predictions change inappropriately when protected attributes are modified. Scenario-based tests use clinically informed test cases to probe specific fairness concerns. Association tests identify features predicting model outputs in ways not explained by clinical factors.

Adequately powered validation studies require explicit sample size calculations accounting for fairness evaluation across demographic subgroups, not just overall performance estimation. Without such calculations, validation studies may be entirely inadequate for detecting meaningful disparities despite appearing well-sized for overall performance assessment. Multiple testing corrections or hierarchical testing strategies control false positive rates when testing many fairness hypotheses simultaneously.

Integrating bias detection into development pipelines through automated fairness tests, clear escalation procedures, and comprehensive documentation shifts organizational culture toward proactive fairness consideration and prevents deployment of biased models. This integration requires defining application-specific fairness requirements, implementing automated tests, and maintaining transparency through documentation suitable for stakeholders and regulators.

Several critical principles emerge from this work. First, bias detection is necessary but not sufficient for fairness. Identifying disparities is only the first step toward addressing them through improved data collection, modified modeling approaches, or fairness-constrained optimization. Second, no single technique provides complete bias detection. Comprehensive evaluation requires multiple complementary approaches examining model behavior from different perspectives. Third, interpreting bias detection results requires substantial clinical expertise and cannot be fully automated. Statistical tests identify concerning patterns, but determining whether they reflect legitimate medical relationships versus problematic discrimination requires contextual judgment. Fourth, fairness requirements must be defined explicitly for specific applications based on stakeholder input, clinical context, and ethical principles rather than adopted as one-size-fits-all constraints. Finally, bias detection is an ongoing process rather than a one-time evaluation. Models must be monitored in production for emerging fairness issues due to distributional shift, changing clinical practices, or feedback loops where biased predictions affect future data collection.

The stakes are particularly high in healthcare applications affecting underserved populations. When models systematically fail certain patient groups, they can perpetuate and amplify existing disparities in access to care, quality of treatment, and health outcomes. Rigorous bias detection is therefore not merely a technical requirement but an ethical imperative in the pursuit of health equity through AI. The methods developed in this chapter provide practitioners with comprehensive tools for identifying when models exhibit problematic bias, but achieving fairness ultimately requires commitment to addressing identified issues rather than simply documenting them.

Bibliography

Agarwal, A., Beygelzimer, A., Dudik, M., Langford, J., & Wallach, H. (2018). A reductions approach to fair classification. Proceedings of the 35th International Conference on Machine Learning, 80, 60-69. http://proceedings.mlr.press/v80/agarwal18a.html

Barocas, S., Hardt, M., & Narayanan, A. (2019). Fairness and Machine Learning: Limitations and Opportunities. MIT Press. https://fairmlbook.org

Bellamy, R. K., Dey, K., Hind, M., Hoffman, S. C., Houde, S., Kannan, K., … & Zhang, Y. (2019). AI Fairness 360: An extensible toolkit for detecting and mitigating algorithmic bias. IBM Journal of Research and Development, 63(4/5), 4-1. https://doi.org/10.1147/JRD.2019.2942287

Buolamwini, J., & Gebru, T. (2018). Gender shades: Intersectional accuracy disparities in commercial gender classification. Proceedings of the 1st Conference on Fairness, Accountability and Transparency, 77-91. http://proceedings.mlr.press/v81/buolamwini18a.html

Cabrera, Á. A., Epperson, W., Hohman, F., Kahng, M., Morgenstern, J., & Chau, D. H. (2019). FairVis: Visual analytics for discovering intersectional bias in machine learning. 2019 IEEE Conference on Visual Analytics Science and Technology (VAST), 46-56. https://doi.org/10.1109/VAST47406.2019.8986948

Chen, I. Y., Johansson, F. D., & Sontag, D. (2018). Why is my classifier discriminatory? Advances in Neural Information Processing Systems, 31, 3543-3554. https://proceedings.neurips.cc/paper/2018/file/1f1baa5b8edac74eb4eaa329f14a0361-Paper.pdf

Chen, I. Y., Szolovits, P., & Ghassemi, M. (2019). Can AI help reduce disparities in general medical and mental health care? AMA Journal of Ethics, 21(2), 167-179. https://doi.org/10.1001/amajethics.2019.167

Chouldechova, A. (2017). Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big Data, 5(2), 153-163. https://doi.org/10.1089/big.2016.0047

Chouldechova, A., & Roth, A. (2018). The frontiers of fairness in machine learning. arXiv preprint arXiv:1810.08810. https://arxiv.org/abs/1810.08810

Corbett-Davies, S., & Goel, S. (2018). The measure and mismeasure of fairness: A critical review of fair machine learning. arXiv preprint arXiv:1808.00023. https://arxiv.org/abs/1808.00023

Dwork, C., Hardt, M., Pitassi, T., Reingold, O., & Zemel, R. (2012). Fairness through awareness. Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, 214-226. https://doi.org/10.1145/2090236.2090255

Feldman, M., Friedler, S. A., Moeller, J., Scheidegger, C., & Venkatasubramanian, S. (2015). Certifying and removing disparate impact. Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 259-268. https://doi.org/10.1145/2783258.2783311

Garg, N., Schiebinger, L., Jurafsky, D., & Zou, J. (2018). Word embeddings quantify 100 years of gender and ethnic stereotypes. Proceedings of the National Academy of Sciences, 115(16), E3635-E3644. https://doi.org/10.1073/pnas.1720347115

Gianfrancesco, M. A., Tamang, S., Yazdany, J., & Schmajuk, G. (2018). Potential biases in machine learning algorithms using electronic health record data. JAMA Internal Medicine, 178(11), 1544-1547. https://doi.org/10.1001/jamainternmed.2018.3763

Hardt, M., Price, E., & Srebro, N. (2016). Equality of opportunity in supervised learning. Advances in Neural Information Processing Systems, 29, 3315-3323. https://proceedings.neurips.cc/paper/2016/file/9d2682367c3935defcb1f9e247a97c0d-Paper.pdf

Kamiran, F., & Calders, T. (2012). Data preprocessing techniques for discrimination prevention. Knowledge and Information Systems, 33(1), 1-33. https://doi.org/10.1007/s10115-011-0463-8

Kleinberg, J., Ludwig, J., Mullainathan, S., & Rambachan, A. (2018). Algorithmic fairness. AEA Papers and Proceedings, 108, 22-27. https://doi.org/10.1257/pandp.20181018

Kleinberg, J., Mullainathan, S., & Raghavan, M. (2017). Inherent trade-offs in the fair determination of risk scores. Proceedings of the 8th Innovations in Theoretical Computer Science Conference, 43, 1-23. https://doi.org/10.4230/LIPIcs.ITCS.2017.43

Kusner, M. J., Loftus, J., Russell, C., & Silva, R. (2017). Counterfactual fairness. Advances in Neural Information Processing Systems, 30, 4066-4076. https://proceedings.neurips.cc/paper/2017/file/a486cd07e4ac3d270571622f4f316ec5-Paper.pdf

Liu, L. T., Dean, S., Rolf, E., Simchowitz, M., & Hardt, M. (2018). Delayed impact of fair machine learning. Proceedings of the 35th International Conference on Machine Learning, 80, 3150-3158. http://proceedings.mlr.press/v80/liu18c.html

Lundberg, S. M., & Lee, S. I. (2017). A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems, 30, 4765-4774. https://proceedings.neurips.cc/paper/2017/file/8a20a8621978632d76c43dfd28b67767-Paper.pdf

Mitchell, S., Potash, E., Barocas, S., D’Amour, A., & Lum, K. (2021). Algorithmic fairness: Choices, assumptions, and definitions. Annual Review of Statistics and Its Application, 8, 141-163. https://doi.org/10.1146/annurev-statistics-042720-125902

Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K., & Galstyan, A. (2021). A survey on bias and fairness in machine learning. ACM Computing Surveys, 54(6), 1-35. https://doi.org/10.1145/3457607

Obermeyer, Z., Powers, B., Vogeli, C., & Mullainathan, S. (2019). Dissecting racial bias in an algorithm used to manage the health of populations. Science, 366(6464), 447-453. https://doi.org/10.1126/science.aax2342

Paulus, J. K., & Kent, D. M. (2020). Predictably unequal: understanding and addressing concerns that algorithmic clinical prediction may increase health disparities. npj Digital Medicine, 3(1), 1-8. https://doi.org/10.1038/s41746-020-0304-9

Pierson, E., Cutler, D. M., Leskovec, J., Mullainathan, S., & Obermeyer, Z. (2021). An algorithmic approach to reducing unexplained pain disparities in underserved populations. Nature Medicine, 27(1), 136-140. https://doi.org/10.1038/s41591-020-01192-7

Rajkomar, A., Hardt, M., Howell, M. D., Corrado, G., & Chin, M. H. (2018). Ensuring fairness in machine learning to advance health equity. Annals of Internal Medicine, 169(12), 866-872. https://doi.org/10.7326/M18-1990

Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). “Why should I trust you?” Explaining the predictions of any classifier. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1135-1144. https://doi.org/10.1145/2939672.2939778

Selbst, A. D., Boyd, D., Friedler, S. A., Venkatasubramanian, S., & Vertesi, J. (2019). Fairness and abstraction in sociotechnical systems. Proceedings of the Conference on Fairness, Accountability, and Transparency, 59-68. https://doi.org/10.1145/3287560.3287598

Suresh, H., & Guttag, J. V. (2021). A framework for understanding sources of harm throughout the machine learning life cycle. Proceedings of the 1st ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization, 1-9. https://doi.org/10.1145/3465416.3483305

Ustun, B., & Rudin, C. (2016). Supersparse linear integer models for optimized medical scoring systems. Machine Learning, 102(3), 349-391. https://doi.org/10.1007/s10994-015-5528-6

Verma, S., & Rubin, J. (2018). Fairness definitions explained. Proceedings of the International Workshop on Software Fairness, 1-7. https://doi.org/10.1145/3194770.3194776

Vyas, D. A., Eisenstein, L. G., & Jones, D. S. (2020). Hidden in plain sight—reconsidering the use of race correction in clinical algorithms. New England Journal of Medicine, 383(9), 874-882. https://doi.org/10.1056/NEJMms2004740

Wachter, S., Mittelstadt, B., & Russell, C. (2021). Why fairness cannot be automated: Bridging the gap between EU non-discrimination law and AI. Computer Law & Security Review, 41, 105567. https://doi.org/10.1016/j.clsr.2021.105567

Zafar, M. B., Valera, I., Gomez Rodriguez, M., & Gummadi, K. P. (2017). Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment. Proceedings of the 26th International Conference on World Wide Web, 1171-1180. https://doi.org/10.1145/3038912.3052660

Zhang, B. H., Lemoine, B., & Mitchell, M. (2018). Mitigating unwanted biases with adversarial learning. Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, 335-340. https://doi.org/10.1145/3278721.3278779

Zou, J., & Schiebinger, L. (2018). AI can be sexist and racist—it’s time to make it fair. Nature, 559(7714), 324-326. https://doi.org/10.1038/d41586-018-05707-8