Chapter 2: Mathematical Foundations for Clinical AI
Learning Objectives
By the end of this chapter, readers will be able to:
- Apply linear algebra concepts including vector spaces, matrix operations, eigendecompositions, and singular value decomposition to healthcare data problems, with specific understanding of how these mathematical structures can either reinforce or challenge existing health disparities.
- Develop probabilistic reasoning for clinical decision-making under uncertainty, including proper application of Bayes’ theorem when prior probabilities reflect historical discrimination rather than biological differences, and understanding of conditional independence assumptions that may fail in the presence of unmeasured confounders like structural racism.
- Implement bias-aware statistical inference methods that account for systematic differences in data collection quality across populations, including approaches for handling missing data mechanisms that correlate with social determinants of health.
- Design and validate mathematical models that explicitly quantify uncertainty in ways that differ across patient subgroups, recognizing when standard assumptions about error distributions or independence break down in real healthcare settings serving diverse populations.
- Critically evaluate mathematical formulations of fairness and equity in machine learning, understanding the tradeoffs between different fairness definitions and their appropriateness for specific healthcare applications affecting underserved communities.
2.1 Introduction: Why Mathematics Matters for Health Equity
Mathematics provides the formal language through which we express relationships in data, quantify uncertainty, and optimize decisions. In healthcare artificial intelligence, mathematical choices shape everything from how we represent patient similarity to how we define what it means for an algorithm to be fair. These choices are never merely technical but rather encode assumptions about the world that have profound implications for health equity.
Consider the seemingly simple task of measuring similarity between patients to identify those who might benefit from similar treatments. The mathematical framework we choose for this measurement determines which patient characteristics we treat as comparable and which we treat as fundamentally different. If we represent patients as points in a high-dimensional space and use Euclidean distance to measure similarity, we implicitly assume that all features contribute equally to meaningful clinical similarity and that the relationships between features are linear. These assumptions may work reasonably well for some patient populations but fail catastrophically for others.
A concrete example illustrates the stakes. Suppose we are developing a system to identify patients similar to a given individual for the purpose of predicting treatment response based on outcomes in similar patients. We represent each patient as a vector containing age, body mass index, blood pressure, hemoglobin A1c, and several other clinical measurements. For a middle-aged patient with well-controlled diabetes and good healthcare access, this representation may work well because their clinical trajectory is largely determined by the measured biomedical variables. But for a patient facing housing instability, food insecurity, and transportation barriers to care, the measured clinical variables capture only a small part of what determines their health trajectory. The unmeasured social factors may be far more important than the measured clinical ones, yet our mathematical similarity metric knows nothing about them.
The result is that our patient similarity algorithm will suggest treatment strategies based on patients whose clinical measurements are similar but whose actual circumstances are profoundly different. The recommendations may be inappropriate or even harmful, not because the mathematics was wrong in some abstract sense but because the mathematical framework was built on assumptions that don’t hold for patients whose lives are shaped by social and structural factors that standard clinical data fails to capture.
This chapter develops mathematical foundations with sustained attention to how mathematical choices interact with health equity considerations. We begin with linear algebra because matrix operations pervade healthcare data analysis, from basic data transformations to sophisticated dimensionality reduction methods. But rather than presenting linear algebra in isolation, we ground every concept in healthcare applications and examine how the mathematical structures can either reveal or obscure health disparities. We then develop probability theory as the foundation for reasoning under uncertainty, but we do so while acknowledging that probability distributions in healthcare data often reflect social processes rather than natural phenomena, and that standard probabilistic assumptions frequently fail in ways that systematically disadvantage certain populations.
Throughout, we implement production-ready code that demonstrates not just how to perform mathematical operations but how to do so while maintaining critical awareness of their limitations and biases. The implementations include extensive validation and sensitivity analyses that surface when mathematical assumptions break down, with particular attention to whether these failures affect different patient populations differently. By the end of this chapter, you will have both the mathematical sophistication needed for advanced healthcare AI work and the critical lens needed to ensure that mathematical elegance doesn’t come at the cost of health equity.
2.2 Linear Algebra for Healthcare Data
Linear algebra provides the mathematical framework for representing and manipulating healthcare data in ways that enable both human interpretation and algorithmic processing. Vectors represent individual patients or clinical measurements, matrices organize collections of observations or relationships between variables, and linear transformations enable us to change perspectives on data or reduce dimensionality while preserving important structure. Understanding linear algebra deeply means understanding not just the mechanics of matrix operations but also what these operations mean in healthcare contexts and how they can introduce or reveal bias.
2.2.1 Vector Spaces and Patient Representation
At the most fundamental level, we represent each patient as a point in a high-dimensional space where each dimension corresponds to a measured or derived feature. This vector representation enables us to apply geometric intuitions about distance, angles, and projections to reason about patient similarity, cluster structure, and predictive relationships. However, the choice of how to construct these vector representations involves numerous decisions that affect whether the resulting mathematical structure serves or undermines health equity goals.
Consider a patient represented by clinical measurements including age, systolic blood pressure, serum creatinine, body mass index, and hemoglobin A1c. We can write this patient as a vector in five-dimensional space. However, the raw measurements have different units and scales: age in years, blood pressure in millimeters of mercury, creatinine in milligrams per deciliter, and so on. Computing distances between patients using raw measurements would mean that variables with larger numerical ranges would dominate the distance calculation, even if they are not more clinically important.
The standard mathematical solution is to standardize the features by subtracting the mean and dividing by the standard deviation, transforming each variable to have mean zero and unit variance. This standardization ensures that all features contribute comparably to distance calculations. But this seemingly neutral mathematical operation embeds important assumptions. When we standardize using the overall population mean and standard deviation, we implicitly assume that deviations from the population average are equally meaningful for all patients. Yet if different patient subgroups have systematically different distributions of these variables due to social determinants of health or differential access to care, standardizing to the overall population statistics may obscure rather than reveal clinically meaningful patterns.
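A quick numerical sketch, using invented measurements and assumed population standard deviations, shows how raw scales let a single variable dominate Euclidean distance:
import numpy as np

# Hypothetical feature order: [age (years), SBP (mmHg), creatinine (mg/dL), BMI, HbA1c (%)]
patient_a = np.array([54.0, 118.0, 0.9, 27.0, 6.8])
patient_b = np.array([56.0, 152.0, 1.0, 28.0, 7.0])  # differs mainly in blood pressure
patient_c = np.array([55.0, 120.0, 2.4, 27.5, 9.5])  # differs in creatinine and HbA1c

# On raw scales, blood pressure dominates: B appears far from A while C appears
# close, even though C differs far more in kidney function and glycemic control.
print(np.linalg.norm(patient_a - patient_b))  # ~34.1
print(np.linalg.norm(patient_a - patient_c))  # ~3.8

# Dividing each difference by an assumed population SD reverses the ordering.
sds = np.array([15.0, 15.0, 0.3, 5.0, 1.0])
print(np.linalg.norm((patient_a - patient_b) / sds))  # ~2.3
print(np.linalg.norm((patient_a - patient_c) / sds))  # ~5.7
Which population supplies those standard deviations is exactly the choice at issue here.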
Let me make this concrete with an implementation that demonstrates both standard vector operations and equity-aware alternatives:
"""
Patient Vector Representation with Equity Considerations
This module implements patient vector representations with explicit attention
to how standardization and distance metric choices can introduce bias when
patient populations have systematically different feature distributions.
"""
import numpy as np
import pandas as pd
from typing import Any, Dict, List, Optional, Tuple, Union
from dataclasses import dataclass
from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler
import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
@dataclass
class PatientVector:
"""
Represents a patient as a vector in feature space with metadata about
the patient's demographic characteristics and data quality.
Attributes:
patient_id: Unique patient identifier
features: Numeric feature vector
feature_names: Names corresponding to each feature dimension
demographic_group: Demographic stratification information
data_quality_score: Completeness and reliability score for this patient's data
missingness_indicators: Binary indicators for which features are imputed
"""
patient_id: str
features: np.ndarray
feature_names: List[str]
demographic_group: Optional[Dict[str, str]] = None
data_quality_score: Optional[float] = None
missingness_indicators: Optional[np.ndarray] = None
def __post_init__(self):
"""Validate that feature array dimensions match feature names."""
if len(self.features) != len(self.feature_names):
raise ValueError(
f"Feature array length ({len(self.features)}) must match "
f"feature names length ({len(self.feature_names)})"
)
class EquityAwareVectorSpace:
"""
Creates and manages vector representations of patients with explicit
attention to how standardization choices affect fairness across
patient populations with different feature distributions.
"""
def __init__(
self,
standardization_method: str = 'standard',
group_specific_scaling: bool = False
):
"""
Initialize vector space with configurable standardization approach.
Args:
standardization_method: Method for feature scaling
'standard': Mean centering and unit variance scaling
'robust': Median centering and IQR scaling (less sensitive to outliers)
'minmax': Scale to [0, 1] range
'none': No standardization
group_specific_scaling: Whether to use group-specific scaling parameters
If True, fits separate scalers for each demographic group
This preserves within-group variation patterns
"""
self.standardization_method = standardization_method
self.group_specific_scaling = group_specific_scaling
self.global_scaler: Optional[Union[StandardScaler, RobustScaler]] = None
self.group_scalers: Dict[str, Union[StandardScaler, RobustScaler]] = {}
self.feature_names: List[str] = []
logger.info(
f"Initialized vector space with {standardization_method} standardization, "
f"group_specific_scaling={group_specific_scaling}"
)
def fit(
self,
patient_df: pd.DataFrame,
feature_columns: List[str],
demographic_column: Optional[str] = None
) -> 'EquityAwareVectorSpace':
"""
Fit standardization parameters from training data with option for
group-specific standardization to preserve within-group patterns.
Args:
patient_df: DataFrame with patient data
feature_columns: Column names to use as features
demographic_column: Column for group-specific standardization
Returns:
Self for method chaining
"""
self.feature_names = feature_columns
X = patient_df[feature_columns].values
# Fit global scaler
if self.standardization_method == 'standard':
self.global_scaler = StandardScaler()
elif self.standardization_method == 'robust':
self.global_scaler = RobustScaler()
elif self.standardization_method == 'minmax':
self.global_scaler = MinMaxScaler()
elif self.standardization_method == 'none':
self.global_scaler = None
else:
raise ValueError(f"Unknown standardization method: {self.standardization_method}")
if self.global_scaler is not None:
self.global_scaler.fit(X)
# Fit group-specific scalers if requested
if self.group_specific_scaling and demographic_column is not None:
if demographic_column not in patient_df.columns:
logger.warning(
f"Demographic column {demographic_column} not found. "
f"Using global scaler only."
)
else:
for group in patient_df[demographic_column].unique():
if pd.notna(group):
group_data = patient_df[patient_df[demographic_column] == group]
X_group = group_data[feature_columns].values
if len(X_group) > 1: # Need at least 2 samples for scaling
if self.standardization_method == 'standard':
group_scaler = StandardScaler()
elif self.standardization_method == 'robust':
group_scaler = RobustScaler()
elif self.standardization_method == 'minmax':
group_scaler = MinMaxScaler()
else:
group_scaler = None
if group_scaler is not None:
group_scaler.fit(X_group)
self.group_scalers[str(group)] = group_scaler
logger.info(f"Fitted vector space on {len(patient_df)} patients")
if self.group_scalers:
logger.info(f"Created group-specific scalers for {len(self.group_scalers)} groups")
return self
def transform(
self,
patient_df: pd.DataFrame,
demographic_column: Optional[str] = None
) -> List[PatientVector]:
"""
Transform patients into vector representations using fitted scaling.
Args:
patient_df: DataFrame with patient data
demographic_column: Column with demographic group labels
Returns:
List of PatientVector objects
"""
if not self.feature_names:
raise ValueError("Must call fit() before transform()")
patient_vectors = []
for idx, row in patient_df.iterrows():
patient_id = str(row.get('patient_id', f'patient_{idx}'))
# Extract feature values
feature_values = row[self.feature_names].values.astype(float)
# Determine which scaler to use
demographic_group = None
if demographic_column and demographic_column in row:
demographic_group = {demographic_column: str(row[demographic_column])}
group_key = str(row[demographic_column])
else:
group_key = None
# Apply scaling
if self.group_specific_scaling and group_key in self.group_scalers:
# Use group-specific scaler
scaler = self.group_scalers[group_key]
scaled_features = scaler.transform(feature_values.reshape(1, -1)).flatten()
elif self.global_scaler is not None:
# Use global scaler
scaled_features = self.global_scaler.transform(
feature_values.reshape(1, -1)
).flatten()
else:
# No scaling
scaled_features = feature_values
# Track missingness
missingness_indicators = pd.isna(row[self.feature_names]).values.astype(int)
# Calculate data quality score
data_quality = 1.0 - missingness_indicators.mean()
patient_vector = PatientVector(
patient_id=patient_id,
features=scaled_features,
feature_names=self.feature_names,
demographic_group=demographic_group,
data_quality_score=data_quality,
missingness_indicators=missingness_indicators
)
patient_vectors.append(patient_vector)
return patient_vectors
def compute_distance_matrix(
self,
patient_vectors: List[PatientVector],
metric: str = 'euclidean',
weight_by_quality: bool = False
) -> Tuple[np.ndarray, np.ndarray]:
"""
Compute pairwise distances between patients with optional quality weighting.
Standard distance metrics can be misleading when data quality differs
systematically across populations. Quality weighting inflates distances
involving patients with poor data quality, so that unreliable pairs register as less similar.
Args:
patient_vectors: List of patient vector representations
metric: Distance metric ('euclidean', 'manhattan', 'cosine')
weight_by_quality: Whether to weight distances by data quality scores
Returns:
Tuple of (distance matrix, quality-adjusted distance matrix if weight_by_quality)
"""
n = len(patient_vectors)
distance_matrix = np.zeros((n, n))
quality_matrix = np.ones((n, n))
for i in range(n):
for j in range(i + 1, n):
# Compute base distance
if metric == 'euclidean':
dist = np.linalg.norm(
patient_vectors[i].features - patient_vectors[j].features
)
elif metric == 'manhattan':
dist = np.sum(np.abs(
patient_vectors[i].features - patient_vectors[j].features
))
elif metric == 'cosine':
dot_product = np.dot(
patient_vectors[i].features,
patient_vectors[j].features
)
norm_i = np.linalg.norm(patient_vectors[i].features)
norm_j = np.linalg.norm(patient_vectors[j].features)
if norm_i > 0 and norm_j > 0:
dist = 1.0 - (dot_product / (norm_i * norm_j))
else:
dist = 1.0 # Maximum distance if either vector is zero
else:
raise ValueError(f"Unknown metric: {metric}")
distance_matrix[i, j] = dist
distance_matrix[j, i] = dist
# Compute quality adjustment
if weight_by_quality:
# `or` would silently map a legitimate 0.0 quality score to 1.0, so test for None explicitly
quality_i = patient_vectors[i].data_quality_score
quality_i = 1.0 if quality_i is None else quality_i
quality_j = patient_vectors[j].data_quality_score
quality_j = 1.0 if quality_j is None else quality_j
quality_weight = np.sqrt(quality_i * quality_j)
quality_matrix[i, j] = quality_weight
quality_matrix[j, i] = quality_weight
if weight_by_quality:
# Dividing by the quality weight (<= 1) inflates distances for low-quality
# pairs, so unreliable comparisons register as less similar
adjusted_distance = distance_matrix / quality_matrix
return distance_matrix, adjusted_distance
else:
return distance_matrix, distance_matrix
def analyze_distance_bias(
self,
patient_vectors: List[PatientVector],
distance_matrix: np.ndarray,
demographic_attribute: str
) -> Dict[str, Any]:
"""
Analyze whether distance calculations are biased across demographic groups.
This checks whether patients from certain demographic groups tend to be
closer or farther from other patients in the vector space, which could
indicate that the feature representation or standardization is inappropriate.
Args:
patient_vectors: List of patient vectors
distance_matrix: Pairwise distance matrix
demographic_attribute: Which demographic attribute to analyze
Returns:
Dictionary with bias analysis results
"""
analysis = {
'mean_within_group_distance': {},
'mean_between_group_distance': {},
'group_cohesion_scores': {},
'bias_detected': False
}
# Extract demographic groups
groups = {}
for i, pv in enumerate(patient_vectors):
if pv.demographic_group and demographic_attribute in pv.demographic_group:
group = pv.demographic_group[demographic_attribute]
if group not in groups:
groups[group] = []
groups[group].append(i)
if len(groups) < 2:
logger.warning("Need at least 2 demographic groups for bias analysis")
return analysis
# Compute within-group distances
for group, indices in groups.items():
if len(indices) > 1:
within_distances = []
for i in range(len(indices)):
for j in range(i + 1, len(indices)):
within_distances.append(distance_matrix[indices[i], indices[j]])
analysis['mean_within_group_distance'][group] = np.mean(within_distances)
# Compute between-group distances
group_pairs = [(g1, g2) for g1 in groups for g2 in groups if g1 < g2]
for g1, g2 in group_pairs:
between_distances = []
for i in groups[g1]:
for j in groups[g2]:
between_distances.append(distance_matrix[i, j])
pair_key = f"{g1}_vs_{g2}"
analysis['mean_between_group_distance'][pair_key] = np.mean(between_distances)
# Compute cohesion scores (ratio of within to between distance)
for group in groups:
within_dist = analysis['mean_within_group_distance'].get(group, 0)
# Average between-group distance for this group
between_dists = []
for other_group in groups:
if other_group != group:
pair_key = f"{min(group, other_group)}_vs_{max(group, other_group)}"
if pair_key in analysis['mean_between_group_distance']:
between_dists.append(
analysis['mean_between_group_distance'][pair_key]
)
if between_dists:
avg_between = np.mean(between_dists)
if avg_between > 0:
cohesion = within_dist / avg_between
analysis['group_cohesion_scores'][group] = cohesion
# Check for bias: large differences in cohesion suggest representation issues
cohesion_values = list(analysis['group_cohesion_scores'].values())
if len(cohesion_values) >= 2:
cohesion_cv = np.std(cohesion_values) / np.mean(cohesion_values)
if cohesion_cv > 0.3: # Coefficient of variation > 30%
analysis['bias_detected'] = True
logger.warning(
f"Bias detected in vector space representation. "
f"Cohesion scores vary substantially across groups (CV={cohesion_cv:.3f}). "
f"Some groups may be poorly represented in this feature space."
)
return analysis
This implementation demonstrates several equity-focused approaches to vector representation. The group-specific scaling option allows us to standardize features separately within each demographic group, which preserves within-group variation patterns that might be clinically meaningful even though they differ across groups. The quality-weighted distance calculation accounts for the fact that distances involving patients with poor data quality may be less reliable. And the bias analysis functionality explicitly checks whether the vector space representation creates systematic differences in how clustered or dispersed different demographic groups appear, which could indicate that the feature representation doesn’t work equally well for all populations.
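The pieces above can be exercised end to end. The sketch below uses a synthetic cohort; all column names, values, and group labels are invented, and the code assumes the classes defined in this section:
rng = np.random.default_rng(0)
n = 200
cohort = pd.DataFrame({
    'patient_id': [f'p{i}' for i in range(n)],
    'age': rng.normal(55, 12, n),
    'sbp': rng.normal(125, 18, n),
    'hba1c': rng.normal(7.0, 1.2, n),
    'group': rng.choice(['A', 'B'], n),
})

space = EquityAwareVectorSpace(
    standardization_method='robust',
    group_specific_scaling=True
)
space.fit(cohort, feature_columns=['age', 'sbp', 'hba1c'], demographic_column='group')
vectors = space.transform(cohort, demographic_column='group')

distances, adjusted = space.compute_distance_matrix(
    vectors, metric='euclidean', weight_by_quality=True
)
report = space.analyze_distance_bias(vectors, distances, demographic_attribute='group')
print(report['group_cohesion_scores'], report['bias_detected'])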
2.2.2 Matrix Operations and Data Transformations
Matrices organize collections of patient vectors and enable batch operations that transform entire datasets. Understanding matrix operations means understanding how data transformations affect the structure of information and how these transformations can introduce or reveal bias. We focus on operations that commonly arise in healthcare data analysis including matrix multiplication for feature transformations, matrix inverses for solving linear systems, and matrix decompositions that reveal latent structure.
Matrix multiplication is perhaps the most fundamental operation, enabling us to transform data from one representation to another through linear combinations of features. When we multiply a data matrix by a transformation matrix, we are computing new features as weighted combinations of the original features. This operation is at the heart of dimensionality reduction methods, linear models, and neural network layers. However, the choice of transformation matrix embeds assumptions about which combinations of features are meaningful, and these assumptions may not hold equally well across diverse patient populations.
Consider a simple example where we want to create a composite measure of cardiovascular risk from multiple individual risk factors including blood pressure, cholesterol, smoking status, and diabetes. We can express this as matrix multiplication where each patient is a row in our data matrix and each risk factor is a column, and we multiply by a weight vector that defines how much each factor contributes to the composite score. The standard approach is to set weights based on epidemiological studies that estimated the association between each risk factor and cardiovascular events. But if those studies were conducted primarily in populations different from the ones we are now trying to serve, the weights may be inappropriate and could systematically misestimate risk for certain groups.
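In code, the composite score is simply a matrix-vector product. The weights below are hypothetical, chosen for illustration rather than taken from any published risk equation:
# Hypothetical composite cardiovascular risk score
risk_factors = ['sbp_std', 'ldl_std', 'smoker', 'diabetes']
weights = np.array([0.35, 0.30, 0.20, 0.15])

X_risk = np.array([
    [1.2, 0.8, 1.0, 0.0],   # patient 1 (standardized and binary features)
    [-0.5, 1.5, 0.0, 1.0],  # patient 2
])

composite = X_risk @ weights  # one composite score per patient
print(composite)  # [0.86, 0.425]
If weights like these were estimated in a cohort demographically unlike the deployment population, the resulting scores would be systematically miscalibrated for the new population.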
Let me implement matrix operations with explicit attention to these equity considerations:
class EquityAwareMatrixOperations:
"""
Matrix operations for healthcare data with explicit validation that
transformations don't introduce systematic bias across patient populations.
"""
def __init__(self):
"""Initialize matrix operations handler."""
self.transformation_history: List[Dict[str, Any]] = []
logger.info("Initialized equity-aware matrix operations")
def linear_combination_features(
self,
data_matrix: np.ndarray,
weights: np.ndarray,
feature_names: List[str],
new_feature_name: str,
demographic_labels: Optional[np.ndarray] = None
) -> Tuple[np.ndarray, Dict[str, Any]]:
"""
Create new features as linear combinations of existing features with
validation that the transformation works consistently across demographic groups.
Args:
data_matrix: (n_patients, n_features) data matrix
weights: (n_features,) weight vector for linear combination
feature_names: Names of input features
new_feature_name: Name for the newly created feature
demographic_labels: Optional (n_patients,) array of demographic group labels
Returns:
Tuple of (new feature vector, validation results)
"""
if data_matrix.shape[1] != len(weights):
raise ValueError(
f"Data matrix has {data_matrix.shape[1]} features but "
f"weight vector has {len(weights)} elements"
)
# Compute linear combination
new_feature = data_matrix @ weights
# Validate transformation
validation = {
'new_feature_name': new_feature_name,
'weight_vector': weights.tolist(),
'feature_names': feature_names,
'transformation_bias_detected': False
}
if demographic_labels is not None:
# Check if transformation affects groups differently
unique_groups = np.unique(demographic_labels[~pd.isna(demographic_labels)])
group_correlations = {}
for group in unique_groups:
group_mask = demographic_labels == group
# For each original feature, compute correlation with new feature
correlations = []
for j in range(data_matrix.shape[1]):
if np.std(data_matrix[group_mask, j]) > 0:
corr = np.corrcoef(
data_matrix[group_mask, j],
new_feature[group_mask]
)[0, 1]
correlations.append(corr)
group_correlations[str(group)] = {
'mean_correlation': np.mean(correlations),
'feature_correlations': dict(zip(feature_names, correlations))
}
validation['group_correlations'] = group_correlations
# Check for large differences in how features relate to composite
mean_corrs = [v['mean_correlation'] for v in group_correlations.values()]
if len(mean_corrs) > 1:
corr_cv = np.std(mean_corrs) / np.mean(np.abs(mean_corrs))
if corr_cv > 0.3:
validation['transformation_bias_detected'] = True
logger.warning(
f"Linear combination {new_feature_name} shows different "
f"relationships to input features across demographic groups. "
f"This may indicate the weights are not appropriate for all populations."
)
# Record transformation
self.transformation_history.append(validation)
return new_feature, validation
def center_data_matrix(
self,
data_matrix: np.ndarray,
demographic_labels: Optional[np.ndarray] = None,
group_specific: bool = False
) -> Tuple[np.ndarray, Dict[str, Any]]:
"""
Center data matrix by subtracting means with option for group-specific centering.
Standard centering subtracts the overall mean, which can obscure within-group
patterns when groups have systematically different means. Group-specific
centering preserves within-group variation structure.
Args:
data_matrix: (n_patients, n_features) data matrix
demographic_labels: Optional group labels for group-specific centering
group_specific: Whether to center within each group separately
Returns:
Tuple of (centered matrix, dictionary of centering parameters)
"""
if not group_specific or demographic_labels is None:
# Global centering
means = np.mean(data_matrix, axis=0)
centered = data_matrix - means
centering_params = {
'method': 'global',
'global_means': means
}
else:
# Group-specific centering
centered = np.zeros_like(data_matrix)
group_means = {}
unique_groups = np.unique(demographic_labels[~pd.isna(demographic_labels)])
for group in unique_groups:
group_mask = demographic_labels == group
group_data = data_matrix[group_mask]
if len(group_data) > 0:
group_mean = np.mean(group_data, axis=0)
centered[group_mask] = group_data - group_mean
group_means[str(group)] = group_mean
centering_params = {
'method': 'group_specific',
'group_means': group_means
}
return centered, centering_params
def compute_covariance_matrix(
self,
data_matrix: np.ndarray,
demographic_labels: Optional[np.ndarray] = None,
analyze_group_differences: bool = True
) -> Tuple[np.ndarray, Optional[Dict[str, Any]]]:
"""
Compute covariance matrix with optional analysis of whether covariance
structure differs across demographic groups.
Different covariance structures across groups indicate that relationships
between variables differ systematically, which has implications for
modeling approaches that assume homogeneous covariance.
Args:
data_matrix: (n_patients, n_features) centered data matrix
demographic_labels: Optional group labels
analyze_group_differences: Whether to test for covariance structure differences
Returns:
Tuple of (covariance matrix, optional analysis results)
"""
n_samples = data_matrix.shape[0]
# Compute overall covariance
cov_matrix = (data_matrix.T @ data_matrix) / (n_samples - 1)
analysis_results = None
if analyze_group_differences and demographic_labels is not None:
analysis_results = {
'group_covariances': {},
'covariance_homogeneity_test': {}
}
unique_groups = np.unique(demographic_labels[~pd.isna(demographic_labels)])
group_cov_matrices = {}
for group in unique_groups:
group_mask = demographic_labels == group
group_data = data_matrix[group_mask]
if len(group_data) > data_matrix.shape[1]: # Need n > p for valid covariance
group_cov = (group_data.T @ group_data) / (len(group_data) - 1)
group_cov_matrices[str(group)] = group_cov
# Store summary statistics
analysis_results['group_covariances'][str(group)] = {
'trace': np.trace(group_cov),
'determinant': np.linalg.det(group_cov) if group_cov.shape[0] > 0 else 0,
'frobenius_norm': np.linalg.norm(group_cov, 'fro')
}
# Test for homogeneity using Box's M test approximation
# This tests whether covariance matrices are equal across groups
if len(group_cov_matrices) >= 2:
# Compute pooled covariance
pooled_cov = cov_matrix
# Compare each group covariance to pooled
for group, group_cov in group_cov_matrices.items():
diff_norm = np.linalg.norm(group_cov - pooled_cov, 'fro')
pooled_norm = np.linalg.norm(pooled_cov, 'fro')
relative_difference = diff_norm / pooled_norm if pooled_norm > 0 else 0
analysis_results['covariance_homogeneity_test'][group] = {
'relative_difference': relative_difference
}
if relative_difference > 0.3:
logger.warning(
f"Group {group} has substantially different covariance structure "
f"(relative difference = {relative_difference:.3f}). "
f"Models assuming homogeneous covariance may perform poorly."
)
return cov_matrix, analysis_results
def compute_correlation_matrix(
self,
data_matrix: np.ndarray,
feature_names: List[str],
demographic_labels: Optional[np.ndarray] = None
) -> Tuple[np.ndarray, pd.DataFrame]:
"""
Compute correlation matrix with detailed reporting suitable for identifying
features that may serve as proxies for demographic characteristics.
Args:
data_matrix: (n_patients, n_features) data matrix
feature_names: Names of features
demographic_labels: Optional group labels
Returns:
Tuple of (correlation matrix, correlation analysis DataFrame)
"""
# Standardize features for correlation computation
data_std = (data_matrix - np.mean(data_matrix, axis=0)) / (
np.std(data_matrix, axis=0) + 1e-8
)
# Compute correlation matrix; np.std above uses ddof=0, so divide by n
# (not n - 1) to keep the diagonal at exactly 1
corr_matrix = (data_std.T @ data_std) / data_matrix.shape[0]
# Create detailed report
n_features = len(feature_names)
correlation_details = []
for i in range(n_features):
for j in range(i + 1, n_features):
correlation_details.append({
'feature_1': feature_names[i],
'feature_2': feature_names[j],
'correlation': corr_matrix[i, j],
'abs_correlation': abs(corr_matrix[i, j])
})
corr_df = pd.DataFrame(correlation_details)
corr_df = corr_df.sort_values('abs_correlation', ascending=False)
# Flag potentially problematic high correlations
high_corr = corr_df[corr_df['abs_correlation'] > 0.8]
if len(high_corr) > 0:
logger.info(
f"Found {len(high_corr)} feature pairs with |correlation| > 0.8. "
f"High correlations may indicate redundant features or proxy variables."
)
return corr_matrix, corr_df
This implementation provides matrix operations with built-in equity checking. The linear combination validation checks whether the transformation relates to input features differently across demographic groups, which could indicate that the chosen weights aren’t appropriate for all populations. The group-specific centering option preserves within-group correlation structure that global centering would erase. And the covariance analysis explicitly tests whether the covariance structure is homogeneous across groups, surfacing violations of this common modeling assumption.
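A small sketch with synthetic data, in which one group's feature means are shifted, illustrates the difference between global and group-specific centering:
ops = EquityAwareMatrixOperations()
rng = np.random.default_rng(1)
X_demo = np.vstack([rng.normal(0, 1, (100, 3)), rng.normal(2, 1, (100, 3))])
labels = np.array(['A'] * 100 + ['B'] * 100)

# Global centering leaves the between-group shift in the data; group-specific
# centering removes it and exposes within-group variation instead.
X_global, _ = ops.center_data_matrix(X_demo)
X_group, params = ops.center_data_matrix(
    X_demo, demographic_labels=labels, group_specific=True
)
print(X_global[labels == 'B'].mean(axis=0))  # roughly +1 in every column
print(X_group[labels == 'B'].mean(axis=0))   # roughly 0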
2.2.3 Eigendecomposition and Principal Component Analysis
Eigendecomposition reveals the intrinsic structure of matrices by finding directions of maximal variation and the scaling factors along those directions. In healthcare data analysis, eigendecomposition underlies principal components analysis, which finds low-dimensional representations that capture most of the variation in high-dimensional data. However, PCA and related methods can introduce subtle biases when different patient populations have different correlation structures, as the principal components may primarily capture variation patterns from the dominant population while poorly representing patterns specific to minority populations.
The mathematical foundation is elegant. Given a covariance matrix Σ, eigendecomposition finds vectors v and scalars λ such that Σv = λv. The eigenvector v defines a direction in feature space, and the eigenvalue λ quantifies the amount of variance along that direction. The largest eigenvalue corresponds to the direction of maximum variance, the second largest to the direction of maximum variance orthogonal to the first, and so on. Principal components analysis uses these eigenvectors as a new coordinate system in which to represent the data, with the hope that the first few principal components capture most of the meaningful variation.
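A two-by-two example makes the definition concrete; only NumPy is assumed:
# Verify the defining equation Σv = λv for a small covariance matrix
sigma = np.array([[2.0, 1.0],
                  [1.0, 2.0]])
eigenvalues, eigenvectors = np.linalg.eigh(sigma)  # eigenvalues in ascending order
print(eigenvalues)  # [1. 3.]

v = eigenvectors[:, -1]  # eigenvector for the largest eigenvalue
print(sigma @ v)         # equals eigenvalues[-1] * v up to floating point error
print(eigenvalues[-1] * v)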
The health equity concern arises when we apply PCA to data from heterogeneous populations. The principal components that best represent the dominant population may not best represent minority populations, especially if the minority populations have different correlation structures. The result is that dimensionality reduction may work well for some patients while losing important information for others. This differential information loss can then propagate through downstream modeling, leading to algorithms that perform better for well-represented populations.
Let me implement PCA with explicit equity safeguards:
from scipy.linalg import eigh
class EquityAwarePCA:
"""
Principal Components Analysis with explicit validation that dimensionality
reduction preserves information equitably across patient populations.
"""
def __init__(self, n_components: Optional[int] = None):
"""
Initialize equity-aware PCA.
Args:
n_components: Number of components to retain. If None, keeps all components.
"""
self.n_components = n_components
self.components_: Optional[np.ndarray] = None
self.explained_variance_: Optional[np.ndarray] = None
self.mean_: Optional[np.ndarray] = None
self.group_reconstruction_errors_: Dict[str, float] = {}
logger.info(f"Initialized equity-aware PCA with n_components={n_components}")
def fit(
self,
X: np.ndarray,
demographic_labels: Optional[np.ndarray] = None
) -> 'EquityAwarePCA':
"""
Fit PCA on data with optional analysis of reconstruction quality across groups.
Args:
X: (n_samples, n_features) data matrix
demographic_labels: Optional (n_samples,) array of group labels
Returns:
Self for method chaining
"""
# Center the data
self.mean_ = np.mean(X, axis=0)
X_centered = X - self.mean_
# Compute covariance matrix
cov_matrix = (X_centered.T @ X_centered) / (X.shape[0] - 1)
# Eigendecomposition
# eigh is for symmetric matrices and returns eigenvalues in ascending order
eigenvalues, eigenvectors = eigh(cov_matrix)
# Sort in descending order of eigenvalues
idx = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx]
# Store results, keeping total variance so explained-variance ratios remain
# correct when only a subset of components is retained
self.total_variance_ = np.sum(eigenvalues)
if self.n_components is not None:
self.components_ = eigenvectors[:, :self.n_components].T
self.explained_variance_ = eigenvalues[:self.n_components]
else:
self.components_ = eigenvectors.T
self.explained_variance_ = eigenvalues
# Analyze reconstruction quality across groups
if demographic_labels is not None:
self._analyze_group_reconstruction(X_centered, demographic_labels)
return self
def transform(self, X: np.ndarray) -> np.ndarray:
"""
Transform data to principal component space.
Args:
X: (n_samples, n_features) data matrix
Returns:
(n_samples, n_components) transformed data
"""
if self.components_ is None or self.mean_ is None:
raise ValueError("Must call fit() before transform()")
X_centered = X - self.mean_
return X_centered @ self.components_.T
def inverse_transform(self, X_transformed: np.ndarray) -> np.ndarray:
"""
Transform data back from principal component space to original space.
Args:
X_transformed: (n_samples, n_components) data in PC space
Returns:
(n_samples, n_features) reconstructed data
"""
if self.components_ is None or self.mean_ is None:
raise ValueError("Must call fit() before inverse_transform()")
return (X_transformed @ self.components_) + self.mean_
def _analyze_group_reconstruction(
self,
X_centered: np.ndarray,
demographic_labels: np.ndarray
) -> None:
"""
Analyze how well PCA reconstruction works for different demographic groups.
Poor reconstruction for some groups indicates that the principal components
don't capture the variation patterns in those groups well.
"""
unique_groups = np.unique(demographic_labels[~pd.isna(demographic_labels)])
# Transform and reconstruct data
X_transformed = X_centered @ self.components_.T
X_reconstructed = X_transformed @ self.components_
# Compute reconstruction error for each group
for group in unique_groups:
group_mask = demographic_labels == group
# Mean squared reconstruction error
reconstruction_error = np.mean(
np.sum((X_centered[group_mask] - X_reconstructed[group_mask])**2, axis=1)
)
self.group_reconstruction_errors_[str(group)] = reconstruction_error
# Check for disparities
errors = list(self.group_reconstruction_errors_.values())
if len(errors) > 1:
max_error = max(errors)
min_error = min(errors)
error_ratio = max_error / min_error if min_error > 0 else float('inf')
if error_ratio > 2.0:
logger.warning(
f"Large disparity in PCA reconstruction error across groups "
f"(ratio={error_ratio:.2f}). Principal components may not "
f"represent all populations equally well."
)
def get_explained_variance_ratio(self) -> np.ndarray:
"""
Get the proportion of variance explained by each principal component.
Returns:
Array of explained variance ratios
"""
if self.explained_variance_ is None:
raise ValueError("Must call fit() first")
return self.explained_variance_ / self.total_variance_
def plot_scree(self) -> None:
"""
Create scree plot showing variance explained by each component.
This helps determine how many components to retain.
"""
import matplotlib.pyplot as plt
if self.explained_variance_ is None:
raise ValueError("Must call fit() first")
variance_ratios = self.get_explained_variance_ratio()
cumulative_variance = np.cumsum(variance_ratios)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
# Individual variance
ax1.bar(range(1, len(variance_ratios) + 1), variance_ratios)
ax1.set_xlabel('Principal Component')
ax1.set_ylabel('Proportion of Variance Explained')
ax1.set_title('Scree Plot')
# Cumulative variance
ax2.plot(range(1, len(cumulative_variance) + 1), cumulative_variance, 'bo-')
ax2.axhline(y=0.8, color='r', linestyle='--', label='80% threshold')
ax2.axhline(y=0.9, color='g', linestyle='--', label='90% threshold')
ax2.set_xlabel('Number of Components')
ax2.set_ylabel('Cumulative Variance Explained')
ax2.set_title('Cumulative Variance')
ax2.legend()
ax2.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
def assess_equity_impact(self) -> Dict[str, Any]:
"""
Comprehensive assessment of whether PCA affects demographic groups equitably.
Returns:
Dictionary with equity assessment results
"""
if not self.group_reconstruction_errors_:
logger.warning("No group reconstruction errors available. Run fit() with demographic_labels.")
return {}
errors = list(self.group_reconstruction_errors_.values())
assessment = {
'group_reconstruction_errors': self.group_reconstruction_errors_,
'max_error': max(errors),
'min_error': min(errors),
'error_ratio': max(errors) / min(errors) if min(errors) > 0 else float('inf'),
'error_cv': np.std(errors) / np.mean(errors),
'equity_concern_detected': False
}
# Flag equity concerns
if assessment['error_ratio'] > 2.0 or assessment['error_cv'] > 0.3:
assessment['equity_concern_detected'] = True
assessment['recommendation'] = (
"Consider group-specific PCA or alternative dimensionality "
"reduction methods that better preserve information for all populations."
)
else:
assessment['recommendation'] = (
"PCA appears to work reasonably consistently across groups."
)
return assessment
This equity-aware PCA implementation explicitly tracks reconstruction quality across demographic groups, which reveals when dimensionality reduction works better for some populations than others. The reconstruction error analysis provides an early warning that the principal components may not be appropriate for all populations, allowing us to either adjust the number of components retained or consider alternative dimensionality reduction approaches that better preserve information for minority populations.
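A usage sketch on synthetic data shows the reconstruction check in action. The data are constructed so that the minority group varies along directions the majority group does not; all sizes and scales are invented:
rng = np.random.default_rng(2)
majority = rng.normal(0, 1, (900, 5)) * np.array([3.0, 2.0, 1.0, 0.1, 0.1])
minority = rng.normal(0, 1, (100, 5)) * np.array([0.1, 0.1, 0.1, 3.0, 2.0])
X_demo = np.vstack([majority, minority])
labels = np.array(['majority'] * 900 + ['minority'] * 100)

pca = EquityAwarePCA(n_components=2)
pca.fit(X_demo, demographic_labels=labels)
# Expect a large reconstruction-error ratio and an equity warning: the top two
# components align with the majority group's directions of variation
print(pca.assess_equity_impact())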
2.2.4 Singular Value Decomposition and Matrix Factorization
Singular value decomposition provides an alternative perspective on matrix structure that generalizes eigendecomposition to non-square matrices and connects naturally to many machine learning methods including collaborative filtering, latent semantic analysis, and matrix completion. SVD factorizes a data matrix X into three matrices: X = UΣV^T, where U contains left singular vectors, Σ is a diagonal matrix of singular values, and V contains right singular vectors. This decomposition reveals latent structure in data and enables low-rank approximations that can denoise data or reduce dimensionality.
In healthcare applications, SVD and related matrix factorization methods are used for tasks including identifying patient subgroups based on similar clinical patterns, discovering latent disease phenotypes from symptom and biomarker data, and imputing missing values in partially observed clinical matrices. However, these methods face equity challenges similar to those we saw with PCA. The latent factors discovered by matrix factorization may primarily reflect patterns in well-represented populations while missing important patterns specific to minority populations. Moreover, when using matrix factorization for missing data imputation, the imputed values will reflect the patterns learned from observed data, which may not be appropriate if missingness patterns differ systematically across groups.
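Before the implementation, a minimal NumPy sketch shows low-rank approximation in action. By the Eckart-Young theorem, the truncated SVD gives the best rank-k approximation in the Frobenius norm:
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(50, 10))

U, s, Vt = np.linalg.svd(X_demo, full_matrices=False)
k = 3
X_k = (U[:, :k] * s[:k]) @ Vt[:k, :]  # best rank-3 reconstruction

rel_error = np.linalg.norm(X_demo - X_k) / np.linalg.norm(X_demo)
print(f"relative error of rank-{k} approximation: {rel_error:.3f}")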
Let me implement SVD-based methods with explicit equity considerations:
from scipy.linalg import svd as scipy_svd
class EquityAwareSVD:
"""
Singular Value Decomposition with equity-aware analysis and applications.
"""
def __init__(self, n_components: Optional[int] = None):
"""
Initialize equity-aware SVD.
Args:
n_components: Number of components to retain for low-rank approximation
"""
self.n_components = n_components
self.U_: Optional[np.ndarray] = None
self.singular_values_: Optional[np.ndarray] = None
self.Vt_: Optional[np.ndarray] = None
logger.info(f"Initialized equity-aware SVD with n_components={n_components}")
def fit(
self,
X: np.ndarray,
center: bool = True
) -> 'EquityAwareSVD':
"""
Compute SVD of data matrix.
Args:
X: (n_samples, n_features) data matrix
center: Whether to center data before SVD
Returns:
Self for method chaining
"""
if center:
self.mean_ = np.mean(X, axis=0)
X_centered = X - self.mean_
else:
self.mean_ = np.zeros(X.shape[1])
X_centered = X
# Compute full SVD
U, s, Vt = scipy_svd(X_centered, full_matrices=False)
# Retain specified number of components
if self.n_components is not None:
self.U_ = U[:, :self.n_components]
self.singular_values_ = s[:self.n_components]
self.Vt_ = Vt[:self.n_components, :]
else:
self.U_ = U
self.singular_values_ = s
self.Vt_ = Vt
return self
def transform(self, X: np.ndarray) -> np.ndarray:
"""
Project data into latent space.
Args:
X: (n_samples, n_features) data matrix
Returns:
(n_samples, n_components) latent representation
"""
if self.Vt_ is None or self.singular_values_ is None:
raise ValueError("Must call fit() first")
X_centered = X - self.mean_
return X_centered @ self.Vt_.T
def reconstruct(self, X: np.ndarray) -> np.ndarray:
"""
Reconstruct data from low-rank approximation.
Args:
X: (n_samples, n_features) original data matrix
Returns:
(n_samples, n_features) reconstructed data
"""
if self.U_ is None or self.singular_values_ is None or self.Vt_ is None:
raise ValueError("Must call fit() first")
X_latent = self.transform(X)  # transform() centers internally
X_reconstructed = X_latent @ self.Vt_
return X_reconstructed + self.mean_
def impute_missing_with_svd(
self,
X: np.ndarray,
demographic_labels: Optional[np.ndarray] = None,
max_iterations: int = 100,
tolerance: float = 1e-4
) -> Tuple[np.ndarray, Dict[str, Any]]:
"""
Impute missing values using iterative SVD-based matrix completion.
This method alternates between SVD decomposition and missing value
imputation until convergence. With equity monitoring, it tracks whether
imputation quality differs across demographic groups.
Args:
X: (n_samples, n_features) data matrix with missing values (np.nan)
demographic_labels: Optional group labels for equity monitoring
max_iterations: Maximum iterations for convergence
tolerance: Convergence tolerance for change in imputed values
Returns:
Tuple of (imputed data matrix, imputation quality metrics)
"""
# Initialize missing values with column means
X_imputed = X.copy()
missing_mask = np.isnan(X)
for j in range(X.shape[1]):
col_mean = np.nanmean(X[:, j])
X_imputed[missing_mask[:, j], j] = col_mean
# Iterative refinement
for iteration in range(max_iterations):
X_prev = X_imputed.copy()
# Fit SVD on current imputed data
self.fit(X_imputed, center=True)
# Reconstruct and update missing values
X_reconstructed = self.reconstruct(X_imputed)
X_imputed[missing_mask] = X_reconstructed[missing_mask]
# Check convergence
change = np.sqrt(np.mean((X_imputed[missing_mask] - X_prev[missing_mask])**2))
if change < tolerance:
logger.info(f"SVD imputation converged after {iteration + 1} iterations")
break
# Assess imputation quality across groups
quality_metrics = {'converged': change < tolerance, 'final_change': change}
if demographic_labels is not None:
quality_metrics['group_imputation_quality'] = self._assess_imputation_quality(
X, X_imputed, missing_mask, demographic_labels
)
return X_imputed, quality_metrics
def _assess_imputation_quality(
self,
X_original: np.ndarray,
X_imputed: np.ndarray,
missing_mask: np.ndarray,
demographic_labels: np.ndarray
) -> Dict[str, Any]:
"""
Assess whether imputation quality differs across demographic groups.
We can't directly assess accuracy since we don't know true values,
but we can check whether imputed values have reasonable statistical
properties within each group.
"""
unique_groups = np.unique(demographic_labels[~pd.isna(demographic_labels)])
quality_by_group = {}
for group in unique_groups:
group_mask = demographic_labels == group
# For observed values in this group, check distributional match
# group_mask is one-dimensional over samples; broadcast it across the feature
# axis before combining with the two-dimensional missingness mask
group_observed = X_original[group_mask[:, None] & ~missing_mask]
group_imputed = X_imputed[group_mask[:, None] & missing_mask]
if len(group_observed) > 0 and len(group_imputed) > 0:
# Compare means
observed_mean = np.nanmean(group_observed)
imputed_mean = np.mean(group_imputed)
mean_difference = abs(observed_mean - imputed_mean)
# Compare standard deviations
observed_std = np.nanstd(group_observed)
imputed_std = np.std(group_imputed)
std_ratio = imputed_std / observed_std if observed_std > 0 else 1.0
quality_by_group[str(group)] = {
'mean_difference': mean_difference,
'std_ratio': std_ratio,
'n_imputed': len(group_imputed),
'n_observed': len(group_observed)
}
return quality_by_group
class EquityAwareMatrixFactorization:
"""
Non-negative matrix factorization with equity considerations for
discovering latent patient subgroups or disease phenotypes.
"""
def __init__(
self,
n_components: int,
max_iterations: int = 200,
regularization: float = 0.01
):
"""
Initialize matrix factorization model.
Args:
n_components: Number of latent factors
max_iterations: Maximum iterations for optimization
regularization: L2 regularization strength
"""
self.n_components = n_components
self.max_iterations = max_iterations
self.regularization = regularization
self.W_: Optional[np.ndarray] = None # Patient factors
self.H_: Optional[np.ndarray] = None # Feature factors
logger.info(
f"Initialized matrix factorization with {n_components} components, "
f"regularization={regularization}"
)
def fit(
self,
X: np.ndarray,
demographic_labels: Optional[np.ndarray] = None
) -> 'EquityAwareMatrixFactorization':
"""
Fit non-negative matrix factorization with multiplicative updates.
Factorizes X ≈ WH where W is (n_samples, n_components) and
H is (n_components, n_features).
Args:
X: (n_samples, n_features) non-negative data matrix
demographic_labels: Optional group labels for equity monitoring
Returns:
Self for method chaining
"""
if np.any(X < 0):
raise ValueError("Non-negative matrix factorization requires non-negative data")
n_samples, n_features = X.shape
# Initialize factors with small random values
np.random.seed(42)
self.W_ = np.random.rand(n_samples, self.n_components) * 0.01
self.H_ = np.random.rand(self.n_components, n_features) * 0.01
# Multiplicative update algorithm
for iteration in range(self.max_iterations):
# Update H
numerator = self.W_.T @ X
denominator = self.W_.T @ (self.W_ @ self.H_) + self.regularization
self.H_ = self.H_ * (numerator / (denominator + 1e-10))
# Update W
numerator = X @ self.H_.T
denominator = (self.W_ @ self.H_) @ self.H_.T + self.regularization
self.W_ = self.W_ * (numerator / (denominator + 1e-10))
# Compute reconstruction error periodically
if iteration % 20 == 0:
reconstruction_error = np.linalg.norm(X - self.W_ @ self.H_, 'fro')
logger.debug(f"Iteration {iteration}, reconstruction error: {reconstruction_error:.4f}")
# Analyze whether factors are balanced across demographic groups
if demographic_labels is not None:
self._analyze_factor_distribution(demographic_labels)
return self
def transform(self, X: np.ndarray) -> np.ndarray:
"""
Project new data into latent factor space.
Args:
X: (n_samples, n_features) data matrix
Returns:
(n_samples, n_components) factor representation
"""
if self.H_ is None:
raise ValueError("Must call fit() first")
# Solve for W given X and H using multiplicative updates
W = np.random.rand(X.shape[0], self.n_components) * 0.01
for _ in range(50): # Fewer iterations for transform
numerator = X @ self.H_.T
denominator = (W @ self.H_) @ self.H_.T + self.regularization
W = W * (numerator / (denominator + 1e-10))
return W
def _analyze_factor_distribution(
self,
demographic_labels: np.ndarray
) -> None:
"""
Analyze whether latent factors are distributed equitably across
demographic groups or if certain factors are concentrated in
specific populations.
"""
unique_groups = np.unique(demographic_labels[~pd.isna(demographic_labels)])
for component in range(self.n_components):
factor_loadings = self.W_[:, component]
# Compute mean loading for each group
group_means = {}
for group in unique_groups:
group_mask = demographic_labels == group
group_means[str(group)] = np.mean(factor_loadings[group_mask])
# Check for concentration in specific groups
means = list(group_means.values())
if len(means) > 1:
cv = np.std(means) / np.mean(means)
if cv > 0.5:
logger.warning(
f"Factor {component} shows uneven distribution across groups "
f"(CV={cv:.3f}). This factor may primarily capture patterns "
f"from specific populations rather than being universally applicable."
)
These SVD and matrix factorization implementations include equity safeguards at multiple levels. The SVD-based imputation tracks imputation quality across demographic groups to detect when imputed values have different statistical properties for different populations. The matrix factorization analysis checks whether discovered latent factors are concentrated in specific demographic groups, which could indicate that the factorization is primarily capturing patterns from those groups while missing important structure in other populations.
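A hypothetical use of the SVD imputer with equity monitoring follows; missingness is deliberately made more common in one synthetic group to mimic differential data completeness:
rng = np.random.default_rng(4)
X_full = rng.normal(0, 1, (300, 6))
labels = rng.choice(['A', 'B'], 300)
miss_rate = np.where(labels[:, None] == 'B', 0.30, 0.05)
missing = rng.random(X_full.shape) < miss_rate

X_obs = X_full.copy()
X_obs[missing] = np.nan

svd_model = EquityAwareSVD(n_components=3)
X_imputed, metrics = svd_model.impute_missing_with_svd(
    X_obs, demographic_labels=labels
)
print(metrics['group_imputation_quality'])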
2.3 Probability Theory for Clinical Decision Making
Probability theory provides the mathematical framework for reasoning about uncertainty, which is fundamental to clinical decision making where diagnoses are uncertain, prognoses are probabilistic, and treatment effects vary across individuals. However, applying probability theory to healthcare data requires careful attention to where the probabilities come from and what they represent. Probability distributions in healthcare often reflect social and historical processes rather than natural phenomena, and standard probabilistic assumptions can fail in ways that systematically disadvantage certain populations.
2.3.1 Probability Distributions and Clinical Data
A probability distribution describes the relative likelihood of different outcomes or the relative frequency with which different values of a variable occur. In healthcare applications, we use probability distributions to model diagnostic test results, disease prevalence in populations, treatment response rates, survival times, and many other uncertain quantities. However, the interpretation of these distributions requires understanding not just the mathematics but also the data generation process.
Consider a seemingly straightforward example: the distribution of blood pressure measurements in a patient population. We might model systolic blood pressure as following a normal distribution with a mean of 120 mmHg and a standard deviation of 15 mmHg. But this statistical model obscures important realities. The observed distribution reflects not just biological variation but also who has access to blood pressure measurement, how measurement protocols vary across clinical settings, whether measurements are taken in stressful environments like emergency departments or in calmer primary care settings, and systematic differences in measurement accuracy across patient body sizes and skin tones. The “true” distribution of underlying physiological blood pressure may be quite different from the observed distribution of recorded measurements.
The equity implications become clearer when we recognize that these measurement and selection processes are not uniform across populations. Patients with regular primary care access will have blood pressure distributions that reflect stable ongoing management, while patients who only interface with healthcare during acute illness will have distributions skewed toward higher values. Patients seen primarily in community health centers may have different measurement protocols than those seen in tertiary care centers. If we build prediction models or decision rules based on probability distributions estimated from one population and apply them to another, we may make systematically biased predictions.
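A small simulation, with all numbers invented, makes this concrete. Pooling two subpopulations measured under different circumstances and then thresholding on pooled statistics flags them at very different rates:
rng = np.random.default_rng(5)
managed = rng.normal(122, 12, 800)  # patients with stable primary care access
acute = rng.normal(142, 18, 200)    # patients measured mainly during acute visits
sbp = np.concatenate([managed, acute])
print(f"pooled mean = {sbp.mean():.1f}, pooled SD = {sbp.std():.1f}")

# A rule tuned on pooled statistics, e.g. flag anyone above mean + 1 SD,
# behaves very differently in the two subgroups.
threshold = sbp.mean() + sbp.std()
print(f"flagged among managed patients: {(managed > threshold).mean():.1%}")
print(f"flagged among acute patients: {(acute > threshold).mean():.1%}")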
Let me implement probability distribution modeling with explicit equity considerations:
from scipy import stats
class EquityAwareProbabilityModeling:
"""
Probability distribution fitting and analysis with explicit attention to
whether distributions differ systematically across patient populations.
"""
def __init__(self):
"""Initialize probability modeling framework."""
self.fitted_distributions_: Dict[str, Dict[str, Any]] = {}
logger.info("Initialized equity-aware probability modeling")
def fit_distribution(
self,
data: np.ndarray,
distribution_type: str = 'normal',
demographic_labels: Optional[np.ndarray] = None,
group_specific: bool = False
) -> Dict[str, Any]:
"""
Fit probability distribution to data with optional group-specific fitting.
Args:
data: 1D array of observed values
distribution_type: Type of distribution ('normal', 'lognormal', 'gamma', 'beta')
demographic_labels: Optional group labels
group_specific: Whether to fit separate distributions for each group
Returns:
Dictionary with fitted parameters and diagnostic information
"""
# Remove missing values
data_clean = data[~np.isnan(data)]
if len(data_clean) == 0:
raise ValueError("No non-missing data to fit distribution")
results = {
'distribution_type': distribution_type,
'group_specific': group_specific,
'global_fit': None,
'group_fits': {},
'distribution_homogeneity_test': None
}
# Fit global distribution
if distribution_type == 'normal':
mu, sigma = stats.norm.fit(data_clean)
results['global_fit'] = {'mean': mu, 'std': sigma}
fitted_dist = stats.norm(mu, sigma)
elif distribution_type == 'lognormal':
# Data must be positive for lognormal
if np.any(data_clean <= 0):
raise ValueError("Lognormal distribution requires positive data")
shape, loc, scale = stats.lognorm.fit(data_clean)
results['global_fit'] = {'shape': shape, 'loc': loc, 'scale': scale}
fitted_dist = stats.lognorm(shape, loc, scale)
elif distribution_type == 'gamma':
if np.any(data_clean < 0):
raise ValueError("Gamma distribution requires non-negative data")
alpha, loc, beta = stats.gamma.fit(data_clean)
results['global_fit'] = {'alpha': alpha, 'loc': loc, 'beta': beta}
fitted_dist = stats.gamma(alpha, loc, beta)
else:
raise ValueError(f"Unknown distribution type: {distribution_type}")
# Assess goodness of fit
results['global_fit']['goodness_of_fit'] = self._assess_goodness_of_fit(
data_clean, fitted_dist
)
# Fit group-specific distributions if requested
if group_specific and demographic_labels is not None:
unique_groups = np.unique(demographic_labels[~pd.isna(demographic_labels)])
for group in unique_groups:
group_mask = demographic_labels == group
group_data = data[group_mask]
group_data_clean = group_data[~np.isnan(group_data)]
if len(group_data_clean) > 10: # Need sufficient data
if distribution_type == 'normal':
mu_g, sigma_g = stats.norm.fit(group_data_clean)
results['group_fits'][str(group)] = {
'mean': mu_g, 'std': sigma_g
}
# Similar for other distribution types...
# Test for distribution homogeneity across groups
results['distribution_homogeneity_test'] = self._test_distribution_homogeneity(
data, demographic_labels, distribution_type
)
return results
def _assess_goodness_of_fit(
self,
data: np.ndarray,
fitted_distribution
) -> Dict[str, float]:
"""
Assess how well the fitted distribution matches the observed data.
Uses the Kolmogorov-Smirnov test and Q-Q plot correlation.
"""
# Kolmogorov-Smirnov test
ks_statistic, ks_pvalue = stats.kstest(data, fitted_distribution.cdf)
# Q-Q plot correlation
theoretical_quantiles = fitted_distribution.ppf(
np.linspace(0.01, 0.99, len(data))
)
sample_quantiles = np.sort(data)
qq_correlation = np.corrcoef(theoretical_quantiles, sample_quantiles)[0, 1]
goodness_of_fit = {
'ks_statistic': ks_statistic,
'ks_pvalue': ks_pvalue,
'qq_correlation': qq_correlation
}
if ks_pvalue < 0.05:
logger.warning(
f"Poor goodness of fit (KS p-value = {ks_pvalue:.4f}). "
f"The assumed distribution may not be appropriate for this data."
)
return goodness_of_fit
def _test_distribution_homogeneity(
self,
data: np.ndarray,
demographic_labels: np.ndarray,
distribution_type: str
) -> Dict[str, Any]:
"""
Test whether the distribution is homogeneous across demographic groups.
Uses Kolmogorov-Smirnov test for comparing distributions.
"""
unique_groups = np.unique(demographic_labels[~pd.isna(demographic_labels)])
if len(unique_groups) < 2:
return {'test': 'insufficient_groups'}
# Pairwise KS tests
pairwise_tests = []
for i, group1 in enumerate(unique_groups):
for group2 in unique_groups[i+1:]:
mask1 = demographic_labels == group1
mask2 = demographic_labels == group2
data1 = data[mask1]
data2 = data[mask2]
data1_clean = data1[~np.isnan(data1)]
data2_clean = data2[~np.isnan(data2)]
if len(data1_clean) > 5 and len(data2_clean) > 5:
ks_stat, ks_pval = stats.ks_2samp(data1_clean, data2_clean)
pairwise_tests.append({
'group1': str(group1),
'group2': str(group2),
'ks_statistic': ks_stat,
'ks_pvalue': ks_pval,
'distributions_differ': ks_pval < 0.05
})
# Check if any pairs have significantly different distributions
n_different = sum(1 for test in pairwise_tests if test['distributions_differ'])
if n_different > 0:
logger.warning(
f"Distribution differs significantly across {n_different} group pairs. "
f"Using a single distribution for all groups may introduce bias."
)
return {
'test': 'ks_2sample',
'pairwise_comparisons': pairwise_tests,
'n_pairs_differ': n_different
}
def compute_conditional_probability(
self,
outcome: np.ndarray,
condition: np.ndarray,
outcome_value: Any,
condition_value: Any,
demographic_labels: Optional[np.ndarray] = None
) -> Dict[str, float]:
"""
Compute P(outcome = outcome_value | condition = condition_value) with
optional stratification by demographic groups.
This is useful for computing diagnostic test sensitivities, disease
prevalence given risk factors, etc.
Args:
outcome: Array of outcome values
condition: Array of conditioning variable values
outcome_value: Specific outcome value of interest
condition_value: Specific condition value
demographic_labels: Optional group labels for stratified analysis
Returns:
Dictionary with conditional probabilities overall and by group
"""
# Overall conditional probability
condition_mask = condition == condition_value
n_condition = np.sum(condition_mask)
if n_condition == 0:
raise ValueError(f"No observations with condition={condition_value}")
outcome_and_condition_mask = (outcome == outcome_value) & condition_mask
n_outcome_and_condition = np.sum(outcome_and_condition_mask)
prob_overall = n_outcome_and_condition / n_condition
results = {
'overall_probability': prob_overall,
'n_condition': int(n_condition),
'n_outcome_and_condition': int(n_outcome_and_condition)
}
# Stratified probabilities
if demographic_labels is not None:
unique_groups = np.unique(demographic_labels[~pd.isna(demographic_labels)])
group_probabilities = {}
for group in unique_groups:
group_mask = demographic_labels == group
group_condition_mask = condition_mask & group_mask
n_group_condition = np.sum(group_condition_mask)
if n_group_condition > 0:
group_outcome_mask = outcome_and_condition_mask & group_mask
n_group_outcome = np.sum(group_outcome_mask)
group_prob = n_group_outcome / n_group_condition
group_probabilities[str(group)] = {
'probability': float(group_prob),
'n_condition': int(n_group_condition),
'n_outcome_and_condition': int(n_group_outcome)
}
results['group_probabilities'] = group_probabilities
# Test for homogeneity
probs = [v['probability'] for v in group_probabilities.values()]
if len(probs) > 1:
prob_range = max(probs) - min(probs)
if prob_range > 0.2:
logger.warning(
f"Conditional probability varies substantially across groups "
f"(range = {prob_range:.3f}). Pooled estimates may be misleading."
)
return results
This probability modeling implementation explicitly checks whether distributions are homogeneous across demographic groups and warns when pooled estimates may be inappropriate. The conditional probability calculation provides stratified estimates that reveal when relationships differ across populations, which is essential for avoiding biased predictions that assume universality when relationships are actually heterogeneous.
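A minimal usage sketch follows, assuming the class above is in scope; the data are synthetic, and the group means are chosen arbitrarily to force detectable heterogeneity:
import numpy as np
rng = np.random.default_rng(0)
# Synthetic systolic blood pressure for two groups with shifted distributions
labels = np.repeat(np.array(['A', 'B']), 500)
sbp = np.concatenate([
    rng.normal(120, 15, 500),  # group A
    rng.normal(132, 18, 500),  # group B
])
modeler = EquityAwareProbabilityModeling()
fit = modeler.fit_distribution(
    sbp,
    distribution_type='normal',
    demographic_labels=labels,
    group_specific=True
)
print(fit['global_fit'])                     # pooled mean and std
print(fit['distribution_homogeneity_test'])  # pairwise two-sample KS results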
2.3.2 Bayes’ Theorem and Prior Probabilities in Healthcare
Bayes’ theorem provides the mathematical foundation for updating beliefs based on evidence, formally stating that the posterior probability of a hypothesis given observed data is proportional to the likelihood of the data given the hypothesis times the prior probability of the hypothesis. In healthcare, Bayesian reasoning appears in diagnostic test interpretation, where we update our assessment of disease probability based on test results, and in clinical prediction models that combine prior information with patient-specific features.
The mathematical statement is elegant: $P(\text{disease} \mid \text{test}+) = \dfrac{P(\text{test}+ \mid \text{disease}) \, P(\text{disease})}{P(\text{test}+)}$, where $P(\text{disease} \mid \text{test}+)$ is the posterior probability of disease given a positive test, $P(\text{test}+ \mid \text{disease})$ is the test sensitivity, $P(\text{disease})$ is the prior probability or prevalence, and $P(\text{test}+)$ is the marginal probability of a positive test. However, applying this formula in practice requires careful attention to where the prior probabilities come from and whether they are appropriate for the specific patient.
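To make the arithmetic concrete, consider a hypothetical screening test with sensitivity 0.90 and specificity 0.90 in a population where the documented prior prevalence is 5%. Then $P(\text{test}+) = 0.90 \times 0.05 + 0.10 \times 0.95 = 0.140$ and $P(\text{disease} \mid \text{test}+) = 0.045 / 0.140 \approx 0.32$: even after a positive result from a reasonably accurate test, the probability of disease is only about one in three. If the true prevalence in this patient's population were actually 10% rather than the documented 5%, the same calculation would give $0.090 / 0.180 = 0.50$, showing how an underestimated prior directly depresses the posterior.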
The health equity challenge is that prior probabilities often reflect historical patterns of disease diagnosis that may be systematically biased. If a condition has been historically underdiagnosed in certain populations due to lack of access to care, implicit bias in clinical evaluation, or other systemic factors, then prevalence estimates derived from diagnostic databases will underestimate true disease burden in those populations. Using these biased priors in Bayesian reasoning will then lead to systematic underdiagnosis that perpetuates the original inequity.
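This can be made precise with a simple model: if the true prevalence is $p$ but only a fraction $f$ of true cases are ever diagnosed and recorded, databases built from diagnoses will show an observed prevalence of roughly $\hat{p} = f p$, so the true prevalence is recovered as $p = \hat{p} / f$. The adjustment utility implemented later in this section, adjust_prior_for_access_bias, applies exactly this correction while treating $f$ as an assumed, uncertain quantity rather than a known constant.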
Consider screening for diabetic retinopathy in patients with diabetes. The prior probability of retinopathy depends on how long diabetes has been present, how well controlled blood sugar has been, and other factors. But it also depends on how consistently the patient has had access to diabetic care and whether previous retinal examinations were performed and documented. For a patient with intermittent healthcare access whose diabetes may have been present but undiagnosed for years, the standard prior probability based on documented diabetes duration will systematically underestimate retinopathy risk. A Bayesian diagnostic approach using that prior will then underweight positive screening results, potentially leading to delayed treatment.
Let me implement Bayesian reasoning with explicit attention to prior probability issues:
class EquityAwareBayesianReasoning:
"""
Bayesian inference with explicit attention to bias in prior probabilities
and sensitivity analysis for how results change with different priors.
"""
def __init__(self):
"""Initialize Bayesian reasoning framework."""
logger.info("Initialized equity-aware Bayesian reasoning")
def diagnostic_test_interpretation(
self,
test_result: bool,
sensitivity: float,
specificity: float,
prior_probability: float,
demographic_group: Optional[str] = None,
prior_confidence: str = 'medium'
) -> Dict[str, Any]:
"""
Interpret diagnostic test result using Bayes' theorem with sensitivity
analysis for prior probability uncertainty.
Args:
test_result: Whether test was positive (True) or negative (False)
sensitivity: P(test+ | disease)
specificity: P(test- | no disease)
prior_probability: P(disease) before test
demographic_group: Optional group label for context
prior_confidence: How confident we are in prior ('high', 'medium', 'low')
Returns:
Dictionary with posterior probability and sensitivity analysis
"""
if not (0 <= sensitivity <= 1 and 0 <= specificity <= 1):
raise ValueError("Sensitivity and specificity must be between 0 and 1")
if not (0 <= prior_probability <= 1):
raise ValueError("Prior probability must be between 0 and 1")
# Compute posterior using Bayes' theorem
if test_result: # Positive test
# P(disease | test+) = P(test+ | disease) * P(disease) / P(test+)
# where P(test+) = P(test+ | disease)*P(disease) + P(test+ | no disease)*P(no disease)
p_test_pos = sensitivity * prior_probability + (1 - specificity) * (1 - prior_probability)
posterior = (sensitivity * prior_probability) / p_test_pos if p_test_pos > 0 else 0
else: # Negative test
# P(disease | test-) = P(test- | disease) * P(disease) / P(test-)
p_test_neg = (1 - sensitivity) * prior_probability + specificity * (1 - prior_probability)
posterior = ((1 - sensitivity) * prior_probability) / p_test_neg if p_test_neg > 0 else 0
results = {
'test_result': 'positive' if test_result else 'negative',
'prior_probability': prior_probability,
'posterior_probability': posterior,
'probability_change': posterior - prior_probability,
'demographic_group': demographic_group
}
# Sensitivity analysis for prior uncertainty
if prior_confidence == 'low':
prior_range = (max(0, prior_probability - 0.2), min(1, prior_probability + 0.2))
elif prior_confidence == 'medium':
prior_range = (max(0, prior_probability - 0.1), min(1, prior_probability + 0.1))
else: # high confidence
prior_range = (max(0, prior_probability - 0.05), min(1, prior_probability + 0.05))
# Compute posterior range
posterior_range = []
for prior in prior_range:
if test_result:
p_test_pos = sensitivity * prior + (1 - specificity) * (1 - prior)
post = (sensitivity * prior) / p_test_pos if p_test_pos > 0 else 0
else:
p_test_neg = (1 - sensitivity) * prior + specificity * (1 - prior)
post = ((1 - sensitivity) * prior) / p_test_neg if p_test_neg > 0 else 0
posterior_range.append(post)
results['sensitivity_analysis'] = {
'prior_confidence': prior_confidence,
'prior_range': prior_range,
'posterior_range': tuple(posterior_range)
}
# Interpretation guidance
uncertainty_width = posterior_range[1] - posterior_range[0]
if uncertainty_width > 0.2:
results['interpretation'] = (
f"Substantial uncertainty in posterior probability due to uncertain prior. "
f"Posterior could be anywhere from {posterior_range[0]:.3f} to {posterior_range[1]:.3f}. "
f"Consider additional information to refine prior estimate."
)
else:
results['interpretation'] = (
f"Posterior probability is {posterior:.3f}, representing "
f"{'an increase' if posterior > prior_probability else 'a decrease'} "
f"from prior of {prior_probability:.3f}."
)
return results
def adjust_prior_for_access_bias(
self,
observed_prevalence: float,
access_adjustment_factor: float
) -> Dict[str, float]:
"""
Adjust disease prevalence estimates for known access-to-care bias.
If certain populations have limited healthcare access, observed prevalence
may underestimate true disease burden. This function applies an adjustment
based on estimated diagnosis rates.
Args:
observed_prevalence: Disease prevalence in diagnosed population
access_adjustment_factor: Estimated fraction of cases that are diagnosed
(1.0 = all cases diagnosed, 0.5 = half diagnosed)
Returns:
Dictionary with adjusted prevalence and uncertainty bounds
"""
if not (0 < access_adjustment_factor <= 1.0):
raise ValueError("Access adjustment factor must be between 0 and 1")
# Adjusted prevalence assuming observed cases are a fraction of true cases
adjusted_prevalence = observed_prevalence / access_adjustment_factor
# Cap at 1.0 (can't exceed 100% prevalence)
adjusted_prevalence = min(adjusted_prevalence, 1.0)
# Uncertainty bounds - conservative estimates
lower_bound = observed_prevalence # At minimum, prevalence is observed rate
upper_bound = min(observed_prevalence / (access_adjustment_factor * 0.7), 1.0)
results = {
'observed_prevalence': observed_prevalence,
'adjusted_prevalence': adjusted_prevalence,
'adjustment_factor': access_adjustment_factor,
'uncertainty_bounds': (lower_bound, upper_bound),
'interpretation': (
f"Observed prevalence of {observed_prevalence:.3f} may underestimate "
f"true prevalence if only {access_adjustment_factor:.1%} of cases are diagnosed. "
f"Adjusted estimate is {adjusted_prevalence:.3f}, with true value likely "
f"between {lower_bound:.3f} and {upper_bound:.3f}."
)
}
return results
def compare_bayesian_vs_frequentist(
self,
observed_successes: int,
total_trials: int,
prior_alpha: float = 1.0,
prior_beta: float = 1.0,
demographic_group: Optional[str] = None
) -> Dict[str, Any]:
"""
Compare Bayesian posterior with frequentist confidence interval for
probability estimation, showing how prior information affects inference.
Uses Beta-Binomial conjugate prior for Bayesian analysis.
Args:
observed_successes: Number of successful outcomes observed
total_trials: Total number of trials
prior_alpha: Prior Beta distribution alpha parameter (successes + 1)
prior_beta: Prior Beta distribution beta parameter (failures + 1)
demographic_group: Optional group label
Returns:
Dictionary comparing Bayesian and frequentist estimates
"""
if observed_successes > total_trials or observed_successes < 0:
raise ValueError("Observed successes must be between 0 and total_trials")
# Frequentist estimate: maximum likelihood (observed proportion)
freq_estimate = observed_successes / total_trials if total_trials > 0 else 0
# Frequentist confidence interval (Wilson score interval)
from statsmodels.stats.proportion import proportion_confint
freq_ci = proportion_confint(observed_successes, total_trials, alpha=0.05, method='wilson')
# Bayesian estimate: posterior mean of Beta distribution
posterior_alpha = prior_alpha + observed_successes
posterior_beta = prior_beta + (total_trials - observed_successes)
bayes_estimate = posterior_alpha / (posterior_alpha + posterior_beta)
# Bayesian credible interval (equal-tailed 95% interval)
bayes_ci = (
stats.beta.ppf(0.025, posterior_alpha, posterior_beta),
stats.beta.ppf(0.975, posterior_alpha, posterior_beta)
)
results = {
'observed_successes': observed_successes,
'total_trials': total_trials,
'observed_proportion': freq_estimate,
'frequentist': {
'estimate': freq_estimate,
'confidence_interval': freq_ci,
'interpretation': '95% confidence interval'
},
'bayesian': {
'prior': {'alpha': prior_alpha, 'beta': prior_beta},
'posterior': {'alpha': posterior_alpha, 'beta': posterior_beta},
'estimate': bayes_estimate,
'credible_interval': bayes_ci,
'interpretation': '95% credible interval'
},
'difference': bayes_estimate - freq_estimate,
'demographic_group': demographic_group
}
# Interpretation of prior influence
if abs(results['difference']) > 0.05:
results['prior_influence'] = (
f"Bayesian estimate ({bayes_estimate:.3f}) differs substantially from "
f"frequentist estimate ({freq_estimate:.3f}), indicating strong prior influence. "
f"With limited data, prior beliefs matter significantly."
)
else:
results['prior_influence'] = (
f"Bayesian and frequentist estimates are similar, indicating that observed "
f"data dominates prior beliefs with this sample size."
)
return results
This Bayesian reasoning implementation includes sensitivity analysis for prior probability uncertainty and explicit tools for adjusting priors when we suspect systematic bias in observed prevalence estimates. The comparison between Bayesian and frequentist approaches helps illustrate how prior information affects inference, which is particularly important when we have reason to question whether standard priors are appropriate for specific populations.
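As a brief hypothetical walk-through (the test characteristics and the access adjustment factor are illustrative assumptions, not measured values):
reasoner = EquityAwareBayesianReasoning()
# Adjust a documented prevalence of 5% under the assumption that only
# about 60% of true cases in this population are ever diagnosed
adjusted = reasoner.adjust_prior_for_access_bias(
    observed_prevalence=0.05,
    access_adjustment_factor=0.6
)
print(adjusted['interpretation'])
# Interpret a positive screen using the adjusted prior, flagging the prior
# as low-confidence so the sensitivity analysis widens accordingly
result = reasoner.diagnostic_test_interpretation(
    test_result=True,
    sensitivity=0.90,
    specificity=0.90,
    prior_probability=adjusted['adjusted_prevalence'],
    prior_confidence='low'
)
print(result['posterior_probability'])
print(result['sensitivity_analysis']['posterior_range'])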
2.3.3 Conditional Independence and Graphical Models
Conditional independence is a critical concept that underlies much of probabilistic modeling in healthcare AI. Two variables $X$ and $Y$ are conditionally independent given $Z$ if $P(X, Y \mid Z) = P(X \mid Z)\,P(Y \mid Z)$, meaning that once we know $Z$, learning $X$ provides no additional information about $Y$. This concept enables us to build tractable probabilistic models of complex systems by decomposing joint probability distributions into products of simpler conditional distributions.
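This decomposition is the basis of probabilistic graphical models: given a directed acyclic graph over variables $X_1, \dots, X_n$, the joint distribution factors as $P(X_1, \dots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{pa}(X_i))$, where $\mathrm{pa}(X_i)$ denotes the parents of $X_i$ in the graph. Every edge omitted from the graph is a conditional independence claim, so a single unmeasured confounder that influences many variables at once can falsify many of those claims simultaneously.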
However, conditional independence assumptions often fail in healthcare data in ways that reflect social and structural confounding rather than measurement error. Consider predicting hospital readmission using clinical variables like prior hospitalizations, chronic conditions, and functional status. Standard modeling approaches might assume that these clinical variables are conditionally independent given some latent health status. But in reality, many clinical variables are jointly influenced by social determinants like housing stability, food security, and access to primary care. Unmeasured confounders like structural racism affect multiple variables simultaneously, violating conditional independence assumptions in systematic ways.
The implications for model fairness are profound. When we build predictive models that assume conditional independence but the assumption fails due to unmeasured confounding that differs across populations, the resulting predictions will be systematically biased. The bias won’t necessarily be detected by standard model validation because the conditional independence violations are embedded in the training data structure.
Let me implement tools for testing and working with conditional independence assumptions:
from scipy.stats import chi2_contingency, pearsonr
from itertools import combinations
class ConditionalIndependenceAnalysis:
"""
Tools for testing conditional independence assumptions and detecting
violations that may indicate confounding or model misspecification.
"""
def __init__(self):
"""Initialize conditional independence analysis framework."""
logger.info("Initialized conditional independence analysis")
def test_conditional_independence(
self,
X: np.ndarray,
Y: np.ndarray,
Z: np.ndarray,
test_type: str = 'partial_correlation'
) -> Dict[str, Any]:
"""
Test whether X and Y are conditionally independent given Z.
Multiple test types are provided because appropriate tests depend on
variable types (continuous, categorical) and distributional assumptions.
Args:
X: First variable (n_samples,)
Y: Second variable (n_samples,)
Z: Conditioning variable(s) (n_samples,) or (n_samples, n_features)
test_type: Type of test ('partial_correlation', 'cmi', 'stratified')
Returns:
Dictionary with test results and interpretation
"""
if len(X) != len(Y) or len(X) != (Z.shape[0] if Z.ndim > 1 else len(Z)):
raise ValueError("X, Y, and Z must have same number of samples")
results = {'test_type': test_type}
if test_type == 'partial_correlation':
# Compute partial correlation between X and Y given Z
partial_corr, pvalue = self._partial_correlation(X, Y, Z)
results['partial_correlation'] = partial_corr
results['pvalue'] = pvalue
results['conditionally_independent'] = pvalue > 0.05
results['interpretation'] = (
f"Partial correlation is {partial_corr:.4f} (p={pvalue:.4f}). "
f"{'Evidence of conditional independence.' if pvalue > 0.05 else 'Conditional dependence detected.'}"
)
elif test_type == 'stratified':
# Stratify by Z and test independence within strata
results.update(self._stratified_independence_test(X, Y, Z))
else:
raise ValueError(f"Unknown test type: {test_type}")
return results
def _partial_correlation(
self,
X: np.ndarray,
Y: np.ndarray,
Z: np.ndarray
) -> Tuple[float, float]:
"""
Compute partial correlation between X and Y controlling for Z.
This is done by regressing X on Z and Y on Z, then computing the
correlation between the residuals.
"""
from sklearn.linear_model import LinearRegression
# Ensure Z is 2D
if Z.ndim == 1:
Z = Z.reshape(-1, 1)
# Regress X on Z
model_x = LinearRegression()
model_x.fit(Z, X)
residuals_x = X - model_x.predict(Z)
# Regress Y on Z
model_y = LinearRegression()
model_y.fit(Z, Y)
residuals_y = Y - model_y.predict(Z)
# Correlation of residuals is partial correlation
partial_corr, pvalue = pearsonr(residuals_x, residuals_y)
return partial_corr, pvalue
def _stratified_independence_test(
self,
X: np.ndarray,
Y: np.ndarray,
Z: np.ndarray
) -> Dict[str, Any]:
"""
Test independence of X and Y within each stratum defined by Z.
Appropriate when Z is categorical.
"""
if Z.ndim > 1:
raise ValueError("Stratified test requires Z to be 1-dimensional")
unique_strata = np.unique(Z[~np.isnan(Z)])
stratum_results = []
for stratum in unique_strata:
mask = Z == stratum
X_stratum = X[mask]
Y_stratum = Y[mask]
if len(X_stratum) > 5: # Need minimum sample size
# Compute correlation within stratum
corr, pval = pearsonr(X_stratum, Y_stratum)
stratum_results.append({
'stratum': stratum,
'n': len(X_stratum),
'correlation': corr,
'pvalue': pval,
'independent': pval > 0.05
})
# Overall assessment
n_independent = sum(1 for r in stratum_results if r['independent'])
return {
'stratum_results': stratum_results,
'n_strata': len(stratum_results),
'n_strata_independent': n_independent,
'conditionally_independent': n_independent == len(stratum_results),
'interpretation': (
f"X and Y are {'conditionally independent' if n_independent == len(stratum_results) else 'conditionally dependent'} "
f"given Z. Independence holds in {n_independent}/{len(stratum_results)} strata."
)
}
def detect_confounding_structure(
self,
data: pd.DataFrame,
outcome_var: str,
predictor_vars: List[str],
potential_confounders: List[str],
demographic_var: Optional[str] = None
) -> Dict[str, Any]:
"""
Detect potential confounding structure in predictive relationships.
Tests whether relationships between predictors and outcome change
substantially when controlling for potential confounders, which
suggests confounding.
Args:
data: DataFrame with all variables
outcome_var: Name of outcome variable
predictor_vars: Names of predictor variables
potential_confounders: Names of potential confounding variables
demographic_var: Optional demographic variable for stratified analysis
Returns:
Dictionary with confounding analysis results
"""
from sklearn.linear_model import LinearRegression
results = {
'outcome': outcome_var,
'confounding_detected': {},
'demographic_heterogeneity': {}
}
Y = data[outcome_var].values
for predictor in predictor_vars:
X = data[[predictor]].values
# Marginal association (without controlling for confounders)
model_marginal = LinearRegression()
model_marginal.fit(X, Y)
coef_marginal = model_marginal.coef_[0]
# Conditional association (controlling for confounders)
X_with_confounders = data[[predictor] + potential_confounders].values
model_conditional = LinearRegression()
model_conditional.fit(X_with_confounders, Y)
coef_conditional = model_conditional.coef_[0]
# Compute percent change in coefficient
pct_change = 100 * abs(coef_conditional - coef_marginal) / abs(coef_marginal) if coef_marginal != 0 else float('inf')
confounding_detected = pct_change > 10 # >10% change suggests confounding
results['confounding_detected'][predictor] = {
'marginal_coefficient': coef_marginal,
'conditional_coefficient': coef_conditional,
'percent_change': pct_change,
'confounding_present': confounding_detected
}
if confounding_detected:
logger.warning(
f"Confounding detected for {predictor}. Coefficient changed by "
f"{pct_change:.1f}% when controlling for {', '.join(potential_confounders)}."
)
# Check if confounding structure differs by demographic group
if demographic_var and demographic_var in data.columns:
group_confounding = {}
for group in data[demographic_var].unique():
if pd.notna(group):
group_data = data[data[demographic_var] == group]
if len(group_data) > 20: # Need sufficient sample size
Y_group = group_data[outcome_var].values
X_group = group_data[[predictor]].values
X_conf_group = group_data[[predictor] + potential_confounders].values
model_marg_group = LinearRegression()
model_marg_group.fit(X_group, Y_group)
coef_marg_group = model_marg_group.coef_[0]
model_cond_group = LinearRegression()
model_cond_group.fit(X_conf_group, Y_group)
coef_cond_group = model_cond_group.coef_[0]
pct_change_group = 100 * abs(coef_cond_group - coef_marg_group) / abs(coef_marg_group) if coef_marg_group != 0 else 0
group_confounding[str(group)] = {
'percent_change': pct_change_group,
'confounding_present': pct_change_group > 10
}
results['demographic_heterogeneity'][predictor] = group_confounding
# Check for heterogeneity
changes = [v['percent_change'] for v in group_confounding.values()]
if len(changes) > 1 and max(changes) - min(changes) > 20:
logger.warning(
f"Confounding structure for {predictor} differs substantially "
f"across demographic groups. Single model may be inappropriate."
)
return results
This conditional independence framework provides tools for detecting when standard modeling assumptions break down, with particular attention to whether violations differ across demographic groups. When we detect confounding that varies by demographics, it suggests that models built on the pooled data may systematically misrepresent relationships for some populations.
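A small synthetic example illustrates the intended usage. Here a hypothetical access-to-care variable Z drives both X and Y, so the two are marginally correlated but, by construction, conditionally independent given Z:
rng = np.random.default_rng(1)
n = 500
Z = rng.normal(size=n)                       # stand-in for access to care
X = 0.8 * Z + rng.normal(scale=0.5, size=n)  # e.g., a clinical measurement
Y = 0.8 * Z + rng.normal(scale=0.5, size=n)  # e.g., an outcome
cia = ConditionalIndependenceAnalysis()
marginal_corr = np.corrcoef(X, Y)[0, 1]      # substantial, driven entirely by Z
result = cia.test_conditional_independence(X, Y, Z, test_type='partial_correlation')
print(marginal_corr, result['partial_correlation'], result['pvalue'])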
2.4 Statistical Inference with Equity Considerations
Statistical inference provides methods for drawing conclusions about populations from samples, quantifying uncertainty in estimates, and testing hypotheses about relationships in data. However, standard inferential methods make assumptions about random sampling, independent observations, and homogeneous populations that frequently fail in healthcare data. When these assumptions fail differently for different patient populations, standard inference can lead to systematically biased conclusions.
2.4.1 Sampling Bias and Selection Effects
The foundation of statistical inference is the assumption that our sample is representative of the population we want to draw conclusions about. In healthcare research, this assumption is routinely violated because healthcare data is fundamentally observational rather than randomly sampled. Patients appear in datasets because they sought care, had insurance coverage, lived near a healthcare facility, and were documented in an electronic health record system. Each of these selection steps can introduce bias.
The equity implications are severe. Underserved populations face systematic barriers to healthcare access, meaning they are systematically underrepresented in healthcare datasets. When they do appear in datasets, it is often in crisis situations like emergency department visits rather than in longitudinal primary care where comprehensive data accrues. The result is that statistical estimates derived from healthcare data systematically misrepresent the health status and healthcare needs of underserved populations.
Consider estimating the prevalence of uncontrolled hypertension in a population using electronic health record data. The naive estimate would simply compute the proportion of patients with most recent blood pressure measurements above target values. But this estimate is biased because it only includes patients who had blood pressure measured, and blood pressure measurement rates differ systematically across populations based on access to care. Patients with good healthcare access get regular blood pressure monitoring and early treatment, while patients with limited access may only have blood pressure measured when acutely ill or in emergency settings. The observed prevalence of uncontrolled hypertension will therefore be higher in populations with limited access, not necessarily because true hypertension rates are higher but because we are selectively sampling individuals at times when their blood pressure is more likely to be elevated.
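The standard correction is inverse probability weighting. If $\pi_i$ is the probability that individual $i$ enters the observed sample (for example, the probability of having a recorded blood pressure at all), the weighted prevalence estimate is $\hat{p}_{\text{IPW}} = \big(\sum_{i \in S} y_i / \pi_i\big) \big/ \big(\sum_{i \in S} 1 / \pi_i\big)$, where $S$ is the observed sample. This upweights individuals from under-sampled populations; the weighted estimate in the implementation below takes exactly this form, with the caveat that the selection probabilities must themselves be estimated and are a further source of uncertainty.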
Let me implement inference methods that account for selection bias:
class EquityAwareInference:
"""
Statistical inference methods with explicit accounting for selection bias
and differential sampling across populations.
"""
def __init__(self):
"""Initialize inference framework."""
logger.info("Initialized equity-aware statistical inference")
def estimate_prevalence_with_selection(
self,
outcome: np.ndarray,
selected: np.ndarray,
demographic_labels: Optional[np.ndarray] = None,
inverse_probability_weights: Optional[np.ndarray] = None
) -> Dict[str, Any]:
"""
Estimate disease prevalence accounting for selection bias.
Uses inverse probability weighting to adjust for differential selection
probabilities across populations.
Args:
outcome: Binary outcome indicating disease presence (1) or absence (0)
selected: Binary indicator of whether individual was selected into sample
demographic_labels: Optional group labels
inverse_probability_weights: Optional pre-computed weights
If None, assumes equal selection probabilities within groups
Returns:
Dictionary with naive and adjusted prevalence estimates
"""
# Naive estimate using only selected sample
selected_outcome = outcome[selected == 1]
naive_prevalence = np.mean(selected_outcome)
results = {
'naive_prevalence': naive_prevalence,
'n_selected': int(np.sum(selected)),
'n_total': len(outcome)
}
# Adjusted estimate using inverse probability weighting
if inverse_probability_weights is not None:
# Weight each observation by inverse of selection probability
weighted_sum = np.sum(outcome[selected == 1] * inverse_probability_weights[selected == 1])
weight_sum = np.sum(inverse_probability_weights[selected == 1])
adjusted_prevalence = weighted_sum / weight_sum
results['adjusted_prevalence'] = adjusted_prevalence
results['selection_bias'] = adjusted_prevalence - naive_prevalence
# Stratified analysis
if demographic_labels is not None:
unique_groups = np.unique(demographic_labels[~pd.isna(demographic_labels)])
group_estimates = {}
for group in unique_groups:
group_mask = demographic_labels == group
group_selected = selected[group_mask]
group_outcome = outcome[group_mask]
# Selection rate for this group
selection_rate = np.mean(group_selected)
# Naive prevalence in selected sample
if np.sum(group_selected) > 0:
group_selected_outcome = group_outcome[group_selected == 1]
group_naive_prev = np.mean(group_selected_outcome)
group_estimates[str(group)] = {
'selection_rate': selection_rate,
'naive_prevalence': group_naive_prev,
'n_selected': int(np.sum(group_selected))
}
results['group_estimates'] = group_estimates
# Check for differential selection
selection_rates = [v['selection_rate'] for v in group_estimates.values()]
if len(selection_rates) > 1 and max(selection_rates) / min(selection_rates) > 2:
logger.warning(
f"Substantial differential selection across groups. "
f"Selection rates vary from {min(selection_rates):.3f} to {max(selection_rates):.3f}. "
f"Naive estimates may be severely biased."
)
return results
def confidence_interval_stratified(
self,
data: np.ndarray,
demographic_labels: np.ndarray,
confidence_level: float = 0.95
) -> Dict[str, Any]:
"""
Compute confidence intervals for population mean with stratification
by demographic groups.
Provides both overall and group-specific estimates with appropriate
standard errors that account for potential heterogeneity.
Args:
data: Continuous outcome variable
demographic_labels: Group labels for stratification
confidence_level: Confidence level (default 0.95 for 95% CI)
Returns:
Dictionary with confidence intervals overall and by group
"""
from scipy import stats
alpha = 1 - confidence_level
# Overall estimate
data_clean = data[~np.isnan(data)]
n = len(data_clean)
mean = np.mean(data_clean)
se = np.std(data_clean, ddof=1) / np.sqrt(n)
t_crit = stats.t.ppf(1 - alpha/2, df=n-1)
ci = (mean - t_crit * se, mean + t_crit * se)
results = {
'overall': {
'mean': mean,
'standard_error': se,
'confidence_interval': ci,
'n': n
},
'groups': {}
}
# Group-specific estimates
unique_groups = np.unique(demographic_labels[~pd.isna(demographic_labels)])
for group in unique_groups:
group_mask = demographic_labels == group
group_data = data[group_mask]
group_data_clean = group_data[~np.isnan(group_data)]
if len(group_data_clean) > 1:
n_group = len(group_data_clean)
mean_group = np.mean(group_data_clean)
se_group = np.std(group_data_clean, ddof=1) / np.sqrt(n_group)
t_crit_group = stats.t.ppf(1 - alpha/2, df=n_group-1)
ci_group = (mean_group - t_crit_group * se_group,
mean_group + t_crit_group * se_group)
results['groups'][str(group)] = {
'mean': mean_group,
'standard_error': se_group,
'confidence_interval': ci_group,
'n': n_group
}
# Test for heterogeneity across groups
group_means = [v['mean'] for v in results['groups'].values()]
if len(group_means) > 1:
mean_range = max(group_means) - min(group_means)
overall_se = results['overall']['standard_error']
# Simple heterogeneity indicator
heterogeneity_detected = mean_range > 2 * overall_se
results['heterogeneity_analysis'] = {
'mean_range': mean_range,
'range_in_se_units': mean_range / overall_se if overall_se > 0 else float('inf'),
'heterogeneity_detected': heterogeneity_detected
}
if heterogeneity_detected:
logger.warning(
f"Substantial heterogeneity detected across groups. "
f"Overall confidence interval may not represent all populations well."
)
return results
def hypothesis_test_with_multiple_comparisons(
self,
groups: Dict[str, np.ndarray],
test_type: str = 'anova',
adjustment_method: str = 'bonferroni'
) -> Dict[str, Any]:
"""
Conduct hypothesis tests across multiple demographic groups with
appropriate correction for multiple comparisons.
Args:
groups: Dictionary mapping group names to data arrays
test_type: Type of test ('anova', 'kruskal')
adjustment_method: Method for multiple comparison adjustment
('bonferroni', 'holm', 'fdr_bh')
Returns:
Dictionary with test results and adjusted p-values
"""
from scipy.stats import f_oneway, kruskal
from statsmodels.stats.multitest import multipletests
if len(groups) < 2:
raise ValueError("Need at least 2 groups for hypothesis testing")
# Overall test
group_data = [data[~np.isnan(data)] for data in groups.values()]
if test_type == 'anova':
# One-way ANOVA
statistic, pvalue = f_oneway(*group_data)
test_name = "One-way ANOVA"
elif test_type == 'kruskal':
# Kruskal-Wallis test (non-parametric alternative to ANOVA)
statistic, pvalue = kruskal(*group_data)
test_name = "Kruskal-Wallis test"
else:
raise ValueError(f"Unknown test type: {test_type}")
results = {
'overall_test': {
'test_name': test_name,
'statistic': statistic,
'pvalue': pvalue,
'reject_null': pvalue < 0.05
},
'pairwise_comparisons': []
}
# Pairwise comparisons if overall test is significant
if pvalue < 0.05:
group_names = list(groups.keys())
pairwise_pvalues = []
for i, name1 in enumerate(group_names):
for name2 in group_names[i+1:]:
data1 = groups[name1][~np.isnan(groups[name1])]
data2 = groups[name2][~np.isnan(groups[name2])]
# T-test for pairwise comparison
stat, pval = stats.ttest_ind(data1, data2)
results['pairwise_comparisons'].append({
'group1': name1,
'group2': name2,
'statistic': stat,
'pvalue_unadjusted': pval
})
pairwise_pvalues.append(pval)
# Adjust for multiple comparisons
if pairwise_pvalues:
reject, pvals_adjusted, _, _ = multipletests(
pairwise_pvalues,
alpha=0.05,
method=adjustment_method
)
for i, comparison in enumerate(results['pairwise_comparisons']):
comparison['pvalue_adjusted'] = pvals_adjusted[i]
comparison['reject_null_adjusted'] = reject[i]
results['adjustment_method'] = adjustment_method
return results
This inference framework explicitly accounts for selection bias and provides tools for stratified analysis that surface when overall estimates don’t represent all populations well. The multiple comparison adjustments are essential when conducting numerous group comparisons to avoid false discoveries.
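A synthetic sketch of the selection-bias correction, with all rates invented for illustration:
rng = np.random.default_rng(2)
n = 2000
group = np.where(rng.random(n) < 0.5, 'A', 'B')
# True disease risk differs by group...
disease = (rng.random(n) < np.where(group == 'A', 0.10, 0.25)).astype(int)
# ...and so does the probability of appearing in the dataset
p_select = np.where(group == 'A', 0.8, 0.4)
selected = (rng.random(n) < p_select).astype(int)
inference = EquityAwareInference()
est = inference.estimate_prevalence_with_selection(
    outcome=disease,
    selected=selected,
    demographic_labels=group,
    inverse_probability_weights=1.0 / p_select
)
# The naive estimate under-weights the higher-risk, under-sampled group;
# the weighted estimate recovers something closer to the true prevalence
print(est['naive_prevalence'], est['adjusted_prevalence'])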
2.4.2 Bootstrapping and Resampling Methods
Bootstrap resampling provides a powerful approach to statistical inference that makes minimal distributional assumptions. By repeatedly sampling with replacement from the observed data and computing statistics on each bootstrap sample, we can empirically estimate the sampling distribution of any statistic and construct confidence intervals without assuming normality or other specific distributional forms. However, standard bootstrap methods can fail when data has complex structure or when we need to respect correlations and clustering in the data.
In healthcare applications serving diverse populations, we often want to ensure that bootstrap samples maintain appropriate representation of demographic groups and preserve within-group correlation structures. Standard bootstrapping that samples individuals independently may produce bootstrap samples with poor representation of minority populations by chance, leading to unstable estimates of group-specific parameters.
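Throughout the implementation below, confidence intervals use the bootstrap percentile method: given $B$ bootstrap replicates $\hat{\theta}^{*(1)}, \dots, \hat{\theta}^{*(B)}$ of a statistic, the 95% interval runs from the 2.5th to the 97.5th percentile of the replicates. More refined constructions such as the bias-corrected and accelerated (BCa) interval exist, but the percentile method keeps the exposition simple and makes minimal assumptions.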
Let me implement equity-aware bootstrap methods:
class EquityAwareBootstrap:
"""
Bootstrap resampling methods that maintain appropriate representation
of demographic groups and respect data structure.
"""
def __init__(self, n_bootstrap: int = 1000, random_seed: int = 42):
"""
Initialize bootstrap framework.
Args:
n_bootstrap: Number of bootstrap samples to generate
random_seed: Random seed for reproducibility
"""
self.n_bootstrap = n_bootstrap
self.random_seed = random_seed
np.random.seed(random_seed)
logger.info(f"Initialized bootstrap with {n_bootstrap} samples")
def stratified_bootstrap(
self,
data: np.ndarray,
demographic_labels: np.ndarray,
statistic_func: Callable[[np.ndarray], float]
) -> Dict[str, Any]:
"""
Perform stratified bootstrap that maintains demographic group proportions.
Standard bootstrap can under-represent minority groups by chance.
Stratified bootstrap samples within each group to maintain representation.
Args:
data: Data array
demographic_labels: Group labels for stratification
statistic_func: Function that computes statistic from data array
Returns:
Dictionary with bootstrap distribution and confidence intervals
"""
unique_groups = np.unique(demographic_labels[~pd.isna(demographic_labels)])
# Compute observed statistic
data_clean = data[~np.isnan(data)]
observed_statistic = statistic_func(data_clean)
# Bootstrap distribution
bootstrap_statistics = []
for b in range(self.n_bootstrap):
# Stratified sampling
bootstrap_sample = []
for group in unique_groups:
group_mask = demographic_labels == group
group_data = data[group_mask]
group_data_clean = group_data[~np.isnan(group_data)]
if len(group_data_clean) > 0:
# Sample with replacement within group
group_bootstrap = np.random.choice(
group_data_clean,
size=len(group_data_clean),
replace=True
)
bootstrap_sample.extend(group_bootstrap)
# Compute statistic on bootstrap sample
bootstrap_sample = np.array(bootstrap_sample)
bootstrap_stat = statistic_func(bootstrap_sample)
bootstrap_statistics.append(bootstrap_stat)
bootstrap_statistics = np.array(bootstrap_statistics)
# Confidence intervals
ci_lower = np.percentile(bootstrap_statistics, 2.5)
ci_upper = np.percentile(bootstrap_statistics, 97.5)
results = {
'observed_statistic': observed_statistic,
'bootstrap_mean': np.mean(bootstrap_statistics),
'bootstrap_std': np.std(bootstrap_statistics),
'confidence_interval_95': (ci_lower, ci_upper),
'bootstrap_distribution': bootstrap_statistics
}
return results
def bootstrap_group_differences(
self,
data: np.ndarray,
demographic_labels: np.ndarray,
group1: str,
group2: str
) -> Dict[str, Any]:
"""
Bootstrap confidence interval for difference between two groups.
Provides inference about whether observed differences between groups
are statistically significant.
Args:
data: Data array
demographic_labels: Group labels
group1: Name of first group
group2: Name of second group
Returns:
Dictionary with difference estimate and confidence interval
"""
# Extract group data
mask1 = demographic_labels == group1
mask2 = demographic_labels == group2
data1 = data[mask1]
data2 = data[mask2]
data1_clean = data1[~np.isnan(data1)]
data2_clean = data2[~np.isnan(data2)]
# Observed difference
mean1 = np.mean(data1_clean)
mean2 = np.mean(data2_clean)
observed_difference = mean1 - mean2
# Bootstrap distribution of difference
bootstrap_differences = []
for b in range(self.n_bootstrap):
# Bootstrap sample from each group
boot1 = np.random.choice(data1_clean, size=len(data1_clean), replace=True)
boot2 = np.random.choice(data2_clean, size=len(data2_clean), replace=True)
boot_diff = np.mean(boot1) - np.mean(boot2)
bootstrap_differences.append(boot_diff)
bootstrap_differences = np.array(bootstrap_differences)
# Confidence interval for difference
ci_lower = np.percentile(bootstrap_differences, 2.5)
ci_upper = np.percentile(bootstrap_differences, 97.5)
# Approximate two-sided bootstrap p-value: twice the proportion of bootstrap
# differences falling on the opposite side of zero from the observed difference
if observed_difference > 0:
pvalue = 2 * np.mean(bootstrap_differences <= 0)
else:
pvalue = 2 * np.mean(bootstrap_differences >= 0)
pvalue = min(pvalue, 1.0) # Cap at 1.0
results = {
'group1': group1,
'group2': group2,
'mean_group1': mean1,
'mean_group2': mean2,
'observed_difference': observed_difference,
'confidence_interval_95': (ci_lower, ci_upper),
'pvalue': pvalue,
'significant': ci_lower * ci_upper > 0 # CI doesn't contain zero
}
return results
def bootstrap_correlation(
self,
X: np.ndarray,
Y: np.ndarray,
demographic_labels: Optional[np.ndarray] = None
) -> Dict[str, Any]:
"""
Bootstrap confidence interval for correlation coefficient.
With optional stratified analysis to detect whether correlation
differs across demographic groups.
Args:
X: First variable
Y: Second variable
demographic_labels: Optional group labels for stratified analysis
Returns:
Dictionary with correlation estimate and confidence interval
"""
# Remove missing values
mask = ~(np.isnan(X) | np.isnan(Y))
X_clean = X[mask]
Y_clean = Y[mask]
# Observed correlation
observed_corr = np.corrcoef(X_clean, Y_clean)[0, 1]
# Bootstrap distribution
bootstrap_corrs = []
for b in range(self.n_bootstrap):
# Sample pairs jointly to preserve pairing
indices = np.random.choice(len(X_clean), size=len(X_clean), replace=True)
boot_corr = np.corrcoef(X_clean[indices], Y_clean[indices])[0, 1]
bootstrap_corrs.append(boot_corr)
bootstrap_corrs = np.array(bootstrap_corrs)
ci_lower = np.percentile(bootstrap_corrs, 2.5)
ci_upper = np.percentile(bootstrap_corrs, 97.5)
results = {
'observed_correlation': observed_corr,
'confidence_interval_95': (ci_lower, ci_upper),
'bootstrap_std': np.std(bootstrap_corrs)
}
# Stratified analysis
if demographic_labels is not None:
demographic_clean = demographic_labels[mask]
unique_groups = np.unique(demographic_clean[~pd.isna(demographic_clean)])
group_correlations = {}
for group in unique_groups:
group_mask = demographic_clean == group
X_group = X_clean[group_mask]
Y_group = Y_clean[group_mask]
if len(X_group) > 10:
# Bootstrap correlation for this group
group_boot_corrs = []
for b in range(self.n_bootstrap):
indices = np.random.choice(len(X_group), size=len(X_group), replace=True)
boot_corr = np.corrcoef(X_group[indices], Y_group[indices])[0, 1]
group_boot_corrs.append(boot_corr)
group_boot_corrs = np.array(group_boot_corrs)
group_correlations[str(group)] = {
'observed_correlation': np.corrcoef(X_group, Y_group)[0, 1],
'confidence_interval_95': (
np.percentile(group_boot_corrs, 2.5),
np.percentile(group_boot_corrs, 97.5)
),
'n': len(X_group)
}
results['group_correlations'] = group_correlations
# Check for heterogeneity
corrs = [v['observed_correlation'] for v in group_correlations.values()]
if len(corrs) > 1 and max(corrs) - min(corrs) > 0.3:
logger.warning(
f"Correlation varies substantially across groups "
f"(range = {max(corrs) - min(corrs):.3f}). "
f"Overall correlation may not represent all populations."
)
return results
These bootstrap methods maintain demographic representation and provide group-stratified inference that surfaces heterogeneity. The stratified bootstrap ensures that minority populations don’t get under-represented by chance in bootstrap samples, which would lead to unstable estimates of their parameters.
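A short synthetic example, with a deliberately small minority group to show why stratification matters:
rng = np.random.default_rng(3)
data = np.concatenate([rng.normal(120, 15, 900), rng.normal(135, 20, 100)])
labels = np.array(['majority'] * 900 + ['minority'] * 100)
boot = EquityAwareBootstrap(n_bootstrap=2000, random_seed=7)
# Stratified bootstrap CI for the median, preserving the 900/100 split
# in every resample rather than letting it drift by chance
median_ci = boot.stratified_bootstrap(data, labels, statistic_func=np.median)
print(median_ci['observed_statistic'], median_ci['confidence_interval_95'])
# Bootstrap CI and approximate p-value for the between-group mean difference
diff = boot.bootstrap_group_differences(data, labels, 'majority', 'minority')
print(diff['observed_difference'], diff['confidence_interval_95'], diff['pvalue'])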
2.5 Conclusion and Key Takeaways
This chapter has developed mathematical foundations for healthcare AI with sustained attention to how mathematical choices interact with health equity considerations. We’ve seen that mathematics in healthcare is never purely technical but rather embeds assumptions and value judgments that have differential impacts across patient populations. Linear algebra operations like standardization and dimensionality reduction can obscure or reveal health disparities depending on whether they preserve within-group variation structures. Probability distributions reflect not just biological variation but also social processes of healthcare access and measurement. Statistical inference methods make sampling assumptions that are routinely violated in healthcare data in ways that systematically disadvantage underserved populations.
The key insight is that achieving health equity through AI requires not just applying sophisticated mathematical methods but rather maintaining critical awareness of when mathematical assumptions break down and how these failures affect different populations differently. Every implementation in this chapter includes explicit checks for whether mathematical assumptions hold uniformly across demographic groups, sensitivity analyses that quantify uncertainty when assumptions are questionable, and stratified analyses that surface heterogeneity that pooled analyses would obscure.
The chapters that follow build on these mathematical foundations to develop machine learning and deep learning methods for healthcare applications. Throughout, we maintain the principle established here: mathematical sophistication must be paired with critical awareness of social context. The most elegant mathematical formulation is worthless if it produces biased predictions that widen rather than narrow health disparities. Success in healthcare AI requires both technical excellence and sustained commitment to equity.