In Amazon SageMaker AI, identifying bias in machine learning datasets before model training is a critical step toward fair and reliable predictions. This process is called pre-training bias analysis. It examines whether the training data itself introduces bias, for example through imbalanced class labels or uneven representation of sensitive attributes.
The Difference in Proportions of Labels (DPL) is a pre-training bias metric that detects imbalance in how labels are distributed across groups. DPL compares the proportion of a specific label (such as a positive outcome) between two groups, or facets, within a dataset: it is the proportion of positive labels in one facet minus the proportion in the other. A value near zero means both groups receive the positive label at similar rates. The further the value moves from zero (toward -1 or +1 for binary labels), the more strongly one group is favored. AWS documentation lists DPL among the pre-training bias metrics that SageMaker Clarify computes to detect label imbalance before model training.
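The calculation described above can be sketched in a few lines of plain Python. This is only an illustration of the arithmetic, not the SageMaker Clarify implementation. The toy dataset, the column names `group` and `approved`, and the helper names `positive_proportion` and `dpl` are all assumptions made for the example.

```python
# Hypothetical toy dataset: each row has a group ("facet") and a binary label.
# Column names and values are illustrative only.
rows = (
    [{"group": "a", "approved": 1}] * 6
    + [{"group": "a", "approved": 0}] * 2
    + [{"group": "d", "approved": 1}] * 1
    + [{"group": "d", "approved": 0}] * 3
)


def positive_proportion(rows, facet_key, facet_value, label_key, positive_label=1):
    """Share of rows in one facet that carry the positive label."""
    labels = [r[label_key] for r in rows if r[facet_key] == facet_value]
    if not labels:
        raise ValueError(f"no rows found for facet value {facet_value!r}")
    return sum(1 for y in labels if y == positive_label) / len(labels)


def dpl(rows, facet_key, facet_a, facet_d, label_key, positive_label=1):
    """DPL = q_a - q_d, the gap in positive-label proportions between two facets."""
    q_a = positive_proportion(rows, facet_key, facet_a, label_key, positive_label)
    q_d = positive_proportion(rows, facet_key, facet_d, label_key, positive_label)
    return q_a - q_d


dpl_value = dpl(rows, "group", "a", "d", "approved")
print(f"DPL = {dpl_value:.2f}")  # 6/8 - 1/4 = 0.50
```

Here group `a` receives the positive label 75% of the time and group `d` only 25% of the time, so the DPL of 0.50 is far from zero and signals a skewed label distribution.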
By contrast, Mean Squared Error (MSE) is a regression evaluation metric used after model training to measure prediction error, not dataset bias. Silhouette score is an unsupervised learning metric used to evaluate clustering quality, making it irrelevant for supervised classification bias detection. Structural Similarity Index Measure (SSIM) is an image-quality metric used in computer vision tasks and has no application in dataset bias analysis.
Using DPL allows ML engineers to detect skewed label distributions before training begins and to address them, for example by re-sampling, re-weighting, or collecting additional data. This aligns with AWS best practices for responsible AI and helps reduce the risk of biased predictions that could negatively impact real-world decision-making.
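As one possible illustration of the re-weighting option, the sketch below assigns each row a weight equal to the expected frequency of its (group, label) pair divided by its observed frequency, a scheme often called reweighing. It is a swapped-in example technique, not something the source or AWS prescribes. The toy data, the column names, and the helper names `reweighing_weights` and `weighted_positive_proportion` are hypothetical.

```python
from collections import Counter

# Hypothetical toy dataset; column names and values are illustrative only.
rows = (
    [{"group": "a", "approved": 1}] * 6
    + [{"group": "a", "approved": 0}] * 2
    + [{"group": "d", "approved": 1}] * 1
    + [{"group": "d", "approved": 0}] * 3
)


def reweighing_weights(rows, facet_key, label_key):
    """Weight each row by P(facet) * P(label) / P(facet, label)."""
    n = len(rows)
    facet_counts = Counter(r[facet_key] for r in rows)
    label_counts = Counter(r[label_key] for r in rows)
    joint_counts = Counter((r[facet_key], r[label_key]) for r in rows)
    return [
        facet_counts[r[facet_key]] * label_counts[r[label_key]]
        / (n * joint_counts[(r[facet_key], r[label_key])])
        for r in rows
    ]


def weighted_positive_proportion(rows, weights, facet_key, facet_value,
                                 label_key, positive_label=1):
    """Weighted share of positive labels inside one facet."""
    in_facet = [(r, w) for r, w in zip(rows, weights) if r[facet_key] == facet_value]
    total = sum(w for _, w in in_facet)
    positive = sum(w for r, w in in_facet if r[label_key] == positive_label)
    return positive / total


weights = reweighing_weights(rows, "group", "approved")
weighted_dpl = (
    weighted_positive_proportion(rows, weights, "group", "a", "approved")
    - weighted_positive_proportion(rows, weights, "group", "d", "approved")
)
# With these weights both groups have the same weighted positive rate,
# so the weighted DPL is zero up to floating-point rounding.
print(abs(weighted_dpl) < 1e-9)
```

In practice such weights would be supplied as per-row sample weights to a training algorithm that accepts them. Whether that is appropriate depends on the algorithm and the use case.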
Therefore, Difference in Proportions of Labels (DPL) is the correct and AWS-recommended metric for confirming class imbalance during pre-training bias analysis in Amazon SageMaker AI.