Introduction#

  • Data Wrangler
  • Basic Bias
  • SageMaker Clarify

Wrangler Flow#

SageMaker Data Wrangler

Let's create a simple data flow with Data Wrangler to (a rough pandas equivalent is sketched after this list):

  • check data quality
  • drop some columns
  • fill missing values
  • add processing job
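
Data Wrangler builds these steps visually in the UI, but as a rough point of reference, here is a minimal pandas sketch of the first three steps. The column names patient_notes and lead_II are hypothetical placeholders, not taken from the actual ECG file.

import pandas as pd

# Load the raw ECG export (the same file uploaded to S3 in the next step)
df = pd.read_csv("171A_raw.csv")

# Data-quality check: shape, dtypes, and missing values per column
df.info()
print(df.isna().sum())

# Drop a column that is not needed downstream (hypothetical column name)
df = df.drop(columns=["patient_notes"], errors="ignore")

# Fill missing values in the signal column, e.g. by forward fill (hypothetical column name)
if "lead_II" in df.columns:
    df["lead_II"] = df["lead_II"].ffill()

df.to_csv("171A_clean.csv", index=False)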

Upload the raw ECG data to S3

aws s3 cp 171A_raw.csv s3://bucket-name/ecg/

Import the data into Data Wrangler and add analysis steps. Below is the entire flow

Data Bias Detection#

Some questions to ask about bias

  • Does the group representation in the training data reflect the real world?
  • Does the model have different accuracy for different groups?

Quoting the definition of bias: an imbalance in the training data or the prediction behavior of the model across different groups, such as age or income bracket. Biases can result from the data or algorithm used to train your model. For instance, if an ML model is trained primarily on data from middle-aged individuals, it may be less accurate when making predictions involving younger and older people. See the Sample Notebook and Blog.

How to detect bias

  • Wrangler bias report
  • Clarify for pre-training and post-training bias analysis

Let's use SageMaker Clarify to detect bias in the data. Download the bank-additional dataset

wget https://sagemaker-sample-data-us-west-2.s3-us-west-2.amazonaws.com/autopilot/direct_marketing/bank-additional.zip

Then unzip the archive and upload the extracted files to S3

unzip bank-additional.zip
aws s3 cp . s3://bucket-name/bank-additional/ --recursive

Create a Data Wrangler flow, import the data, and add an analysis to see the Bias report.

  • Predicted Y: term deposit
  • Bias analysis on column: marital

The Bias report shows some basic metrics such as Class Imbalance (CI).

CI = (na - nd) / (na + nd)

where na and nd are the number of examples in the advantaged and disadvantaged facet, respectively.

From the results, we observe a class imbalance where the married class is 21% more represented than other classes. We also observe that the married class is 2.8% less likely to subscribe to a bank term deposit.
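
For intuition, here is a minimal sketch of computing the Class Imbalance metric by hand with pandas, using the marital column from the setup above. The file path and the semicolon delimiter follow the original UCI format and may need adjusting; treating married as the advantaged facet is an assumption for illustration.

import pandas as pd

# Load the extracted dataset (the original UCI file is semicolon-delimited)
df = pd.read_csv("bank-additional/bank-additional-full.csv", sep=";")

# Advantaged facet: "married"; disadvantaged facet: all other marital statuses (assumption)
n_a = (df["marital"] == "married").sum()
n_d = (df["marital"] != "married").sum()

# Class Imbalance as defined above: CI = (na - nd) / (na + nd)
ci = (n_a - n_d) / (n_a + n_d)
print(f"CI (married vs. rest): {ci:.3f}")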

SageMaker Clarify#

Follow this notebook to understand how to use SageMaker Clarify to detect pre-training and post-training bias.

Inspect the male versus female representation in the data

training_data["Sex"].value_counts().sort_values().plot(kind="bar", title="Counts of Sex", rot=0)

Inspect the male versus female representation among examples with the positive label (>50K)

training_data["Sex"].where(training_data["Target"] == ">50K").value_counts().sort_values().plot(
    kind="bar", title="Counts of Sex earning >$50K", rot=0
)

Create a SageMakerClarifyProcessor

from sagemaker import clarify

clarify_processor = clarify.SageMakerClarifyProcessor(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    sagemaker_session=sagemaker_session,
)

Create a DataConfig object, which communicates some basic information about data I/O to SageMaker Clarify.

bias_report_output_path = "s3://{}/{}/clarify-bias".format(bucket, prefix)
bias_data_config = clarify.DataConfig(
    s3_data_input_path=train_uri,
    s3_output_path=bias_report_output_path,
    label="Target",
    headers=training_data.columns.to_list(),
    dataset_type="text/csv",
)

Create a ModelConfig object, which communicates information about your trained model.

model_config = clarify.ModelConfig(
    model_name=model_name,
    instance_type="ml.m5.xlarge",
    instance_count=1,
    accept_type="text/csv",
    content_type="text/csv",
)

Create a ModelPredictedLabelConfig, which provides information on the format of your predictions. Here, predictions with a probability above 0.8 are treated as the positive label.

predictions_config = clarify.ModelPredictedLabelConfig(probability_threshold=0.8)

Create a BiasConfig, which contains configuration values for detecting bias using a Clarify container.

bias_config = clarify.BiasConfig(
    label_values_or_threshold=[1],
    facet_name="Sex",
    facet_values_or_threshold=[0],
    group_name="Age",
)

Run the Clarify processor to compute both pre-training and post-training bias metrics

clarify_processor.run_bias(
    data_config=bias_data_config,
    bias_config=bias_config,
    model_config=model_config,
    model_predicted_label_config=predictions_config,
    pre_training_methods="all",
    post_training_methods="all",
)

Finally, you can find the bias report in SageMaker Studio Experiments.
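
The same report artifacts are also written to the bias_report_output_path configured in the DataConfig above, so they can be pulled down from S3 directly; a minimal sketch using the SageMaker SDK's S3Downloader:

from sagemaker.s3 import S3Downloader

# Download the Clarify output (analysis results and generated report) from the configured S3 path
S3Downloader.download(bias_report_output_path, "clarify-bias-report/")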

Reference#