Introduction#

  • Data Wrangler
  • Basic Bias
  • SageMaker Clarify

Wrangler Flow#

SageMaker Data Wrangler

Let's create a simple data flow with Data Wrangler to (a rough pandas equivalent is sketched after this list):

  • check data quality
  • drop some columns
  • fill missing values
  • add processing job
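
Data Wrangler builds these steps visually in the UI, but as a rough point of reference, here is a minimal pandas sketch of the first three steps. The column names patient_notes and lead_II are hypothetical placeholders, not taken from the actual ECG file.

import pandas as pd

# Load the raw ECG export (the same file uploaded to S3 in the next step)
df = pd.read_csv("171A_raw.csv")

# Data-quality check: shape, dtypes, and missing values per column
df.info()
print(df.isna().sum())

# Drop a column that is not needed downstream (hypothetical column name)
df = df.drop(columns=["patient_notes"], errors="ignore")

# Fill missing values in the signal column, e.g. by forward fill (hypothetical column name)
if "lead_II" in df.columns:
    df["lead_II"] = df["lead_II"].ffill()

df.to_csv("171A_clean.csv", index=False)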

Upload the raw ECG data to S3

aws s3 cp 171A_raw.csv s3://bucket-name/ecg/

Import the data into Data Wrangler and add analysis steps. Below is the entire flow

Data Bias Detection#

Some questions to ask about bias

  • Does the group representation in the training data reflect the real world?
  • Does the model have different accuracy for different groups?

Quoting the definition of bias: an imbalance in the training data or the prediction behavior of the model across different groups, such as age or income bracket. Biases can result from the data or algorithm used to train your model. For instance, if an ML model is trained primarily on data from middle-aged individuals, it may be less accurate when making predictions involving younger and older people. See the Sample Notebook and Blog.

How to detect bias

  • Wrangler bias report
  • Clarify for pre-training and post-training bias analysis

Let's use SageMaker Clarify to detect bias in the data. Download the bank-additional dataset

wget https://sagemaker-sample-data-us-west-2.s3-us-west-2.amazonaws.com/autopilot/direct_marketing/bank-additional.zip

Then unzip the archive and upload the extracted files to S3

unzip bank-additional.zip
aws s3 cp . s3://bucket-name/bank-additional/ --recursive

Create a Data Wrangler flow, import the data, and add an analysis to see the Bias report.

  • Predicted Y: term deposit
  • Bias analysis on column: marital

The Bias report shows some basic metrics such as Class Imbalance (CI).

CI = (na - nd) / (na + nd)

where na and nd are the number of examples in the advantaged and disadvantaged facet, respectively.

From the results, we observe a class imbalance where the married class is 21% more represented than other classes. We also observe that the married class is 2.8% less likely to subscribe to a bank term deposit.
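
For intuition, here is a minimal sketch of computing the Class Imbalance metric by hand with pandas, using the marital column from the setup above. The file path and the semicolon delimiter follow the original UCI format and may need adjusting; treating married as the advantaged facet is an assumption for illustration.

import pandas as pd

# Load the extracted dataset (the original UCI file is semicolon-delimited)
df = pd.read_csv("bank-additional/bank-additional-full.csv", sep=";")

# Advantaged facet: "married"; disadvantaged facet: all other marital statuses (assumption)
n_a = (df["marital"] == "married").sum()
n_d = (df["marital"] != "married").sum()

# Class Imbalance as defined above: CI = (na - nd) / (na + nd)
ci = (n_a - n_d) / (n_a + n_d)
print(f"CI (married vs. rest): {ci:.3f}")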

SageMaker Clarify#

Follow this notebook to understand how to use SageMaker Clarify to detect pre-training and post-training bias.

Inspect the male versus female representation in the data

training_data["Sex"].value_counts().sort_values().plot(kind="bar", title="Counts of Sex", rot=0)

Inspect the male versus female representation among examples with the positive label (>50K)

training_data["Sex"].where(training_data["Target"] == ">50K").value_counts().sort_values().plot(
    kind="bar", title="Counts of Sex earning >$50K", rot=0
)

Create a SageMakerClarifyProcessor

from sagemaker import clarify

clarify_processor = clarify.SageMakerClarifyProcessor(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    sagemaker_session=sagemaker_session,
)

Create a DataConfig object, which communicates some basic information about data I/O to SageMaker Clarify.

bias_report_output_path = "s3://{}/{}/clarify-bias".format(bucket, prefix)
bias_data_config = clarify.DataConfig(
    s3_data_input_path=train_uri,
    s3_output_path=bias_report_output_path,
    label="Target",
    headers=training_data.columns.to_list(),
    dataset_type="text/csv",
)

Create a ModelConfig object, which communicates information about your trained model.

model_config = clarify.ModelConfig(
    model_name=model_name,
    instance_type="ml.m5.xlarge",
    instance_count=1,
    accept_type="text/csv",
    content_type="text/csv",
)

Create a ModelPredictedLabelConfig, which provides information on the format of your predictions. Here, predictions with a probability above 0.8 are treated as the positive label.

predictions_config = clarify.ModelPredictedLabelConfig(probability_threshold=0.8)

Create a BiasConfig, which contains configuration values for detecting bias using a Clarify container.

bias_config = clarify.BiasConfig(
    label_values_or_threshold=[1],
    facet_name="Sex",
    facet_values_or_threshold=[0],
    group_name="Age",
)

Run the Clarify processor to compute both pre-training and post-training bias metrics

clarify_processor.run_bias(
    data_config=bias_data_config,
    bias_config=bias_config,
    model_config=model_config,
    model_predicted_label_config=predictions_config,
    pre_training_methods="all",
    post_training_methods="all",
)

Finally, you can find the bias report in SageMaker Studio Experiments.
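
The same report artifacts are also written to the bias_report_output_path configured in the DataConfig above, so they can be pulled down from S3 directly; a minimal sketch using the SageMaker SDK's S3Downloader:

from sagemaker.s3 import S3Downloader

# Download the Clarify output (analysis results and generated report) from the configured S3 path
S3Downloader.download(bias_report_output_path, "clarify-bias-report/")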

Reference#