Introduction#
- Basic Bias: 21 metrics
- SageMaker Clarify notebook
- I calculate bias notebook
Data Bias Detection#
Some questions about Bias
- Does the group representation in the training data reflect the real world?
- Does the model have different accuracy for different groups?
- Is the model performance equal for different groups or facets?
A quoted definition of bias: An imbalance in the training data or the prediction behavior of the model across different groups, such as age or income bracket. Biases can result from the data or algorithm used to train your model. For instance, if an ML model is trained primarily on data from middle-aged individuals, it may be less accurate when making predictions involving younger and older people. Source: Sample Notebook and Blog.
How to detect bias
- Wrangler bias report
- Clarify for pre-training and post-training bias analysis
Let's use SageMaker Clarify to detect bias in data. Download the bank-additional dataset:
wget https://sagemaker-sample-data-us-west-2.s3-us-west-2.amazonaws.com/autopilot/direct_marketing/bank-additional.zip
Then unzip the archive and upload the files to S3:
aws s3 cp . s3://bucket-name/bank-additional/ --recursive
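The unzip step can also be done from Python with the standard library. This is a minimal sketch, assuming the archive was saved as bank-additional.zip in the working directory; the helper name `extract_archive` is hypothetical and not part of any AWS tooling.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path: str, target_dir: str) -> list[str]:
    """Extract every member of zip_path into target_dir and return the member names."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
        return archive.namelist()


if __name__ == "__main__":
    # Assumes the wget command above saved bank-additional.zip here.
    if Path("bank-additional.zip").exists():
        for name in extract_archive("bank-additional.zip", "."):
            print(name)
```

The extracted files can then be uploaded with the aws s3 cp command shown above.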
Create a wrangler flow, import data, and add an analysis to see the Bias report.
- Predicted Y: term deposit
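- Bias analysis on column: marital
The Bias report shows some basic metrics such as Class Imbalance (CI):
$$CI = \frac{n_a - n_d}{n_a + n_d}$$
Here n_a is the number of records in the favored facet a and n_d is the number of records in the disfavored facet d, so CI ranges from -1 to 1.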
From the results, we observe a class imbalance where the married class is 21% more represented than other classes. We also observe that the married class is 2.8% less likely to subscribe to a bank term deposit.
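To make the two metrics concrete, here is a small sketch that computes Class Imbalance and the Difference in Positive Proportions in Labels (DPL) from raw counts. The function names and the counts are hypothetical and are not taken from the bank dataset; DPL is taken here as q_a - q_d, where q is the share of positive labels inside a facet.

```python
def class_imbalance(n_a: int, n_d: int) -> float:
    """CI = (n_a - n_d) / (n_a + n_d), ranging from -1 to 1."""
    return (n_a - n_d) / (n_a + n_d)


def difference_in_positive_proportions(pos_a: int, n_a: int, pos_d: int, n_d: int) -> float:
    """DPL = q_a - q_d, where q is the share of positive labels within a facet."""
    return pos_a / n_a - pos_d / n_d


# Illustrative counts only (assumption), not values from the bank dataset.
n_a, n_d = 600, 400        # records in facet a and facet d
pos_a, pos_d = 90, 40      # positive labels in each facet

print("CI :", class_imbalance(n_a, n_d))
print("DPL:", difference_in_positive_proportions(pos_a, n_a, pos_d, n_d))
```

A CI near 0 means the two facets are about equally represented; a DPL near 0 means both facets receive the positive label at a similar rate.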
SageMaker Clarify - UCI DataSet#
Follow this notebook to understand how to use SageMaker Clarify to detect bias pre-training and post-training.
Let's download another dataset, the UCI Adult dataset. The task is to predict whether income is greater than or at most 50K.
import sagemaker
from sagemaker.s3 import S3Downloader

# session and region are reused by the later steps
session = sagemaker.Session()
region = session.boto_region_name

adult_columns = [
    "Age",
    "Workclass",
    "fnlwgt",
    "Education",
    "Education-Num",
    "Marital Status",
    "Occupation",
    "Relationship",
    "Ethnic group",
    "Sex",
    "Capital Gain",
    "Capital Loss",
    "Hours per week",
    "Country",
    "Target",
]

S3Downloader.download(
    s3_uri=f"s3://sagemaker-example-files-prod-{region}/datasets/tabular/uci_adult/adult.data",
    local_path="./",
    sagemaker_session=session,
)
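Let's analyze and visualize the Sex column: how males and females are represented in the data and in the positive label.
import matplotlib.pyplot as plt
import pandas as pd

# load the training split, treating "?" as missing and dropping incomplete rows
training_data = pd.read_csv(
    "adult.data", names=adult_columns, sep=r"\s*,\s*", engine="python", na_values="?"
).dropna()

# number of records per sex
training_data["Sex"].value_counts().plot.bar()
plt.title("Count of Sex")
plt.show()

# number of records per sex among those earning more than $50K
training_data["Sex"].where(training_data["Target"] == ">50K").value_counts().plot.bar()
plt.title("Counts of Sex earning >$50K")
plt.show()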
SageMaker Clarify - Prepare Data#
Now let's prepare the data for training an XGBoost model. We need to convert the categorical columns into numerical ones using the number_encode_features function below.
import pandas as pd
from sklearn import preprocessing


def number_encode_features(df: pd.DataFrame):
    """convert categorical to numerical"""
    result = df.copy()
    encoders = {}
    for column in result.columns:
        # only string (object) columns need encoding
        if result.dtypes[column] == object:
            encoders[column] = preprocessing.LabelEncoder()
            result[column] = encoders[column].fit_transform(result[column].fillna("None"))
    return result, encoders
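To see what the encoding does to a single column, here is a pure-Python stand-in that mimics the behavior of a label encoder: each distinct value is mapped to its position in the sorted list of distinct values. The helper `label_encode` is hypothetical and is not the scikit-learn implementation.

```python
def label_encode(values):
    """Map each distinct value to its index in the sorted list of distinct values."""
    classes = sorted(set(values))
    index = {value: code for code, value in enumerate(classes)}
    return [index[v] for v in values], classes


codes, classes = label_encode(["Male", "Female", "Female", "Male"])
print(classes)  # ['Female', 'Male']
print(codes)    # [1, 0, 0, 1]
```

Under this ordering Female is encoded as 0 and >50K is encoded as 1, which is consistent with the facet value 0 and label value 1 used in the BiasConfig later.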
Create train and test data
# testing_data is assumed to have been loaded from the test split the same way as training_data

# move the target to the first column
training_data = pd.concat([training_data["Target"], training_data.drop(["Target"], axis=1)], axis=1)

# move the target to the first column
testing_data = pd.concat([testing_data["Target"], testing_data.drop(["Target"], axis=1)], axis=1)

# save train data to csv
training_data, _ = number_encode_features(training_data)
training_data.to_csv("train_data.csv", index=False, header=False)

# save test data to csv
testing_data, _ = number_encode_features(testing_data)
testing_data.to_csv("test_features.csv", index=False, header=False)
SageMaker Clarify - Train Model#
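Let's upload the data to S3 and create a TrainingInput object.
from sagemaker.inputs import TrainingInput
from sagemaker.s3 import S3Uploader

bucket = session.default_bucket()  # assumption: use the session's default bucket
prefix = "sagemaker-clarify"

# upload train data to s3
train_uri = S3Uploader.upload(local_path="./train_data.csv", desired_s3_uri=f"s3://{bucket}/{prefix}")
# wrap the uploaded training data as the "train" channel
train_input = TrainingInput(train_uri, content_type="csv")

# upload test data to s3
test_uri = S3Uploader.upload(local_path="./test_features.csv", desired_s3_uri=f"s3://{bucket}/{prefix}")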
Let's create an XGBoost estimator and train it.
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.image_uris import retrieve

role = sagemaker.get_execution_role()  # assumption: running with a SageMaker execution role

xgboost_image_uri = retrieve("xgboost", region, version="1.5-1")

xgb = Estimator(
    xgboost_image_uri,
    role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    disable_profiler=True,
    sagemaker_session=session,
)

xgb.set_hyperparameters(
    max_depth=5,
    eta=0.2,
    gamma=4,
    min_child_weight=6,
    subsample=0.8,
    objective="binary:logistic",
    num_round=800,
)

xgb.fit({"train": train_input}, logs=False)
After training, we have to create a Model object
from datetime import datetime

model_name = f"demo-clarify-bias-{datetime.now().strftime('%d-%m-%Y-%H-%M-%S')}"
model = xgb.create_model(name=model_name)
model.create()
SageMaker Clarify - Processor#
To get the pre-training and post-training bias report, we need to run a SageMaker Clarify processing job. Create a SageMaker Clarify processor:
from sagemaker import clarify

clarify_processor = clarify.SageMakerClarifyProcessor(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    sagemaker_session=session,
)
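Create a DataConfig, which communicates some basic information about data I/O to SageMaker Clarify.
# assumption: the report is written next to the uploaded data, under clarify-bias
bias_report_output_path = f"s3://{bucket}/{prefix}/clarify-bias"

data_config = clarify.DataConfig(
    s3_data_input_path=train_uri,
    s3_output_path=bias_report_output_path,
    label="Target",
    headers=training_data.columns.to_list(),
    dataset_type="text/csv",
)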
Create a ModelConfig, an object that communicates information about your trained model.
model_config = clarify.ModelConfig(
    model_name=model_name,
    instance_type="ml.m5.xlarge",
    instance_count=1,
    accept_type="text/csv",
    content_type="text/csv",
)
Create a ModelPredictedLabelConfig, which provides information on the format of your predictions.
prediction_config = clarify.ModelPredictedLabelConfig(probability_threshold=0.8)
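Because the model uses the binary:logistic objective, it returns a probability rather than a label, and the threshold decides which predictions count as positive. The sketch below illustrates the idea with a hypothetical helper `to_labels`; whether Clarify compares with a strict or non-strict inequality at exactly the threshold is an assumption here.

```python
def to_labels(probabilities, threshold=0.8):
    """Return 1 where the probability exceeds the threshold, else 0."""
    # Assumption: a strict comparison; the exact boundary handling in Clarify may differ.
    return [1 if p > threshold else 0 for p in probabilities]


# Hypothetical model outputs.
print(to_labels([0.95, 0.81, 0.79, 0.30]))  # [1, 1, 0, 0]
```

A threshold of 0.8 is stricter than the common 0.5, so fewer records are predicted positive, which in turn changes the post-training bias metrics.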
Create a BiasConfig, which contains configuration values for detecting bias using a Clarify container.
bias_config = clarify.BiasConfig(
    label_values_or_threshold=[1],
    facet_name="Sex",
    facet_values_or_threshold=[0],
    group_name="Age",
)
Run the Clarify processor to compute both the pre-training and post-training bias metrics.
clarify_processor.run_bias(
    data_config=data_config,
    bias_config=bias_config,
    model_config=model_config,
    model_predicted_label_config=prediction_config,
    pre_training_methods="all",
    post_training_methods="all",
)
Finally, you can find the bias report in SageMaker Studio Experiments, or read the report in JSON format as below
{
  "name": "CI",
  "description": "Class Imbalance (CI)",
  "value": 0.3513692725946555
},
{
  "name": "DPL",
  "description": "Difference in Positive Proportions in Labels (DPL)",
  "value": 0.20015891077100018
},
{
  "name": "JS",
  "description": "Jensen-Shannon Divergence (JS)",
  "value": 0.03075614465977302
},
{
  "name": "KL",
  "description": "Kullback-Liebler Divergence (KL)",
  "value": 0.14306865156306428
},
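The KL and JS entries compare the label distributions of the two facets. The sketch below shows how such divergences can be computed for a binary label using natural logarithms; the function names and the two distributions are hypothetical and are not the values behind the report above.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) = sum p_i * ln(p_i / q_i); assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """JS(p, q) = 0.5 * KL(p || m) + 0.5 * KL(q || m), with m the average of p and q."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical label distributions [P(label=0), P(label=1)] for facet a and facet d.
p_a = [0.7, 0.3]
p_d = [0.9, 0.1]

print("KL:", kl_divergence(p_a, p_d))
print("JS:", js_divergence(p_a, p_d))
```

KL is asymmetric and unbounded, while JS is symmetric and bounded by ln 2, which is why the JS value in a report is typically much smaller than the KL value.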