Introduction#

This note shows how to

  • Set up Glue kernels for local development
  • Set up a role for the Glue notebook
  • Read data from S3 into a DataFrame
  • Write data to S3 and the Glue catalog

Glue PySpark Kernel#

  • Create a virtual environment with Python
  • Install the Glue Spark kernel
  • Set up the glue_role_arn

It is good to set up and activate a virtual environment first

python3 -m venv env
source env/bin/activate

Then, install the Glue Spark kernel using pip

python -m pip install --upgrade jupyter boto3 aws-glue-sessions

Check the kernel list

jupyter-kernelspec list

Install the Glue PySpark kernel

cd env/lib/python3.8/site-packages/aws_glue_interactive_sessions_kernel
jupyter-kernelspec install glue_pyspark

Install the Glue Spark (Scala) kernel

jupyter-kernelspec install glue_spark

If VS Code does not see the Glue PySpark kernel, we need to update the kernel.json at

/home/ubuntu/.local/share/jupyter/kernels/glue_pyspark/kernel.json

and update kernel.json with the full path to the Python interpreter

{
  "argv": [
    "/usr/bin/python3",
    "-m",
    "aws_glue_interactive_sessions_kernel.glue_pyspark.GlueKernel",
    "-f",
    "{connection_file}"
  ],
  "display_name": "Glue PySpark",
  "language": "python"
}

Glue Notebook Role#

A Glue ETL job or notebook usually needs to

  • Read data from S3
  • Run ETL transformations and write the transformed data to S3
  • Create tables in the Glue catalog

To enable local VS Code or a notebook to run a Glue session, we need two roles

  • The glue_role_arn which Glue assumes when running the ETL job
  • An IAM role or IAM user for the CLI as usual

Then there are two options to grant permissions to the Glue role for the notebook

  • Use IAM
  • Use Lake Formation (a boto3 sketch follows the IAM example below)

For the first option, set up an IAM role for the Glue notebook, for example with CDK

const role = new aws_iam.Role(this, "RoleForGlueNotebook", {
  roleName: "RoleForGlueNotebook",
  assumedBy: new aws_iam.ServicePrincipal("glue.amazonaws.com"),
});

role.addManagedPolicy(
  aws_iam.ManagedPolicy.fromAwsManagedPolicyName(
    "service-role/AWSGlueServiceRole"
  )
);

const policy = new aws_iam.Policy(
  this,
  "LeastPriviledgePolicyForGlueNotebookRole",
  {
    policyName: "LeastPriviledgePolicyForGlueNotebookRole",
    statements: [
      // pass iam role
      new aws_iam.PolicyStatement({
        actions: ["iam:PassRole", "iam:GetRole"],
        effect: Effect.ALLOW,
        resources: ["*"],
      }),
      // athena
      new aws_iam.PolicyStatement({
        actions: ["athena:*"],
        effect: Effect.ALLOW,
        resources: ["*"],
      }),
      // access s3
      new aws_iam.PolicyStatement({
        actions: ["s3:*"],
        effect: Effect.ALLOW,
        resources: [
          props.athenaResultBucketArn,
          `${props.athenaResultBucketArn}/*`,
          props.sourceBucketArn,
          `${props.sourceBucketArn}/*`,
        ],
      }),
      // access glue catalog
      new aws_iam.PolicyStatement({
        actions: ["glue:*"],
        effect: Effect.ALLOW,
        resources: [
          `arn:aws:glue:${this.region}:*:table/${props.databaseName}/*`,
          `arn:aws:glue:${this.region}:*:database/${props.databaseName}`,
          `arn:aws:glue:${this.region}:*:catalog`,
        ],
      }),
    ],
  }
);

policy.attachToRole(role);
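
The second option is to grant the permissions through Lake Formation instead of attaching the Glue and S3 statements above. Below is a minimal boto3 sketch, assuming Lake Formation already governs the default database; the account id in the role ARN is a placeholder and the exact permissions depend on what the notebook needs.

import boto3

# placeholder account id; use the ARN of RoleForGlueNotebook created above
ROLE_ARN = "arn:aws:iam::123456789012:role/RoleForGlueNotebook"

lakeformation = boto3.client("lakeformation")

# let the Glue notebook role describe the database and create tables in it
lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": ROLE_ARN},
    Resource={"Database": {"Name": "default"}},
    Permissions=["DESCRIBE", "CREATE_TABLE"],
)

# let the Glue notebook role query and alter all tables in the database
lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": ROLE_ARN},
    Resource={"Table": {"DatabaseName": "default", "TableWildcard": {}}},
    Permissions=["SELECT", "DESCRIBE", "ALTER"],
)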

Magic Cells#

We use magic cells to set the idle timeout, Glue version, number of workers, etc. For example

%additional_python_modules matplotlib, numpy, pandas
%idle_timeout 15
%glue_version 3.0
%number_of_workers 5
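
If the glue_role_arn is not configured in the AWS profile, the role for the session can also be set with a magic. As far as I know the interactive sessions kernel supports %iam_role, %profile and %region; the account id and region below are placeholders.

%iam_role arn:aws:iam::123456789012:role/RoleForGlueNotebook
%region us-east-1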

Please do not forget to stop the interactive session to save cost, otherwise it will stay alive for up to 48 hours (the default idle timeout)

%stop_session

Read Parquet Dataset#

  • Import Glue and Spark modules
  • Create the Glue and Spark contexts
  • Read parquet and TSV data from S3 into a DataFrame

from awsglue.context import GlueContext
from pyspark.context import SparkContext
# create spark and glue context
sparkContext = SparkContext.getOrCreate()
glueContext = GlueContext(sparkContext)
# create spark session
spark = glueContext.spark_session
# create spark session another way
from pyspark.sql import SparkSession
spark2 = SparkSession.builder.appName("Test").enableHiveSupport().getOrCreate()
# read parquet data from s3
df = spark.read.parquet("s3://amazon-reviews-pds/parquet/product_category=Apparel/part-00000-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet")
# print DF to check
df.show(10)
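
The same reviews dataset is also published as TSV. A minimal sketch for reading it with plain Spark, assuming the tsv prefix of the same public bucket:

# read the tab-separated variant of the dataset with plain Spark
df_tsv = (
    spark.read
    .option("sep", "\t")
    .option("header", True)
    .csv("s3://amazon-reviews-pds/tsv/")
)
df_tsv.show(10)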

Another way to read the data from S3 is into a Glue DynamicFrame

df = glueContext.create_dynamic_frame.from_options(
    format_options={
        "quoteChar": '"',
        "withHeader": True,
        "separator": "\t",
    },
    connection_type="s3",
    format="csv",
    connection_options={
        "path": "s3://{0}/amazon-review-tsv-parquet/".format(data_lake_bucket),
        "recurse": True,
    },
    transformation_ctx="S3bucket_node1",
)
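
The object returned here is a Glue DynamicFrame, not a Spark DataFrame. It can be converted to a DataFrame for normal Spark operations and back again before writing with Glue sinks; the name converted_dyf below is just for illustration.

from awsglue.dynamicframe import DynamicFrame

# DynamicFrame -> Spark DataFrame
spark_df = df.toDF()
spark_df.printSchema()

# Spark DataFrame -> DynamicFrame (needed for Glue sinks such as writeFrame)
converted_dyf = DynamicFrame.fromDF(spark_df, glueContext, "converted_dyf")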

Similarly, the DynamicFrame can be written back to S3 in Parquet format while updating the Glue catalog

glueContext.write_dynamic_frame.from_options(
    frame=df,
    connection_type="s3",
    format="glueparquet",
    connection_options={
        "path": "s3://{0}/amazon-review-tsv-parquet/".format(data_lake_bucket),
        "partitionKeys": [],
        "enableUpdateCatalog": True,
        "database": "default",
        "table": "amazon_review_tsv_parquet",
    },
    format_options={"compression": "uncompressed"},
    transformation_ctx="S3bucket_node2",
)
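
If only the Parquet files on S3 are needed and the Glue catalog does not have to be updated, a plain Spark write is enough. A small sketch, using a hypothetical output prefix:

# write Parquet files only, without creating or updating a Glue catalog table
df.toDF().write.mode("overwrite").parquet(
    "s3://{0}/amazon-review-parquet-spark/".format(data_lake_bucket)
)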

Write DataFrame#

The getSink method below does two tasks

  • Create a new table in the Glue catalog
  • Write data to S3
  • Glue role permissions can be granted via Lake Formation or pure IAM
  • Converting amazon-reviews-pds/tsv to parquet takes about 10 minutes

sink = glueContext.getSink(
    path="s3://{0}/amazon-review-tsv-parquet/".format(data_lake_bucket),
    connection_type="s3",
    updateBehavior="UPDATE_IN_DATABASE",
    partitionKeys=[],
    # compression="snappy",
    enableUpdateCatalog=True,
    transformation_ctx="write_sink",
)
sink.setCatalogInfo(
    catalogDatabase="default",
    catalogTableName="amazon_review_tsv_parquet"
)
sink.setFormat("glueparquet")
# writeFrame expects a DynamicFrame, not a Spark DataFrame
sink.writeFrame(df)
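
If the table should be partitioned, the same sink accepts partition keys. A sketch assuming the product_category column of the reviews dataset; the output prefix and table name are placeholders.

# hypothetical partitioned variant of the sink above
partitioned_sink = glueContext.getSink(
    path="s3://{0}/amazon-review-parquet-partitioned/".format(data_lake_bucket),
    connection_type="s3",
    updateBehavior="UPDATE_IN_DATABASE",
    partitionKeys=["product_category"],
    enableUpdateCatalog=True,
    transformation_ctx="write_sink_partitioned",
)
partitioned_sink.setCatalogInfo(
    catalogDatabase="default",
    catalogTableName="amazon_review_parquet_partitioned"
)
partitioned_sink.setFormat("glueparquet")
partitioned_sink.writeFrame(df)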

Then, it is easy to read data from a catalog table into a DynamicFrame

  • Check that the Glue role has permission to access the Glue catalog table
  • If using Lake Formation, check that permissions were granted to the Glue role

df2 = glueContext.create_dynamic_frame.from_catalog(
    database="default",
    table_name="amazon_reviews_parquet"
)

Check the output

df2.show(10)
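
To run SQL over the result, convert the DynamicFrame to a Spark DataFrame and register a temporary view. A short sketch, assuming the star_rating column of the reviews dataset; the view name reviews is arbitrary.

# query the data with Spark SQL
reviews_df = df2.toDF()
reviews_df.createOrReplaceTempView("reviews")
spark.sql("SELECT star_rating, COUNT(*) AS n FROM reviews GROUP BY star_rating").show()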
