## Introduction
The source code is on GitHub. This note shows how to:
- Set up the Glue kernels for local development
- Set up a role for the Glue notebook
- Read data from S3 into a DataFrame
- Write data to S3 and the Glue catalog
## Glue PySpark Kernel
- Create a Python virtual environment
- Install the Glue Spark kernels
- Set up the glue_role_arn (see the ~/.aws/config sketch at the end of this section)
It is good practice to set up a virtual environment
python3 -m venv env
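Activate the environment so that the packages and kernels below are installed into it (assuming the env directory created above)

source env/bin/activate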
Then, install the Glue Spark kernel using pip
python -m pip install --upgrade jupyter boto3 aws-glue-sessions
Check kernel list
jupyter-kernelspec list
Install Glue PySpark kernel
cd env/lib/python3.8/site-packages/aws_glue_interactive_sessions_kernel
jupyter-kernelspec install glue_pyspark
Install the Glue Spark (Scala) kernel
jupyter-kernelspec install glue_spark
If VS Code does not see the Glue PySpark kernel, we need to update the kernel spec at
/home/ubuntu/.local/share/jupyter/kernels/glue_pyspark/kernel.json
so that kernel.json contains the full path to the Python interpreter:
{"argv": ["/usr/bin/python3","-m","aws_glue_interactive_sessions_kernel.glue_pyspark.GlueKernel","-f","{connection_file}"],"display_name": "Glue PySpark","language": "python"}
## Glue Notebook Role
A Glue ETL job or notebook usually needs to
- Read data from S3
- Do the ETL work and write transformed data to S3
- Create tables in Glue catalog
To let local VS Code or a notebook run a Glue session, we need two roles
- glue_role_arn, which Glue assumes when running the ETL job
- an IAM role or IAM user for the CLI, as normal
Then there are two options to grant permissions to the Glue notebook role
- Use IAM
- Use LakeFormation (a CLI sketch follows the CDK example below)
For the first option, set up the IAM role for the Glue notebook with CDK (TypeScript)
```ts
const role = new aws_iam.Role(this, "RoleForGlueNotebook", {
  roleName: "RoleForGlueNotebook",
  assumedBy: new aws_iam.ServicePrincipal("glue.amazonaws.com"),
});

role.addManagedPolicy(
  aws_iam.ManagedPolicy.fromAwsManagedPolicyName("service-role/AWSGlueServiceRole")
);

const policy = new aws_iam.Policy(
  this,
  "LeastPriviledgePolicyForGlueNotebookRole",
  {
    policyName: "LeastPriviledgePolicyForGlueNotebookRole",
    statements: [
      // pass iam role
      new aws_iam.PolicyStatement({
        actions: ["iam:PassRole", "iam:GetRole"],
        effect: Effect.ALLOW,
        resources: ["*"],
      }),
      // athena
      new aws_iam.PolicyStatement({
        actions: ["athena:*"],
        effect: Effect.ALLOW,
        resources: ["*"],
      }),
      // access s3
      new aws_iam.PolicyStatement({
        actions: ["s3:*"],
        effect: Effect.ALLOW,
        resources: [
          props.athenaResultBucketArn,
          `${props.athenaResultBucketArn}/*`,
          props.sourceBucketArn,
          `${props.sourceBucketArn}/*`,
        ],
      }),
      // access glue catalog
      new aws_iam.PolicyStatement({
        actions: ["glue:*"],
        effect: Effect.ALLOW,
        resources: [
          `arn:aws:glue:${this.region}:*:table/${props.databaseName}/*`,
          `arn:aws:glue:${this.region}:*:database/${props.databaseName}`,
          `arn:aws:glue:${this.region}:*:catalog`,
        ],
      }),
    ],
  }
);

policy.attachToRole(role);
```
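For the second option, LakeFormation permissions can be granted to the same role. A minimal sketch using the AWS CLI, assuming the default database and a placeholder account ID; in practice you would also grant table-level permissions:

```bash
aws lakeformation grant-permissions \
  --principal DataLakePrincipalIdentifier=arn:aws:iam::111111111111:role/RoleForGlueNotebook \
  --permissions "CREATE_TABLE" "DESCRIBE" \
  --resource '{"Database": {"Name": "default"}}'
```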
## Magic Cells
We use magic cells to set the idle timeout, Glue version, number of workers, etc. For example
```
%additional_python_modules matplotlib, numpy, pandas
%idle_timeout 15
%glue_version 3.0
%number_of_workers 5
```
Please do not forget to stop the interactive session to save cost, otherwise it will stay alive for up to 48 hours (the default idle timeout)
%stop_session
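Other session-management magics from the aws-glue-sessions kernel are handy for checking what is running before stopping, for example

```
%session_id
%status
%list_sessions
```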
## Read Parquet Dataset
- Import the Glue and Spark modules
- Create the Glue and Spark contexts
- Read Parquet and TSV data from S3 into a DataFrame
```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

# create spark and glue context
sparkContext = SparkContext.getOrCreate()
glueContext = GlueContext(sparkContext)

# create spark session
spark = glueContext.spark_session

# create spark session another way
from pyspark.sql import SparkSession

spark2 = SparkSession.builder.appName("Test").enableHiveSupport().getOrCreate()

# read parquet data from s3
df = spark.read.parquet(
    "s3://amazon-reviews-pds/parquet/product_category=Apparel/part-00000-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet"
)

# print DF to check
df.show(10)
```
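A few standard Spark DataFrame calls (nothing Glue-specific) help sanity-check the result; the column names below assume the usual Amazon reviews schema:

```python
# inspect the inferred schema and row count
df.printSchema()
print(df.count())

# preview a few columns (column names assumed from the reviews dataset)
df.select("customer_id", "star_rating", "review_date").show(5)
```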
Another way to read data from S3 is to load it into a Glue DynamicFrame, here as tab-separated CSV
```python
df = glueContext.create_dynamic_frame.from_options(
    format_options={
        "quoteChar": '"',
        "withHeader": True,
        "separator": "\t",
    },
    connection_type="s3",
    format="csv",
    connection_options={
        "path": "s3://{0}/amazon-review-tsv-parquet/".format(data_lake_bucket),
        "recurse": True,
    },
    transformation_ctx="S3bucket_node1",
)
```
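Note that create_dynamic_frame returns a Glue DynamicFrame rather than a Spark DataFrame; the two can be converted back and forth when mixing Spark SQL with the Glue writers. A short sketch:

```python
from awsglue.dynamicframe import DynamicFrame

# DynamicFrame -> Spark DataFrame
spark_df = df.toDF()
spark_df.show(5)

# Spark DataFrame -> DynamicFrame (the last argument is just a name for the frame)
dyf = DynamicFrame.fromDF(spark_df, glueContext, "amazon_review_dyf")
```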
The DynamicFrame can also be written back to S3 in Parquet format while registering the table in the Glue catalog

```python
glueContext.write_dynamic_frame.from_options(
    frame=df,
    connection_type="s3",
    format="glueparquet",
    connection_options={
        "path": "s3://{0}/amazon-review-tsv-parquet/".format(data_lake_bucket),
        "partitionKeys": [],
        "enableUpdateCatalog": True,
        "database": "default",
        "table": "amazon_review_tsv_parquet",
    },
    format_options={"compression": "uncompressed"},
    transformation_ctx="S3bucket_node2",
)
```
## Write DataFrame
The method below does two tasks
- Create a new table in the Glue catalog
- Write data to S3

A couple of notes
- The Glue role permissions can be granted via LakeFormation or pure IAM
- It takes about 10 minutes to convert amazon-reviews-pds/tsv to Parquet
```python
sink = glueContext.getSink(
    path="s3://{0}/amazon-review-tsv-parquet/".format(data_lake_bucket),
    connection_type="s3",
    updateBehavior="UPDATE_IN_DATABASE",
    partitionKeys=[],
    # compression="snappy",
    enableUpdateCatalog=True,
    transformation_ctx="write_sink",
)
sink.setCatalogInfo(
    catalogDatabase="default", catalogTableName="amazon_review_tsv_parquet"
)
sink.setFormat("glueparquet")
sink.writeFrame(df)
```
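If the session uses the Glue Data Catalog as the Spark SQL metastore (the enableHiveSupport session above assumes this), the new table can also be queried with Spark SQL once the write finishes; a quick check, assuming the default database and the table name from above:

```python
# count rows in the newly created catalog table
spark.sql("SELECT COUNT(*) AS n FROM default.amazon_review_tsv_parquet").show()
```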
Then, it is easy to read data from a catalog table into a DynamicFrame
- Check the Glue role permission to access the Glue catalog table
- Check that LakeFormation has granted permissions to the Glue role
```python
df2 = glueContext.create_dynamic_frame.from_catalog(
    database="default", table_name="amazon_reviews_parquet"
)
```

Check the output
df2.show(10)