AWS GLUE — Introduction

Shikhar Sundriyal
3 min read · Oct 26, 2020

AWS GLUE is a fully managed, serverless ETL service provided by AWS to process your ETL workloads using Apache Spark.

Some of its important features:
1.) Serverless and fully managed
2.) Job bookmarking
3.) Easy scaling of Glue jobs, both horizontally and vertically

What exactly does Serverless mean? Is it really serverless?

To be honest, no. The reason is simple: whenever you submit a job, it requires a server/container on which it will be executed. Without a server, where exactly would the program run?

Let me explain why it is called a serverless service. As a customer, you just need to write your script and run it. That's all.

You need not worry about the underlying infrastructure: how the containers get provisioned, resource availability, the latest patches on the servers, etc. As a customer you won't have access to the underlying servers on which your program runs, and the required infrastructure/containers are spun up on the fly, i.e. the containers are allocated only once you trigger your GLUE job. Since those servers are maintained entirely by AWS, the service is also called fully managed.

Let's get started with the basics of AWS GLUE (a basic pipeline).

Different components and how they fit in:
1.) Source: This is where the raw or unprocessed data is present.
2.) Crawler: The Glue crawler samples the data at the S3 location (approximately the first 15 MB) and identifies the columns along with their datatypes.
3.) GLUE Catalog: The crawler creates a metadata repository in the Glue Catalog, which contains a Glue database, which in turn contains tables.
4.) GLUE ETL Jobs: Once the Glue Catalog is populated, we can run Glue ETL jobs to consume the data using dynamic frames, perform the required business transformations, and write the processed data to a sink, which can be S3, DynamoDB, Redshift, etc.
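The pipeline above can be sketched in plain Python. This is only an illustration of how the pieces fit together, not the actual Glue API; all names (bucket paths, database and table names, the currency conversion) are hypothetical.

```python
# Illustrative sketch in plain Python (NOT the awsglue API):
# crawler samples records and infers a schema, the catalog stores that
# schema as table metadata, and the ETL job reads, transforms, and writes.

def infer_schema(sample_rows):
    """Mimic a crawler: inspect sampled rows and guess each column's type."""
    schema = {}
    for col in sample_rows[0]:
        values = [row[col] for row in sample_rows]
        if all(isinstance(v, int) for v in values):
            schema[col] = "bigint"
        elif all(isinstance(v, (int, float)) for v in values):
            schema[col] = "double"
        else:
            schema[col] = "string"
    return schema

# 1) Source: raw, unprocessed records (an in-memory stand-in for S3 data)
raw = [
    {"id": 1, "price": 9.99, "city": "Delhi"},
    {"id": 2, "price": 4.50, "city": "Pune"},
]

# 2) Crawler: sample the source and infer column names and datatypes
schema = infer_schema(raw)

# 3) Catalog: a database -> table -> metadata repository (hypothetical names)
catalog = {
    "sales_db": {
        "orders": {"location": "s3://my-bucket/raw/", "schema": schema},
    }
}

# 4) ETL job: consume the data, apply a business transformation, write to a sink
transformed = [{**row, "price_inr": round(row["price"] * 83, 2)} for row in raw]
sink = {"s3://my-bucket/processed/": transformed}
```

In a real Glue job the same flow runs on Spark: the job reads a dynamic frame from the catalog table, applies transformations, and writes to the sink.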

For big data workloads, the most important thing is how we can scale the cluster.

In Glue we can scale the Spark cluster both horizontally and vertically:

Scaling the cluster horizontally: This means adding more DPUs/worker nodes to the cluster, which in turn increases the number of executors.

Scaling the cluster vertically: This means increasing the size of each executor. For vertical scaling we have three worker types: Standard, G.1X and G.2X.

We will discuss some other interesting features, such as job bookmarking, how the Glue crawler behaves in different scenarios, and when to scale horizontally versus vertically, in future blogs.


Shikhar Sundriyal is a Senior Data Engineer who loves to explore the latest technologies and help others.