AWS Glue ETL Job Example: A Step-by-Step Guide
AWS Glue is a fully managed ETL (Extract, Transform, Load) service that simplifies data integration and preparation for analytics and machine learning. With AWS Glue, you can clean, enrich, and transform data from various sources to make it analytics-ready. In this article, we’ll walk through an example of setting up and running an AWS Glue ETL job to process data from an S3 bucket and load the results into another S3 location.
What Is AWS Glue ETL?
AWS Glue ETL allows you to:
- Extract data from various sources (e.g., S3, databases).
- Transform data using Apache Spark-based jobs.
- Load data into your destination, such as S3, Amazon Redshift, or other databases.
The Glue service provides a serverless environment, eliminating the need to manage infrastructure.
AWS Glue ETL Job Example: Step-by-Step
Objective
We’ll build an ETL job to:
- Extract CSV data from an S3 bucket.
- Transform the data by cleaning and filtering it.
- Load the transformed data into another S3 bucket in Parquet format.
Prerequisites
- An AWS account.
- An IAM role with the necessary permissions for AWS Glue, S3, and CloudWatch (a scripted role-creation sketch follows this list).
- Data stored in a source S3 bucket (e.g., s3://source-bucket/input-data.csv).
- A destination S3 bucket for the transformed data (e.g., s3://destination-bucket/output-data/).
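If you prefer to script the prerequisites, the IAM role can be created with boto3. This is a minimal sketch, not the only way to set up permissions: the role name GlueETLDemoRole is assumed for this walkthrough, the AWS managed AWSGlueServiceRole policy is attached, and you would still need to grant access to your specific S3 buckets.
import json
import boto3
iam = boto3.client("iam")
# Trust policy that lets the AWS Glue service assume the role
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "glue.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}
# "GlueETLDemoRole" is an assumed name used throughout this walkthrough
iam.create_role(
    RoleName="GlueETLDemoRole",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)
# Attach the AWS managed Glue service policy; add S3 bucket access separately
iam.attach_role_policy(
    RoleName="GlueETLDemoRole",
    PolicyArn="arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
)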
Step 1: Create an AWS Glue Database
- Open the AWS Glue Console.
- Navigate to Databases → Add Database.
- Enter a database name (e.g., etl_demo_db) and save.
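The console steps above can also be automated. A minimal boto3 sketch that creates the same Data Catalog database (the description text is just an example):
import boto3
glue = boto3.client("glue")
# Create the Glue Data Catalog database used by the crawler and the job
glue.create_database(
    DatabaseInput={
        "Name": "etl_demo_db",
        "Description": "Database for the Glue ETL demo",
    }
)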
Step 2: Create an AWS Glue Crawler
- Go to Crawlers in the Glue Console and select Add Crawler.
- Configure the crawler:
  - Data Store: Select S3 and provide the path to your source bucket (e.g., s3://source-bucket/).
  - IAM Role: Assign an IAM role with access to the S3 bucket.
  - Database: Select the database created in Step 1 (e.g., etl_demo_db).
- Run the crawler to populate the Glue Data Catalog with the table schema.
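As an alternative to the console, the crawler can be created and started with boto3. A sketch assuming the database, role, and bucket names used in this walkthrough (the crawler name etl_demo_crawler is made up here):
import boto3
glue = boto3.client("glue")
# Create a crawler that scans the source bucket and writes table
# definitions into the etl_demo_db database
glue.create_crawler(
    Name="etl_demo_crawler",                      # assumed crawler name
    Role="GlueETLDemoRole",                       # IAM role with S3 + Glue access
    DatabaseName="etl_demo_db",
    Targets={"S3Targets": [{"Path": "s3://source-bucket/"}]},
)
# Run the crawler to populate the Data Catalog
glue.start_crawler(Name="etl_demo_crawler")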
Step 3: Create an AWS Glue ETL Job
- Navigate to Jobs in the Glue Console and select Add Job.
- Configure the job:
  - Name: Enter a name for the job (e.g., etl_demo_job).
  - IAM Role: Assign an IAM role with the necessary permissions.
  - Type: Choose Spark (default).
  - Script Path: Use the Glue script editor or upload a custom script.
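The job definition can likewise be created through the API. A minimal boto3 sketch; the script location, Glue version, and worker settings below are assumptions you should adapt to your environment:
import boto3
glue = boto3.client("glue")
# Register the ETL job; the script location and worker settings are
# illustrative and should match your own setup
glue.create_job(
    Name="etl_demo_job",
    Role="GlueETLDemoRole",
    Command={
        "Name": "glueetl",                                      # Spark ETL job type
        "ScriptLocation": "s3://your-script-bucket/etl_demo_job.py",  # assumed path
        "PythonVersion": "3",
    },
    GlueVersion="4.0",
    WorkerType="G.1X",
    NumberOfWorkers=2,
)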
Step 4: ETL Script Example
Here’s a Python-based AWS Glue ETL script that reads data from an S3 bucket, filters it, and writes the transformed data to another bucket in Parquet format:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
# Initialize Glue context
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
# Extract: Load data from source S3 bucket
source_data = glueContext.create_dynamic_frame.from_options(
connection_type="s3",
connection_options={"paths": ["s3://source-bucket/input-data.csv"]},
format="csv",
format_options={"withHeader": True}
)
# Transform: Filter rows where a specific column meets a condition
transformed_data = Filter.apply(
frame=source_data,
f=lambda row: int(row["age"]) > 25 # Example: Filter rows where age > 25
)
# Load: Write the transformed data to destination S3 bucket in Parquet format
glueContext.write_dynamic_frame.from_options(
frame=transformed_data,
connection_type="s3",
connection_options={"path": "s3://destination-bucket/output-data/"},
format="parquet"
)
job.commit()
Step 5: Run the Job
- Save and run the ETL job from the AWS Glue Console.
- Monitor the job’s progress in the Runs tab.
- Check the destination S3 bucket for the transformed data.
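If you want to drive this step from code instead of the console, a boto3 sketch that starts the job created in Step 3 and polls its status looks like this (the polling loop is simplified; add timeouts and error handling as needed):
import time
import boto3
glue = boto3.client("glue")
# Start a run of the job created in Step 3
run = glue.start_job_run(JobName="etl_demo_job")
run_id = run["JobRunId"]
# Poll until the run reaches a terminal state
while True:
    status = glue.get_job_run(JobName="etl_demo_job", RunId=run_id)
    state = status["JobRun"]["JobRunState"]
    print(f"Job state: {state}")
    if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
        break
    time.sleep(30)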
Key Features of AWS Glue for ETL
- Serverless: No infrastructure to manage.
- Built-In Transformations: Pre-built transformations and Apache Spark support.
- Data Catalog: Automatically discovers and catalogs metadata.
- Job Monitoring: Logs and metrics available in CloudWatch.
Tips for Optimizing AWS Glue ETL Jobs
- Partition Data: Use partitions in your S3 bucket to improve query performance (a partitioned-write sketch follows this list).
- Optimize Costs: Adjust worker type and number to match job complexity.
- Test Scripts Locally: Use AWS Glue’s development endpoints for testing.
- Monitor and Debug: Use CloudWatch logs for troubleshooting errors.
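As an illustration of the Partition Data tip, the write step from the script in Step 4 can emit partitioned output. A sketch assuming the dataset has a year column to partition on:
# Write the output partitioned by the "year" column (assumed to exist in the data);
# this replaces the plain write_dynamic_frame call in the Step 4 script
glueContext.write_dynamic_frame.from_options(
    frame=transformed_data,
    connection_type="s3",
    connection_options={
        "path": "s3://destination-bucket/output-data/",
        "partitionKeys": ["year"],
    },
    format="parquet",
)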
Conclusion
AWS Glue makes building and running ETL pipelines seamless and scalable. By following this example, you can extract data from S3, transform it using Python scripts, and load it into a destination S3 bucket in the desired format. AWS Glue is an essential tool for modern data integration and analytics workflows, offering flexibility and ease of use for developers and data engineers.