Managed ETL using AWS Glue and Spark

ETL (Extract, Transform and Load) workloads have become increasingly popular, and a growing number of companies are looking for solutions to their ETL problems. Moving data from one datastore to another can become really expensive if the right tools are not chosen. AWS Glue provides easy-to-use tools for getting ETL workloads done. It runs your ETL jobs in a serverless Apache Spark environment, so you are not managing any Spark clusters yourself.

To showcase the basic functionality of Glue, we will use MongoDB as a data source and move data from MongoDB collections to S3 for analytics. Later on, we can query the data in S3 using Athena, an interactive query service that lets us run SQL queries against data in S3 buckets. We will use AWS CDK to provision and deploy the necessary resources and scripts.

For this tutorial, we will be using the CData JDBC driver. If you want to follow the same process, you need to contact CData customer support to get an authentication key for a trial.

After the imports and environment variables, we establish a connection to the Spark cluster that AWS runs for us in the background. We then loop over the provided MongoDB collections and read each one into a data frame. Finally, we write each data frame to the S3 bucket, our final destination, in CSV format.
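
Below is a minimal sketch of such a script in PySpark. The job-argument names, the CData driver class and the connection-string properties are assumptions; check the CData documentation for the exact values for your driver version:

import sys

from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Resolve the parameters passed in by the Glue job definition
# (the argument names here are illustrative)
args = getResolvedOptions(sys.argv, [
    "JOB_NAME", "BUCKET_NAME", "COLLECTIONS",
    "MONGO_SERVER", "MONGO_PORT", "MONGO_DATABASE",
    "MONGO_USER", "MONGO_PASSWORD", "MONGO_SSL", "RTK",
])

# Connect to the Spark session that Glue manages for us in the background
glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# CData-style JDBC URL for MongoDB; property names may differ per driver version
jdbc_url = (
    "jdbc:mongodb:Server={server};Port={port};Database={db};"
    "User={user};Password={password};UseSSL={ssl};RTK={rtk};"
).format(
    server=args["MONGO_SERVER"], port=args["MONGO_PORT"],
    db=args["MONGO_DATABASE"], user=args["MONGO_USER"],
    password=args["MONGO_PASSWORD"], ssl=args["MONGO_SSL"], rtk=args["RTK"],
)

for collection in args["COLLECTIONS"].split(","):
    # Read one MongoDB collection into a Spark data frame via the JDBC driver
    df = (
        spark.read.format("jdbc")
        .option("url", jdbc_url)
        .option("dbtable", collection)
        .option("driver", "cdata.jdbc.mongodb.MongoDBDriver")
        .load()
    )

    # Write the data frame to the S3 bucket as CSV, one prefix per collection
    (
        df.write.mode("overwrite")
        .option("header", "true")
        .csv("s3://{}/data/{}".format(args["BUCKET_NAME"], collection))
    )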

We can now continue with provisioning the necessary AWS resources. CDK currently supports the CloudFormation (Cfn*) constructs of the AWS Glue package, and we will use those to provision the Glue job and a scheduler.

First, we create an S3 bucket for storing our JDBC driver and Spark script, which also serves as the destination for the MongoDB collections. We add our dependencies (the JDBC driver and the Spark script) via the s3Deployment package in CDK, which uploads all necessary files during deployment. Next, we create an IAM role for the Glue job so it can interact with the files in the S3 bucket. After that, we create the main Glue job, whose job is to run the script using our dependencies. As the last step, we add a scheduler that invokes the Glue job every 60 minutes.
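
The deploy steps below use npm, so the actual stack is presumably written in TypeScript; the following Python CDK sketch only illustrates the same structure, and the construct IDs, file names and job name are placeholders:

from aws_cdk import (
    RemovalPolicy,
    Stack,
    aws_glue as glue,
    aws_iam as iam,
    aws_s3 as s3,
    aws_s3_deployment as s3_deployment,
)
from constructs import Construct


class MongoGlueEtlStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Bucket that holds the JDBC driver, the Spark script and the exported CSVs
        bucket = s3.Bucket(
            self, "EtlBucket",
            bucket_name="mongo-glue-etl",
            removal_policy=RemovalPolicy.DESTROY,
        )

        # Upload the script and the driver while the stack is being deployed
        s3_deployment.BucketDeployment(
            self, "Dependencies",
            sources=[s3_deployment.Source.asset("./assets")],
            destination_bucket=bucket,
            destination_key_prefix="dependencies",
        )

        # Role the Glue job assumes; it needs to read and write the bucket
        role = iam.Role(
            self, "GlueJobRole",
            assumed_by=iam.ServicePrincipal("glue.amazonaws.com"),
        )
        bucket.grant_read_write(role)

        # The Glue job itself, using the CloudFormation (Cfn*) constructs
        job = glue.CfnJob(
            self, "MongoToS3Job",
            name="mongo-to-s3",
            role=role.role_arn,
            command=glue.CfnJob.JobCommandProperty(
                name="glueetl",
                python_version="3",
                script_location=f"s3://{bucket.bucket_name}/dependencies/etl.py",
            ),
            default_arguments={
                "--extra-jars": f"s3://{bucket.bucket_name}/dependencies/mongodb-jdbc.jar",
                # the MongoDB connection details and the bucket name for the script
                # would also be passed here as "--KEY": "value" arguments
            },
            glue_version="2.0",
        )

        # Scheduled trigger that starts the job every 60 minutes
        glue.CfnTrigger(
            self, "HourlyTrigger",
            type="SCHEDULED",
            schedule="cron(0 * * * ? *)",
            start_on_creation=True,
            actions=[glue.CfnTrigger.ActionProperty(job_name=job.name)],
        )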

For deploying:

# Install dependencies
npm install
# Create a .env file with the following content
AWS_REGION="us-east-1"
AWS_ACCOUNT_ID=""
RTK=""
MONGO_SERVER=""
MONGO_USER=""
MONGO_PASSWORD=""
MONGO_PORT=""
MONGO_SSL=true
MONGO_DATABASE=staging
COLLECTIONS="users,readers,admins"
BUCKET_NAME="mongo-glue-etl"
# Bootstrap resources
cdk bootstrap
# Deploy using CDK CLI
cdk deploy

After the deployment is finished, head over to the Glue console. You should see your job there, scheduled to run every hour. Select the job in the Glue console and run it manually for now. After some time the job should finish and you should see the status Succeeded for the run. You can also check the logs in CloudWatch to see what AWS Glue does in the background.
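
If you prefer the AWS SDK to the console, a small boto3 sketch for starting a run and checking its status (the job name is a placeholder; use whatever name your stack assigns):

import boto3

glue = boto3.client("glue")

# Start a run of the job; the job name is a placeholder
run = glue.start_job_run(JobName="mongo-to-s3")

# Check the state of the run; it eventually reports SUCCEEDED or FAILED
status = glue.get_job_run(JobName="mongo-to-s3", RunId=run["JobRunId"])
print(status["JobRun"]["JobRunState"])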

Head over to the S3 console to see the transformed data.
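
As mentioned earlier, this data can then be queried with Athena. A minimal sketch using boto3, assuming a table has already been defined over the bucket (for example with a Glue crawler or a CREATE EXTERNAL TABLE statement); the database name, table name and output location are placeholders:

import boto3

athena = boto3.client("athena")

# The database and table are placeholders; define them first,
# for example with a Glue crawler pointed at the bucket
query = athena.start_query_execution(
    QueryString="SELECT COUNT(*) FROM users",
    QueryExecutionContext={"Database": "mongo_glue_etl"},
    ResultConfiguration={"OutputLocation": "s3://mongo-glue-etl/athena-results/"},
)

print(query["QueryExecutionId"])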

To remove the stack completely, you need to manually delete the S3 bucket (a small boto3 sketch for that follows the command below) and run the CDK command to delete the deployed stack.

cdk destroy
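
Emptying the bucket by hand can be tedious; here is a small boto3 sketch that deletes all objects and then the bucket itself:

import boto3

s3 = boto3.resource("s3")

# Delete every object in the bucket, then the bucket itself;
# the bucket name matches BUCKET_NAME from the .env file above
bucket = s3.Bucket("mongo-glue-etl")
bucket.objects.all().delete()
bucket.delete()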

And that's it; that was easy to set up. By using managed services, your company can spend more time on product features instead of managing the underlying infrastructure or software.

