Managed ETL using AWS Glue and Spark

ETL (Extract, Transform and Load) workloads are becoming increasingly popular, and more and more companies are looking for solutions to their ETL problems. Moving data from one datastore to another can become really expensive if the right tools are not chosen. AWS Glue provides easy-to-use tools for getting ETL workloads done. AWS Glue runs your ETL jobs in a serverless Apache Spark environment, so you do not have to manage any Spark clusters yourself.

To showcase the basic functionality of Glue, we will use MongoDB as a data source and move data from MongoDB collections to S3 for analytics purposes. Later on, we can query the data in S3 with Athena, an interactive query service that lets us run SQL queries against data in S3 buckets. We will use AWS CDK to provision and deploy the necessary resources and scripts.

For this tutorial, we will be using the CData JDBC driver for MongoDB. If you want to follow the same process, you need to contact CData customer support to get an authentication key for a trial.

After the imports and environment variables, we establish a connection to the Spark cluster that AWS runs for us in the background. We then loop over the provided MongoDB collections and read each one into a data frame. Finally, we write each data frame to the S3 bucket, its final destination, in CSV format.
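The script itself is not reproduced here, but a minimal sketch of what it could look like follows. It assumes the connection details are passed in as Glue job arguments, and the CData driver class name and JDBC URL format are assumptions that should be checked against the CData documentation for your driver version.

import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

# Job parameters (names are illustrative; they would be passed as Glue job arguments)
args = getResolvedOptions(sys.argv, [
    "JOB_NAME", "RTK", "MONGO_SERVER", "MONGO_PORT", "MONGO_USER",
    "MONGO_PASSWORD", "MONGO_DATABASE", "COLLECTIONS", "BUCKET_NAME"])

# Connect to the Spark cluster that Glue manages for us in the background
sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# JDBC URL and driver class follow the CData MongoDB driver conventions;
# verify the exact property names against the CData documentation
jdbc_url = (
    "jdbc:mongodb:Server={server};Port={port};Database={db};"
    "User={user};Password={password};UseSSL=true;RTK={rtk}"
).format(
    server=args["MONGO_SERVER"], port=args["MONGO_PORT"],
    db=args["MONGO_DATABASE"], user=args["MONGO_USER"],
    password=args["MONGO_PASSWORD"], rtk=args["RTK"])

# Read each collection into a data frame and write it to the S3 bucket as CSV
for collection in args["COLLECTIONS"].split(","):
    df = (spark.read.format("jdbc")
          .option("url", jdbc_url)
          .option("dbtable", collection)
          .option("driver", "cdata.jdbc.mongodb.MongoDBDriver")
          .load())
    df.write.mode("overwrite").option("header", "true").csv(
        "s3://{bucket}/{collection}/".format(
            bucket=args["BUCKET_NAME"], collection=collection))

job.commit()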

We can now continue with provisioning the necessary AWS resources. CDK currently supports the CloudFormation version of the AWS Glue package, and we will use those constructs to provision the Glue job and a scheduler.

First, we create an S3 bucket that stores our JDBC driver, the Spark script, and the transformed MongoDB collections. We add our dependencies (the JDBC driver and the Spark script) via the s3Deployment package inside CDK, which uploads all necessary files to the bucket during deployment. Next, we create an IAM role for the Glue job so it can interact with files in the S3 bucket. After that, we create the main Glue job, whose task is to run the script using our dependencies. As a last step, we add a scheduler to invoke the Glue job every 60 minutes.
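A rough sketch of the stack is shown below using the Python CDK bindings purely for illustration; the same constructs exist in the other CDK languages, and all resource names, asset paths, and the Glue version are assumptions.

from constructs import Construct
from aws_cdk import (
    Stack,
    aws_glue as glue,
    aws_iam as iam,
    aws_s3 as s3,
    aws_s3_deployment as s3deploy,
)

class ManagedEtlStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Bucket that holds the JDBC driver, the Spark script and the exported collections
        bucket = s3.Bucket(self, "EtlBucket", bucket_name="mongo-glue-etl")

        # Upload the script and the driver during deployment
        s3deploy.BucketDeployment(
            self, "EtlAssets",
            sources=[s3deploy.Source.asset("./assets")],  # local folder with script + jar (path is an assumption)
            destination_bucket=bucket,
            destination_key_prefix="assets")

        # Role the Glue job assumes so it can read and write the bucket
        role = iam.Role(
            self, "GlueJobRole",
            assumed_by=iam.ServicePrincipal("glue.amazonaws.com"),
            managed_policies=[iam.ManagedPolicy.from_aws_managed_policy_name(
                "service-role/AWSGlueServiceRole")])
        bucket.grant_read_write(role)

        # The Glue job itself, defined through the CloudFormation-level construct
        job = glue.CfnJob(
            self, "MongoToS3Job",
            name="mongo-to-s3",
            role=role.role_arn,
            glue_version="3.0",
            command=glue.CfnJob.JobCommandProperty(
                name="glueetl",
                python_version="3",
                script_location=f"s3://{bucket.bucket_name}/assets/etl.py"),
            default_arguments={
                "--extra-jars": f"s3://{bucket.bucket_name}/assets/cdata.jdbc.mongodb.jar",
                "--BUCKET_NAME": bucket.bucket_name})

        # Scheduler: a Glue trigger that starts the job every hour
        glue.CfnTrigger(
            self, "HourlyTrigger",
            type="SCHEDULED",
            schedule="cron(0 * * * ? *)",
            start_on_creation=True,
            actions=[glue.CfnTrigger.ActionProperty(job_name=job.name)])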

For deploying:

# Install dependencies
npm install
# Create .env file
AWS_REGION="us-east-1"
AWS_ACCOUNT_ID=""
RTK=""
MONGO_SERVER=""
MONGO_USER=""
MONGO_PASSWORD=""
MONGO_PORT=""
MONGO_SSL=true
MONGO_DATABASE=staging
COLLECTIONS="users,readers,admins"
BUCKET_NAME="mongo-glue-etl"
# Bootstrap resources
cdk bootstrap
# Deploy using CDK CLI
cdk deploy

After the deployment is finished, head over to the Glue console. You should see your job being run by the scheduler every hour. For now, select the job in the Glue console and run it manually. After a short while, the job should finish and you should see the status Succeeded for the run. You can also check the logs in CloudWatch to see what AWS Glue does in the background.
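If you prefer not to click through the console, the same manual run can be started and checked with boto3; the job name below is an assumption and should match whatever name you gave the job in the stack.

import boto3

glue_client = boto3.client("glue")

# Start a run of the job (job name is an assumption)
run = glue_client.start_job_run(JobName="mongo-to-s3")
run_id = run["JobRunId"]

# Check the state of that run; it should eventually reach SUCCEEDED
status = glue_client.get_job_run(JobName="mongo-to-s3", RunId=run_id)
print(status["JobRun"]["JobRunState"])  # e.g. RUNNING, SUCCEEDED, FAILED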

Head over to the S3 console to see the transformed data.

To remove this stack completely, you need to manually delete the S3 bucket and run the CDK command to destroy the deployed stack:

cdk destroy
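The S3 bucket has to be emptied and removed by hand; a short boto3 sketch that does this is below, with the bucket name assumed from the .env file above.

import boto3

# Bucket name assumed from the .env file
bucket = boto3.resource("s3").Bucket("mongo-glue-etl")

# Delete every object, then the bucket itself, so nothing is left behind
bucket.objects.all().delete()
bucket.delete()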

And that's it; the whole setup was straightforward. By using managed services, your company can spend more time on product features instead of managing the underlying infrastructure or software.

