AWS Data Pipeline
Summary: a service for creating ETL workflows that automate data processing at scheduled intervals and then terminate the resources used.
- AWS Data Pipeline can be used to copy a DynamoDB table from one region to another
- behind the scenes it launches an EMR cluster -> stores the data in S3 -> starts an EMR cluster in the other region, which loads from S3 and writes to DynamoDB.
Pipeline
- composed of:
- Datanodes
- Activities
- Preconditions
- Schedules
- The pipeline is executed on EC2 instances or EMR clusters that are provisioned and terminated automatically.
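A minimal sketch of how those components fit together in a pipeline definition, expressed in the boto3 `pipelineObjects` format (all ids, the S3 path, and the IAM role names below are illustrative placeholders):

```python
# Minimal pipeline definition sketch (boto3 pipelineObjects format).
# Ids, the S3 path, and the IAM role names are placeholders.
pipeline_objects = [
    {"id": "Default", "name": "Default", "fields": [
        {"key": "scheduleType", "stringValue": "cron"},
        {"key": "schedule", "refValue": "DailySchedule"},
        {"key": "role", "stringValue": "DataPipelineDefaultRole"},
        {"key": "resourceRole", "stringValue": "DataPipelineDefaultResourceRole"},
    ]},
    # Schedule: how frequently the pipeline runs
    {"id": "DailySchedule", "name": "DailySchedule", "fields": [
        {"key": "type", "stringValue": "Schedule"},
        {"key": "period", "stringValue": "1 day"},
        {"key": "startAt", "stringValue": "FIRST_ACTIVATION_DATE_TIME"},
    ]},
    # Data node: where the input data lives
    {"id": "InputData", "name": "InputData", "fields": [
        {"key": "type", "stringValue": "S3DataNode"},
        {"key": "directoryPath", "stringValue": "s3://my-bucket/input/"},
    ]},
    # Resource: provisioned when needed, terminated when done
    {"id": "WorkerInstance", "name": "WorkerInstance", "fields": [
        {"key": "type", "stringValue": "Ec2Resource"},
        {"key": "instanceType", "stringValue": "t2.micro"},
        {"key": "terminateAfter", "stringValue": "1 Hour"},
    ]},
    # Activity: the work, wired to the data node and the resource
    {"id": "ProcessData", "name": "ProcessData", "fields": [
        {"key": "type", "stringValue": "ShellCommandActivity"},
        {"key": "command", "stringValue": "echo processing"},
        {"key": "input", "refValue": "InputData"},
        {"key": "runsOn", "refValue": "WorkerInstance"},
    ]},
]
```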
Running on-premises:
- Install the Task Runner package on your on-premises hosts
- The package continuously polls AWS Data Pipeline for work to perform
- Examples: running a database stored procedure or a database dump.
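The prebuilt Task Runner package implements this poll-and-report loop for you; a rough boto3 sketch of the same loop (the worker group name and `run_local_job` are hypothetical):

```python
import boto3

client = boto3.client("datapipeline")

while True:
    # Long-poll Data Pipeline for a task assigned to this worker group
    response = client.poll_for_task(workerGroup="my-onprem-group")
    task = response.get("taskObject")
    if not task:
        continue  # no work available yet
    try:
        run_local_job(task)  # hypothetical helper: e.g. run a stored procedure or DB dump
        client.set_task_status(taskId=task["taskId"], taskStatus="FINISHED")
    except Exception as exc:
        client.set_task_status(taskId=task["taskId"], taskStatus="FAILED",
                               errorId="LocalJobError", errorMessage=str(exc))
```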
Activity:
- An action that Data Pipeline initiates on your behalf as part of a pipeline
- examples:
    - CopyActivity (copies data from one data node to another)
    - EmrActivity (spawns an EMR cluster, runs the work, then terminates the cluster)
    - ShellCommandActivity (executes a custom shell script)
- ...
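As an illustration of the EMR case above, a hedged sketch of an EmrActivity with the transient cluster it runs on (instance types, counts, the step JAR, and its arguments are placeholders):

```python
# Sketch: EmrActivity plus the transient EmrCluster it runs on.
# Instance types, counts, and the step JAR path are placeholders.
emr_objects = [
    {"id": "MyCluster", "name": "MyCluster", "fields": [
        {"key": "type", "stringValue": "EmrCluster"},
        {"key": "masterInstanceType", "stringValue": "m5.xlarge"},
        {"key": "coreInstanceType", "stringValue": "m5.xlarge"},
        {"key": "coreInstanceCount", "stringValue": "2"},
        {"key": "terminateAfter", "stringValue": "2 Hours"},  # cluster is torn down
    ]},
    {"id": "EtlStep", "name": "EtlStep", "fields": [
        {"key": "type", "stringValue": "EmrActivity"},
        {"key": "runsOn", "refValue": "MyCluster"},
        # step format: jar,main-class,arg1,arg2,...
        {"key": "step", "stringValue": "s3://my-bucket/jobs/etl.jar,com.example.Main,arg1"},
    ]},
]
```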
Precondition
- Optional; can be applied to a data source or an activity.
- When a data source has a precondition check then that check must succeed before any activity that consumes that data source can be launched.
- examples:
- DynamoDBDataExists
- DynamoDBTableExists
- S3KeyExists
- S3PrefixExists
- ShellCommandPrecondition (for custom preconditions)
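A sketch of how a precondition attaches to a data source: an S3KeyExists check on a data node, so any activity consuming it waits until the key appears (bucket and key names are placeholders):

```python
# Sketch: S3KeyExists precondition attached to an S3 data node.
# The bucket and key names are placeholders.
precondition_objects = [
    {"id": "InputReady", "name": "InputReady", "fields": [
        {"key": "type", "stringValue": "S3KeyExists"},
        {"key": "s3Key", "stringValue": "s3://my-bucket/input/_SUCCESS"},
    ]},
    {"id": "InputData", "name": "InputData", "fields": [
        {"key": "type", "stringValue": "S3DataNode"},
        {"key": "directoryPath", "stringValue": "s3://my-bucket/input/"},
        {"key": "precondition", "refValue": "InputReady"},  # must pass before consumers run
    ]},
]
```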
Schedule:
- Defines how frequently the pipeline runs.
Much of Data Pipeline's functionality has since been replaced by AWS Lambda.
AWS Data Pipeline
- AWS Data Pipeline is a web service that makes it easy to automate and schedule regular data movement and data processing activities in AWS
- AWS Data Pipeline helps define data-driven workflows
- AWS Data Pipeline integrates with on-premises and cloud-based storage systems to allow developers to use their data when they need it, where they want it, and in the required format.
- AWS Data Pipeline allows you to quickly define a pipeline: a dependent chain of data sources, destinations, and predefined or custom data processing activities
- Based on a defined schedule, the pipeline regularly performs processing activities such as distributed data copy, SQL transforms, EMR applications, or custom scripts against destinations such as S3, RDS, or DynamoDB.
- By executing the scheduling, retry, and failure logic for the workflows as a highly scalable and fully managed service, Data Pipeline ensures that the pipelines are robust and highly available.
AWS Data Pipeline features
- Distributed, fault-tolerant and highly available
- Managed workflow orchestration service for data-driven workflows
- Infrastructure management service, will provision and terminate resources as required
- Provides dependency resolution
- Can be scheduled
- Grants control over retries, including frequency and number
- Native integration with S3, DynamoDB, RDS, EMR, EC2 and Redshift
- Support for both AWS-based and external on-premises resources
AWS Data Pipeline Concepts
Pipeline Definition
- The pipeline definition is how the business logic is communicated to AWS Data Pipeline
- A pipeline definition specifies the location of data (data nodes), the activities to be performed, the schedule, the resources that run the activities, preconditions, and actions to be performed
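A sketch of handing a definition like the one above to the service with boto3 (the pipeline name and uniqueId are placeholders; `pipeline_objects` is a list in the key/stringValue/refValue format shown earlier):

```python
import boto3

client = boto3.client("datapipeline")

# Create an empty pipeline, then attach and activate its definition
created = client.create_pipeline(name="nightly-etl", uniqueId="nightly-etl-001")
pipeline_id = created["pipelineId"]

result = client.put_pipeline_definition(
    pipelineId=pipeline_id,
    pipelineObjects=pipeline_objects,
)
if result["errored"]:
    raise RuntimeError(result["validationErrors"])  # definition failed validation

client.activate_pipeline(pipelineId=pipeline_id)  # scheduling starts from here
```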
Pipeline Components, Instances, and Attempts
- Pipeline components represent the business logic of the pipeline and are represented by the different sections of a pipeline definition.
- Pipeline components specify the data sources, activities, schedule, and preconditions of the workflow
- When AWS Data Pipeline runs a pipeline, it compiles the pipeline components into a set of actionable instances; each instance contains all the information needed to perform a specific task
- Data Pipeline provides durable and robust data management, retrying failed operations according to the defined retry frequency and count (configured per activity, as sketched below)
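A minimal sketch of that per-activity retry configuration, assuming illustrative values:

```python
# Sketch: retry behaviour set on an activity via maximumRetries/retryDelay.
retrying_activity = {
    "id": "ProcessData", "name": "ProcessData", "fields": [
        {"key": "type", "stringValue": "ShellCommandActivity"},
        {"key": "command", "stringValue": "echo processing"},
        {"key": "maximumRetries", "stringValue": "3"},       # extra attempts after the first
        {"key": "retryDelay", "stringValue": "10 Minutes"},  # wait between attempts
    ],
}
```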
Task Runners
- A task runner is an application that polls AWS Data Pipeline for tasks and then performs those tasks
- When Task Runner is installed and configured,
- it polls AWS Data Pipeline for tasks associated with activated pipelines
- after a task is assigned to Task Runner, it performs that task and reports its status back to AWS Data Pipeline.
- A task is a discrete unit of work that the Data Pipeline service shares with a task runner; it differs from a pipeline, which defines activities and resources and usually yields several tasks
- Tasks can be executed on either AWS Data Pipeline-managed or user-managed resources
Data Nodes
- Data Node defines the location and type of data that a pipeline activity uses as source (input) or destination (output)
- Data Pipeline supports S3, Redshift, DynamoDB, and SQL data nodes
Databases
- Data Pipeline supports JDBC, RDS, and Redshift databases
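A sketch of a JDBC database definition plus a SQL data node that reads from it (connection string, credentials, and table name are placeholders):

```python
# Sketch: JdbcDatabase object and a SqlDataNode that reads from it.
# Connection string, credentials, and table name are placeholders.
database_objects = [
    {"id": "MyDatabase", "name": "MyDatabase", "fields": [
        {"key": "type", "stringValue": "JdbcDatabase"},
        {"key": "connectionString", "stringValue": "jdbc:mysql://db-host:3306/mydb"},
        {"key": "jdbcDriverClass", "stringValue": "com.mysql.jdbc.Driver"},
        {"key": "username", "stringValue": "etl_user"},
        {"key": "*password", "stringValue": "placeholder"},  # leading * marks the field as encrypted
    ]},
    {"id": "OrdersTable", "name": "OrdersTable", "fields": [
        {"key": "type", "stringValue": "SqlDataNode"},
        {"key": "database", "refValue": "MyDatabase"},
        {"key": "table", "stringValue": "orders"},
        {"key": "selectQuery", "stringValue": "select * from orders where processed = 0"},
    ]},
]
```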
Activities
- An activity is a pipeline component that defines the work to perform
- Data Pipeline provides predefined activities for common scenarios like SQL transformations, data movement, Hive queries, etc.
- Activities are extensible and can run your own custom scripts to support endless combinations
Preconditions
- Precondition is a pipeline component containing conditional statements that must be satisfied (evaluated to True) before an activity can run
- A pipeline supports
- System-managed preconditions
- are run by the AWS Data Pipeline web service on your behalf and do not require a computational resource
    - Include checks on source data and keys, e.g. DynamoDB data or table exists, S3 key exists, or S3 prefix is not empty
- User-managed preconditions
- run on user defined and managed computational resources
    - Can be defined as an Exists check or a shell command
Resources
- A resource is the computational resource that performs the work that a pipeline activity specifies
- Data Pipeline supports AWS Data Pipeline-managed and self-managed resources
- AWS Data Pipeline-managed resources include EC2 and EMR, which are launched by the Data Pipeline service only when they’re needed
- Self-managed on-premises resources can also be used; a Task Runner package is installed that continuously polls the AWS Data Pipeline service for work to perform
- Resources can run in the same region as their working data set, or even in a different region than AWS Data Pipeline
- Resources launched by AWS Data Pipeline count toward your account's resource limits and should be taken into account
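A sketch of the two targeting styles: the same activity pointed at a Data Pipeline-managed EC2 resource via runsOn, versus a self-managed Task Runner via workerGroup (an activity uses one or the other; all names are placeholders):

```python
# Managed resource: launched and terminated by the Data Pipeline service
on_managed_resource = {"id": "Job", "name": "Job", "fields": [
    {"key": "type", "stringValue": "ShellCommandActivity"},
    {"key": "command", "stringValue": "./job.sh"},
    {"key": "runsOn", "refValue": "WorkerInstance"},
]}

# Self-managed resource: picked up by a Task Runner polling this worker group
on_premises = {"id": "Job", "name": "Job", "fields": [
    {"key": "type", "stringValue": "ShellCommandActivity"},
    {"key": "command", "stringValue": "./job.sh"},
    {"key": "workerGroup", "stringValue": "my-onprem-group"},
]}
```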
Actions
- Actions are steps that a pipeline takes when a certain event occurs, such as success or failure.
- Pipelines support SNS notifications and a termination action on resources
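A sketch of wiring an SNS notification to an activity's failure event (the topic ARN is a placeholder):

```python
# Sketch: SnsAlarm action fired via the activity's onFail field.
# The topic ARN is a placeholder.
action_objects = [
    {"id": "FailureAlarm", "name": "FailureAlarm", "fields": [
        {"key": "type", "stringValue": "SnsAlarm"},
        {"key": "topicArn", "stringValue": "arn:aws:sns:us-east-1:123456789012:pipeline-alerts"},
        {"key": "subject", "stringValue": "Pipeline activity failed"},
        {"key": "message", "stringValue": "Check the pipeline logs."},
    ]},
    {"id": "ProcessData", "name": "ProcessData", "fields": [
        {"key": "type", "stringValue": "ShellCommandActivity"},
        {"key": "command", "stringValue": "echo processing"},
        {"key": "onFail", "refValue": "FailureAlarm"},  # also available: onSuccess, onLateAction
    ]},
]
```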