Apache Airflow Architecture: A Detailed Overview
Apache Airflow is a powerful open-source platform used to programmatically author, schedule, and monitor workflows. It is designed for complex data engineering tasks, pipeline automation, and orchestrating multiple processes. This article will break down Airflow’s architecture and provide a code example to help you understand how to work with it.
Key Concepts in Airflow
Before diving into the architecture, let’s go over some important Airflow concepts:
- DAG (Directed Acyclic Graph): The core abstraction in Airflow. A DAG represents a workflow, organized as a set of tasks that can be scheduled and executed.
- Operator: A specific task within a DAG. There are various types of operators, including PythonOperator, BashOperator, and others.
- Task: An individual step in a workflow.
- Executor: Determines how and where tasks are run, from a single local process up to a pool of distributed worker nodes.
- Scheduler: Determines when DAGs and their tasks should run.
- Web Server: Provides a UI for monitoring DAGs and tasks.
- Metadata Database: Stores information about the DAGs and their run status.
Now that we’ve introduced these basic concepts, let’s look at Airflow’s architecture in detail.
Airflow Architecture
The Airflow architecture is based on a distributed model where different components handle specific responsibilities. The primary components are:
Scheduler
The Scheduler is the heart of Airflow. It is responsible for determining when a task should run based on the scheduling interval defined in a DAG. It monitors all active DAGs and adds tasks to the execution queue.
- DAG Parsing: The scheduler continuously re-parses DAG files to pick up changes and new DAGs.
- Task Queueing: It places tasks whose schedule and dependencies are satisfied into the execution queue. The parsing cadence can be tuned in Airflow's configuration, as sketched below.
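A rough illustration (the values are arbitrary and option names can differ between Airflow versions): these settings live in the [scheduler] section of airflow.cfg and can also be set through environment variables.
```bash
# How often (seconds) to scan the DAGs folder for new files
# (maps to [scheduler] dag_dir_list_interval in airflow.cfg).
export AIRFLOW__SCHEDULER__DAG_DIR_LIST_INTERVAL=300

# Minimum interval (seconds) between re-parses of the same DAG file
# (maps to [scheduler] min_file_process_interval).
export AIRFLOW__SCHEDULER__MIN_FILE_PROCESS_INTERVAL=30
```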
Executor
The Executor is responsible for running the tasks that the scheduler assigns to it. Different types of executors can be used, depending on the scale and complexity of the environment.
- SequentialExecutor: Useful for development and debugging, but can only run one task at a time.
- LocalExecutor: Runs tasks in parallel on the local machine.
- CeleryExecutor: Uses Celery with a message broker (Redis or RabbitMQ) to run tasks in parallel across multiple worker nodes. The executor is chosen in Airflow's configuration, as sketched below.
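A minimal sketch of selecting an executor, assuming configuration via an environment variable (the same setting lives under [core] in airflow.cfg):
```bash
# Choose the executor. SequentialExecutor is the default with SQLite;
# LocalExecutor and CeleryExecutor require a real metadata database
# such as PostgreSQL or MySQL.
export AIRFLOW__CORE__EXECUTOR=LocalExecutor
```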
Workers
Workers are the processes, often spread across multiple machines, where tasks actually execute. In larger deployments, workers are distributed to handle high workloads efficiently; they receive tasks from the executor's queue and run them.
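In a CeleryExecutor deployment, a worker is started with a single CLI command (a sketch; the queue name is an illustrative placeholder):
```bash
# Start a Celery worker that pulls tasks from the default queue.
airflow celery worker

# Or have the worker listen only on a specific queue (placeholder name).
airflow celery worker --queues etl_queue
```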
Web Server
The Web Server provides an interface for users to monitor and manage the execution of workflows. Built on Flask, it offers a rich UI to visualize DAGs, track task statuses, and view logs.
Metadata Database
Airflow uses a relational database (e.g., PostgreSQL, MySQL) as the metadata store. It holds details about DAGs, task instances, users, connections, variables, and other essential metadata.
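For instance, pointing Airflow at a PostgreSQL metadata store might look like this (a sketch with placeholder credentials; on Airflow versions before 2.3 the setting lives under [core] rather than [database], and newer releases prefer airflow db migrate over airflow db init):
```bash
# Metadata database connection (placeholder credentials and host).
export AIRFLOW__DATABASE__SQL_ALCHEMY_CONN="postgresql+psycopg2://airflow:airflow@localhost:5432/airflow"

# Create (or upgrade) the metadata tables.
airflow db init
```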
Flower
Flower is an optional component that can be used with the CeleryExecutor to monitor worker nodes and tasks in real-time.
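When used, Flower is typically launched through Airflow's Celery CLI (a sketch; 5555 is Flower's default port):
```bash
# Start Flower to monitor Celery workers and queues (defaults to port 5555).
airflow celery flower
```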
Message Broker (For CeleryExecutor)
In a setup using the CeleryExecutor, a message broker (e.g., RabbitMQ or Redis) carries task messages between the scheduler/executor and the workers.
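A sketch of the broker-related settings (placeholder hosts and credentials; these map to the [celery] section of airflow.cfg):
```bash
# Where the CeleryExecutor publishes task messages (placeholder Redis URL).
export AIRFLOW__CELERY__BROKER_URL="redis://localhost:6379/0"

# Where Celery stores task results (placeholder PostgreSQL connection).
export AIRFLOW__CELERY__RESULT_BACKEND="db+postgresql://airflow:airflow@localhost:5432/airflow"
```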
DagBag
A DagBag is the collection of DAGs that Airflow has parsed from the DAGs folder. Whenever a DAG file is added or updated, the scheduler re-parses it and the new or changed DAG becomes available for scheduling.
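The DagBag can also be loaded programmatically, which is a handy way to check that every DAG file imports cleanly (a minimal sketch; it parses whatever DAGs folder your installation is configured with):
```python
from airflow.models import DagBag

# Parse every DAG file in the configured DAGs folder.
dag_bag = DagBag()

# IDs of the DAGs that loaded successfully.
print(dag_bag.dag_ids)

# Files that failed to import, mapped to their error messages.
print(dag_bag.import_errors)
```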
Typical Workflow
- Authoring DAGs: DAGs are Python scripts that define the workflow. The user defines a set of tasks (using operators) and their dependencies.
- Scheduler Monitoring: The scheduler parses the DAGs and determines when they should be run based on the defined scheduling intervals (e.g., daily, hourly).
- Task Queuing: Tasks that are ready for execution are placed in a queue by the scheduler.
- Execution by Workers: The executor pulls tasks from the queue and assigns them to worker nodes for execution.
- Task Tracking: As tasks are executed, the metadata database is updated with the task status (e.g., success, failure).
- Monitoring via Web UI: The status of DAGs and tasks can be monitored in real time using the web server, or inspected from the command line, as sketched below.
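A few CLI commands that exercise this workflow (the DAG ID is a placeholder):
```bash
# List every DAG the scheduler currently knows about.
airflow dags list

# List the tasks inside a specific DAG (placeholder DAG ID).
airflow tasks list my_example_dag

# Trigger a manual run of that DAG.
airflow dags trigger my_example_dag
```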
Code Example
Let’s create a basic DAG that uses a PythonOperator to run a Python function.
DAG Definition
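A minimal DAG of the kind described here, written against the Airflow 2.x API (the dag_id, default arguments, and the body of my_task are illustrative):
```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


# The Python function that the PythonOperator will run as a task.
def my_task():
    print("Hello from my_task!")


# Default arguments applied to every task in the DAG.
default_args = {
    "owner": "airflow",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

# Define the DAG: its ID, start date, schedule, and default arguments.
with DAG(
    dag_id="my_example_dag",
    description="A simple DAG that runs one Python task daily",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    default_args=default_args,
    catchup=False,
) as dag:

    # Wrap my_task in a PythonOperator so Airflow can schedule it.
    run_my_task = PythonOperator(
        task_id="my_task",
        python_callable=my_task,
    )
```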
Breakdown of the Code
- DAG Definition: We start by defining the DAG, including its `start_date`, schedule, and default arguments.
- PythonOperator: The `PythonOperator` is used to run the Python function `my_task` as a task in the DAG.
- Scheduling: In this case, the DAG is scheduled to run once per day.
Running the DAG
- Place the DAG file in your Airflow DAGs folder (typically `~/airflow/dags`).
- Start the Airflow scheduler:
```bash
airflow scheduler
```
- Start the web server to access the Airflow UI:
```bash
airflow webserver
```
- Navigate to `localhost:8080` to monitor and trigger your DAG.
Conclusion
Apache Airflow is a flexible and scalable platform for orchestrating workflows. Its modular architecture—comprising the scheduler, workers, web server, and metadata database—makes it ideal for managing complex data pipelines in distributed environments. The ability to define DAGs using Python, combined with its rich set of operators and scheduling capabilities, provides a powerful way to automate data workflows.
The provided code example shows how simple it is to define and run a task using PythonOperator. As you scale up, Airflow supports a range of executors and message brokers to handle more complex, distributed workloads efficiently.
By understanding Airflow’s architecture and seeing a basic example in action, you’re well on your way to using Airflow to manage and automate workflows in your projects.