Apache Aurora is a service scheduler that runs on top of Apache Mesos, enabling you to run long-running services, cron jobs, and ad-hoc jobs that take advantage of Apache Mesos’ scalability, fault-tolerance, and resource isolation.
It is important to have an understanding of the components that make up a functioning Aurora cluster.
Aurora scheduler
The scheduler is your primary interface to the work you run in your cluster. You will instruct it to run jobs, and it will manage them in Mesos for you. You will also frequently use the scheduler’s read-only web interface as a heads-up display for what’s running in your cluster.
Aurora client
The client (aurora command) is a command line tool that exposes
primitives that you can use to interact with the scheduler. The client
operates on jobs.
Aurora also provides an admin client (aurora_admin command) that
contains commands built for cluster administrators. You can use this
tool to do things like manage user quotas and perform graceful
maintenance on machines in the cluster.
Aurora executor
The executor (a.k.a. Thermos executor) is responsible for carrying out
the workloads described in the Aurora DSL (.aurora files). The
executor is what actually executes user processes. It will also
perform health checking of tasks and register tasks in ZooKeeper for
the purposes of dynamic service discovery.
Aurora observer
The observer provides browser-based access to the status of individual tasks executing on worker machines. It gives insight into the processes executing, and facilitates browsing of task sandbox directories.
ZooKeeper
ZooKeeper is a distributed consensus system. In an Aurora cluster it is used for reliable election of the leading Aurora scheduler and Mesos master. It is also used as a vehicle for service discovery; see Service Discovery.
Mesos master
The master is responsible for tracking worker machines and performing accounting of their resources. The scheduler interfaces with the master to control the cluster.
Mesos agent
The agent receives work assigned by the scheduler and executes it. It interfaces with Linux isolation systems like cgroups, namespaces and Docker to manage the resource consumption of tasks. When a user task is launched, the agent launches the executor (in the context of a Linux cgroup or Docker container depending upon the environment), which in turn forks user processes.
In earlier versions of Mesos and Aurora, the Mesos agent was known as the Mesos slave.
Aurora is a Mesos framework used to schedule jobs onto Mesos. Mesos
cares about individual tasks, but typical jobs consist of dozens or
hundreds of task replicas. Aurora provides a layer on top of Mesos
with its Job abstraction. An Aurora
Job consists of a task template and
instructions for creating near-identical replicas of that task (modulo
things like “instance id” or specific port numbers which may differ from
machine to machine).
How many tasks make up a Job is complicated. On a basic level, a Job consists of one task template and instructions for creating near-identical replicas of that task (otherwise referred to as “instances” or “shards”).
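The template-plus-replicas idea can be sketched in plain Python. This is an illustrative model only, not Aurora’s actual internals; the field names, the function, and the port scheme are all hypothetical:

```python
# Illustrative sketch: expanding one task template into near-identical
# replica tasks that differ only in instance id and assigned port.
# Field names and the base port are hypothetical, not Aurora's data model.

def expand_job(template, instances, base_port=31000):
    """Create `instances` copies of `template`, each with its own
    instance id and port."""
    tasks = []
    for instance_id in range(instances):
        task = dict(template)                    # copy the shared fields
        task["instance_id"] = instance_id        # per-replica identity
        task["port"] = base_port + instance_id   # per-replica port
        tasks.append(task)
    return tasks

replicas = expand_job({"name": "hello", "cmdline": "python2.7 my_script.py"}, 3)
```

Every replica shares the template’s name and command line; only the per-instance fields vary.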
A task can merely be a single process corresponding to a single
command line, such as python2.7 my_script.py. However, a task can also
consist of many separate processes, which all run within a single
sandbox: for example, multiple cooperating agents running together,
such as installer, master, or agent processes. This is where Thermos
comes in. While Aurora provides a Job abstraction on top of Mesos
Tasks, Thermos provides a Process abstraction underneath Mesos Tasks
and serves as part of the Aurora framework’s executor. You define both
Tasks and Processes in a configuration file.
Configuration files are written in Python, and make use of the
Pystachio templating language, along with specific Aurora, Mesos, and
Thermos commands and methods. The configuration files typically end
with a .aurora extension.
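For illustration, a minimal configuration in this style might look like the following sketch. The cluster, role, and resource values are placeholders, not defaults from any real cluster:

```python
# hello_world.aurora -- a sketch of a minimal Aurora configuration.
# All names and values below are illustrative placeholders.

hello = Process(
  name = 'hello',
  cmdline = 'python2.7 my_script.py')

hello_task = Task(
  name = 'hello_task',
  processes = [hello],
  resources = Resources(cpu = 1.0, ram = 128*MB, disk = 64*MB))

jobs = [Job(
  cluster = 'devcluster',    # placeholder cluster name
  environment = 'devel',
  role = 'www-data',         # placeholder role
  name = 'hello',
  task = hello_task,
  instances = 2)]
```

Here the Process describes one command line, the Task groups processes with their resource envelope, and the Job binds the task template to a cluster along with the number of instances to run.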
Each Task has a sandbox created when the Task starts and garbage
collected when it finishes. All of a Task’s processes run in its
sandbox, so processes can share state by using a shared current
working directory.
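The shared-working-directory pattern can be sketched with plain Python subprocesses. This only illustrates the idea; it is not how Thermos actually launches processes:

```python
# Sketch: two separate processes sharing state through a common working
# directory, the way Task processes do inside an Aurora sandbox.
# (Illustrative only -- Thermos itself manages real sandboxes.)
import subprocess
import sys
import tempfile
from pathlib import Path

sandbox = Path(tempfile.mkdtemp(prefix="sandbox-"))

# "Process" 1: writes a state file into the shared sandbox.
subprocess.run(
    [sys.executable, "-c", "open('state.txt', 'w').write('ready')"],
    cwd=sandbox, check=True)

# "Process" 2: reads that state from the same working directory.
result = subprocess.run(
    [sys.executable, "-c", "print(open('state.txt').read())"],
    cwd=sandbox, check=True, capture_output=True, text=True)

shared_state = result.stdout.strip()
```

Because both processes were started with the same working directory, the second one sees the file the first one wrote.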
The sandbox garbage collection policy considers many factors, most
importantly age and size. It makes a best-effort attempt to keep
sandboxes around as long as possible post-task so that service owners
can inspect data and logs, should the Task have completed abnormally.
But you can’t design your applications assuming sandboxes will be
around forever; instead, build log saving or other checkpointing
mechanisms directly into your application or into your Job
description.
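An application-level log-saving step might look like the following sketch; the directory layout and the save_logs helper are hypothetical, not an Aurora API:

```python
# Sketch: copying logs out of the (ephemeral) sandbox into durable
# storage before the sandbox is garbage collected. The save_logs helper
# and the directory layout are illustrative, not part of Aurora.
import shutil
import tempfile
from pathlib import Path

def save_logs(sandbox_dir, durable_dir):
    """Copy every *.log file from the sandbox into durable storage."""
    durable = Path(durable_dir)
    durable.mkdir(parents=True, exist_ok=True)
    saved = []
    for log in sorted(Path(sandbox_dir).glob("*.log")):
        shutil.copy2(log, durable / log.name)
        saved.append(log.name)
    return saved

# Simulate a sandbox holding one log file, then save it before teardown.
sandbox = Path(tempfile.mkdtemp(prefix="sandbox-"))
(sandbox / "app.log").write_text("service output\n")
saved = save_logs(sandbox, tempfile.mkdtemp(prefix="durable-"))
```

In a real service this step would run periodically or on shutdown, targeting whatever durable store your environment provides.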