Marathon uses Dropwizard Metrics
for its metrics. You can query the current metric values via the
/metrics HTTP endpoint.
For the specific syntax see the metrics command-line flags section.
Although we try to prevent unnecessary disruptions, we do not provide stability guarantees for metric names between major and minor releases.
Marathon has the following metric types:
counter is a monotonically increasing integer, for instance, the
number of Mesos revive calls performed since Marathon became
a leader.gauge is a current measurement, for instance, the number of apps
currently known to Marathon.histogram is a distribution of values in a stream of measurements,
for instance, the number of apps in group deployments.meter measures the rate at which a set of events occur.timer is a combination of a meter and a histogram, which measure
the duration of events and the rate of their occurrence.Histograms and timers are backed with reservoirs leveraging HdrHistogram.
A metric measures something either in abstract quantities, or in the following units:
bytessecondsAll metric names are prefixed with marathon by default. The prefix can
be changed using --metrics_name_prefix command-line flag.
Metric name components are joined with dots. Components may have dashes in them.
A metric type and a unit of measurement (if any) are appended to a metric name. A couple of examples:
marathon.apps.active.gaugemarathon.http.event-streams.responses.size.counter.bytesThe Prometheus reporter is enabled by default, and it can be disabled
with --disable_metrics_prometheus command-line flag. Metrics in the
Prometheus format are available at /metrics/prometheus.
Dots and dashes in metric names are replaced with underscores.
The StatsD reporter can be enabled with --metrics_statsd command-line
flag. It sends metrics over UDP to the host and port specified with
--metrics_statsd_host and --metrics_statsd_port respectively.
Counters are forwarded as gauges.
The DataDog reporter can be enabled with --metrics_datadog
command-line flag. It sends metrics over UDP to the host and port
specified with --metrics_datadog_host and --metrics_datadog_port
respectively.
Marathon can send metrics to a DataDog agent over UDP, or directly to
the DataDog cloud over HTTP. It is specified using
--metrics_datadog_protocol. Its possible values are udp (default)
and api. If api is chosen, your DataDog API key can be supplied with
--metrics_datadog_api_key.
Dashes in metric names are replaced with underscores. Counters are forwarded as gauges.
marathon.apps.active.gauge — the number of active apps.marathon.deployments.active.gauge — the number of active
deployments.marathon.deployments.counter — the count of deployments received
since the current Marathon instance became a leader.marathon.deployments.dismissed.counter — the count of deployments
dismissed since the current Marathon instance became a leader;
a deployment might be dismissed by Marathon, when there are too many
concurrent deployments.marathon.groups.active.gauge — the number of active groups.marathon.instances.running.gauge — the number of running instances
at the moment.marathon.instances.staged.gauge — the number of instances staged at
the moment.marathon.leadership.duration.gauge.seconds — the duration of
current leadership.marathon.persistence.gc.runs.counter — the count of Marathon GC runs
since it became a leader.marathon.persistence.gc.compaction.duration.timer.seconds —
a histogram of Marathon GC compaction phase durations, and a meter for
compaction durations.marathon.persistence.gc.scan.duration.timer.seconds — a histogram of
Marathon GC scan phase durations, and a meter for scan durations.marathon.pods.active.gauge — the number of active pods.marathon.tasks.launched.counter — the count of tasks launched by
the current Marathon instance since it became a leader.marathon.uptime.gauge.seconds — uptime of the current Marathon
instance.marathon.instances.launch-overdue.gauge - the number of overdue instances. Overdue
instances are those that has been either launched or staged but hasn’t
become running within the task_launch_timeout or task_launch_confirmation_timeout
respectively.marathon.instances.inflight-kills.gauge - the current number of instances for which
Marathon has sent a kill request but hasn’t received a confirmation from
Mesos yet.marathon.instances.inflight-kill-attempts.gauge - the total number of kill attempts that
Marathon has sent for all currently existing doomed instances.marathon.tasks.launched.counter — the count of Mesos tasks
launched by the current Marathon instance since it became a leader.marathon.mesos.calls.revive.counter — the count of Mesos revive
calls made since the current Marathon instance became a leader.marathon.mesos.calls.suppress.counter — the count of Mesos
suppress calls made since the current Marathon instance became
a leader.marathon.mesos.offer-operations.launch-group.counter — the count of
LaunchGroup offer operations made since the current Marathon
instance became a leader.marathon.mesos.offer-operations.launch.counter — the count of
Launch offer operations made since the current Marathon instance
became a leader.marathon.mesos.offer-operations.reserve.counter — the count of
Reserve offer operations made since the current Marathon instance
became a leader.marathon.mesos.offers.declined.counter — the count of offers
declined since the current Marathon instance became a leader.marathon.mesos.offers.incoming.counter — the count of offers
received since the current Marathon instance became a leader.marathon.mesos.offers.used.counter — the count of offers used since
the current Marathon instance became a leader.marathon.mesos.task-updates.<task-state>.counter — the count of task
status updates received per each task state (task-running,
task-failed and so on).marathon.mesos.task.oom.memory-limit-unfulfillable - The count of
tasks OOM-killed because, while the container stayed within its
specified memory limit, the agent was too memory constrained to
support it.marathon.mesos.task.oom.memory-limit-exceeded - The count of tasks
OOM-killed because it exceeded the specified memory limit (or,
request, if limit was not specified).marathon.http.responses.event-stream.size.counter.bytes — the size
of data sent to clients over event streams since the current Marathon
instance became a leader.marathon.http.requests.size.counter.bytes — the total size of
all requests since the current Marathon instance became a leader.marathon.http.responses.size.counter.bytes — the total size of all
responses since the current Marathon instance became a leader.marathon.http.responses.size.gzipped.counter.bytes — the total size
of all gzipped responses since the current Marathon instance became
a leader.marathon.http.requests.active.gauge — the number of active requests.marathon.http.responses.1xx.rate.meter — the rate of 1xx responses.marathon.http.responses.2xx.rate.meter — the rate of 2xx responses.marathon.http.responses.3xx.rate.meter — the rate of 3xx responses.marathon.http.responses.4xx.rate.meter — the rate of 4xx responses.marathon.http.responses.5xx.rate.meter — the rate of 5xx responses.marathon.http.requests.duration.timer.seconds — a histogram of
request durations, and a meter for request durations.marathon.http.requests.get.duration.timer.seconds — the same but for
GET requests only.marathon.http.requests.post.duration.timer.seconds — the same but
for POST requests only.marathon.http.requests.put.duration.timer.seconds — the same but for
PUT requests only.marathon.http.requests.delete.duration.timer.seconds — the same but
for DELETE requests only.marathon.jvm.buffers.mapped.gauge — an estimate of the number of
mapped buffers.marathon.jvm.buffers.mapped.capacity.gauge.bytes — an estimate of
the total capacity of the mapped buffers in bytes.marathon.jvm.buffers.mapped.memory.used.gauge.bytes an estimate of
the memory that the JVM is using for mapped buffers in bytes, or -1L
if an estimate of the memory usage is not available.marathon.jvm.buffers.direct.gauge — an estimate of the number of
direct buffers.marathon.jvm.buffers.direct.capacity.gauge.bytes — an estimate of
the total capacity of the direct buffers in bytes.marathon.jvm.buffers.direct.memory.used.gauge.bytes an estimate of
the memory that the JVM is using for direct buffers in bytes, or -1L
if an estimate of the memory usage is not available.marathon.jvm.gc.<gc>.collections.gauge — the total number
of collections that have occurredmarathon.jvm.gc.<gc>.collections.duraration.gauge.seconds — the
approximate accumulated collection elapsed time, or -1 if the
collection elapsed time is undefined for the given collector.marathon.jvm.memory.total.init.gauge.bytes - the amount of memory
in bytes that the JVM initially requests from the operating system
for memory management, or -1 if the initial memory size is
undefined.marathon.jvm.memory.total.used.gauge.bytes - the amount of used
memory in bytes.marathon.jvm.memory.total.max.gauge.bytes - the maximum amount of
memory in bytes that can be used for memory management, -1 if the
maximum memory size is undefined.marathon.jvm.memory.total.committed.gauge.bytes - the amount of
memory in bytes that is committed for the JVM to use.marathon.jvm.memory.heap.init.gauge.bytes - the amount of heap
memory in bytes that the JVM initially requests from the operating
system for memory management, or -1 if the initial memory size is
undefined.marathon.jvm.memory.heap.used.gauge.bytes - the amount of used heap
memory in bytes.marathon.jvm.memory.heap.max.gauge.bytes - the maximum amount of
heap memory in bytes that can be used for memory management, -1 if
the maximum memory size is undefined.marathon.jvm.memory.heap.committed.gauge.bytes - the amount of heap
memory in bytes that is committed for the JVM to use.marathon.jvm.memory.heap.usage.gauge - the ratio of
marathon.jvm.memory.heap.used.gauge.bytes and
marathon.jvm.memory.heap.max.gauge.bytes.marathon.jvm.memory.non-heap.init.gauge.bytes - the amount of
non-heap memory in bytes that the JVM initially requests from the
operating system for memory management, or -1 if the initial memory
size is undefined.marathon.jvm.memory.non-heap.used.gauge.bytes - the amount of used
non-heap memory in bytes.marathon.jvm.memory.non-heap.max.gauge.bytes - the maximum amount of
non-heap memory in bytes that can be used for memory management, -1
if the maximum memory size is undefined.marathon.jvm.memory.non-heap.committed.gauge.bytes - the amount of
non-heap memory in bytes that is committed for the JVM to use.marathon.jvm.memory.non-heap.usage.gauge - the ratio of
marathon.jvm.memory.non-heap.used.gauge.bytes and
marathon.jvm.memory.non-heap.max.gauge.bytes.marathon.jvm.threads.active.gauge — the number of active threads.marathon.jvm.threads.daemon.gauge — the number of daemon threads.marathon.jvm.threads.deadlocked.gauge — the number of deadlocked
threads.marathon.jvm.threads.new.gauge — the number of threads in NEW
state.marathon.jvm.threads.runnable.gauge — the number of threads in
RUNNABLE state.marathon.jvm.threads.blocked.gauge — the number of threads in
BLOCKED state.marathon.jvm.threads.timed-waiting.gauge — the number of threads in
TIMED_WAITING state.marathon.jvm.threads.waiting.gauge — the number of threads in
WAITING state.marathon.jvm.threads.terminated.gauge — the number of threads in
TERMINATED state.Let’s consider a few examples of alerting rules. We will use Prometheus as our alerting tool, but please keep in mind that any alerting service of your choice can be used instead.
Please note that particular threshold values are dependent on many variables like the number of applications, a cluster size and so on.
For the syntax of Prometheus alerting rules please refer to its official documentation.
Alert whenever a Marathon instance goes down or restarts:
- alert: Marathon Uptime
expr: up{job="marathon"} == 0
for: 30s
labels:
severity: warning
annotations:
summary: It fires when a Marathon instance stops or restarts.
Alert when there are less than 2 out of 3 Marathon instances that are up and running:
- alert: Marathon Too Few Instances
expr: sum(up{job="marathon"}) < 2
for: 30s
labels:
severity: critical
annotations:
summary: It fires when there are fewer Marathon instances up and running than expected (< 2 instances).
Alert if Marathon’s heap usage is too high:
- alert: Marathon Heap Usage
expr: avg_over_time(marathon_jvm_memory_heap_used_gauge_bytes[10s]) > (2 * 1024 * 1024 * 1024)
for: 10s
labels:
severity: critical
annotations:
summary: It fires when Marathon's heap is too big (> 2GB).
Alert if Marathon has too many threads:
- alert: Marathon Threads
expr: avg_over_time(marathon_jvm_threads_active_gauge[10s]) > 100
for: 10s
labels:
severity: critical
annotations:
summary: It fires when Marathon has too many threads (> 100 threads).
Alert if Marathon dismisses deployments way too often:
- alert: Marathon Dismissed Deployments
expr: rate(marathon_deployments_dismissed_counter[1m]) > 3
for: 1m
labels:
severity: warning
annotations:
summary: It fires if there are too many deployments dismissed by Marathon (> 3 deployments over the last minute).
Alert if Marathon receives offers, but is unable to use them for some reason, i.e. insufficient resources:
- alert: Marathon Given No Suitable Offer
expr: rate(marathon_mesos_offers_used_counter[10m]) == 0 and rate(marathon_mesos_offers_incoming_counter[10m]) > 0
for: 10m
labels:
severity: warning
annotations:
summary: It fires if there are incoming Mesos offers, but none can be used by Marathon (for the last 10 minutes).
Alert if there are too many concurrent HTTP requests to Marathon:
- alert: Marathon HTTP Requests
expr: avg_over_time(marathon_http_requests_active_gauge[10s]) > 100
for: 10s
labels:
severity: critical
annotations:
summary: It fires if there are too many concurrent HTTP requests to Marathon (> 100 concurrent connections).
Alert if Marathon responds with a 5xx status code:
- alert: Marathon HTTP 5xx Responses
expr: marathon_http_responses_5xx_rate_m1_rate_meter > 0
for: 10s
labels:
severity: critical
annotations:
summary: It fires if Marathon responds with a 5xx status codes.
Alert if Marathon has too many staged instances, which might indicate that Marathon does not receive offers with enough resources or
- alert: Marathon Too Many Staged Instances
expr: marathon_instances_staged_gauge > 10
for: 10s
labels:
severity: warning
annotations:
summary: It fires if Marathon has too many staged instances (> 10 staged instances).