EOX GitLab Instance
ansible-public / prometheus role
Commit e7fe0718, authored 2 years ago by Karl Grube
added some alerts
parent 39a50506
Showing 2 changed files with 137 additions and 0 deletions:
files/core_alerts.yml (+130, -0)
tasks/all.yml (+7, -0)
files/core_alerts.yml (new file, 0 → 100644, +130 −0)
groups:
- name: core
  rules:
  # Prometheus Base Rules
  - alert: PrometheusTargetMissing
    expr: up == 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Prometheus target missing (instance {{ $labels.instance }})
      description: "A Prometheus target has disappeared. An exporter might be crashed.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: PrometheusConfigurationReloadFailure
    expr: prometheus_config_last_reload_successful != 1
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Prometheus configuration reload failure (instance {{ $labels.instance }})
      description: "Prometheus configuration reload error\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: PrometheusTooManyRestarts
    expr: changes(process_start_time_seconds{job=~"prometheus|pushgateway|alertmanager"}[15m]) > 2
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Prometheus too many restarts (instance {{ $labels.instance }})
      description: "Prometheus has restarted more than twice in the last 15 minutes. It might be crashlooping.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  # End Prometheus Base Rules
  - alert: HostOutOfMemory
    expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host out of memory (instance {{ $labels.instance }})
      description: "Node memory is filling up (< 10% left)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HostOutOfDiskSpace
    expr: (node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10 and ON (instance, device, mountpoint) node_filesystem_readonly == 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host out of disk space (instance {{ $labels.instance }})
      description: "Disk is almost full (< 10% left)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HostDiskWillFillIn24Hours
    expr: (node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10 and ON (instance, device, mountpoint) predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs"}[1h], 24 * 3600) < 0 and ON (instance, device, mountpoint) node_filesystem_readonly == 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host disk will fill in 24 hours (instance {{ $labels.instance }})
      description: "Filesystem is predicted to run out of space within the next 24 hours at current write rate\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HostCpuHighIowait
    expr: avg by (instance) (rate(node_cpu_seconds_total{mode="iowait"}[5m])) * 100 > 5
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Host CPU high iowait (instance {{ $labels.instance }})
      description: "CPU iowait > 5%. A high iowait means that you are disk or network bound.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HostNodeOvertemperatureAlarm
    expr: node_hwmon_temp_crit_alarm_celsius == 1
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Host node overtemperature alarm (instance {{ $labels.instance }})
      description: "Physical node temperature alarm triggered\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HostRaidArrayGotInactive
    expr: node_md_state{state="inactive"} > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Host RAID array got inactive (instance {{ $labels.instance }})
      description: "RAID array {{ $labels.device }} is in degraded state due to one or more disks failures. Number of spare drives is insufficient to fix issue automatically.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HostRaidDiskFailure
    expr: node_md_disks{state="failed"} > 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host RAID disk failure (instance {{ $labels.instance }})
      description: "At least one device in RAID array on {{ $labels.instance }} failed. Array {{ $labels.md_device }} needs attention and possibly a disk swap\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HostOomKillDetected
    expr: increase(node_vmstat_oom_kill[1m]) > 0
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Host OOM kill detected (instance {{ $labels.instance }})
      description: "OOM kill detected\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HostNetworkReceiveErrors
    expr: rate(node_network_receive_errs_total[2m]) / rate(node_network_receive_packets_total[2m]) > 0.01
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host Network Receive Errors (instance {{ $labels.instance }})
      description: "Host {{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf \"%.0f\" $value }} receive errors in the last two minutes.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HostNetworkTransmitErrors
    expr: rate(node_network_transmit_errs_total[2m]) / rate(node_network_transmit_packets_total[2m]) > 0.01
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host Network Transmit Errors (instance {{ $labels.instance }})
      description: "Host {{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf \"%.0f\" $value }} transmit errors in the last two minutes.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HostClockNotSynchronising
    expr: min_over_time(node_timex_sync_status[1m]) == 0 and node_timex_maxerror_seconds >= 16
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host clock not synchronising (instance {{ $labels.instance }})
      description: "Clock not synchronising. Ensure NTP is configured on this host.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: ContainerVolumeUsage
    expr: (1 - (sum(container_fs_inodes_free{name!=""}) BY (instance) / sum(container_fs_inodes_total) BY (instance))) * 100 > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Container Volume usage (instance {{ $labels.instance }})
      description: "Container Volume usage is above 80%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
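One detail that may be worth checking in `HostRaidDiskFailure`: its description interpolates `{{ $labels.md_device }}`, whereas `HostRaidArrayGotInactive` uses `{{ $labels.device }}`. Assuming node_exporter's mdadm collector labels `node_md_disks` with `device` (this is an assumption about the exporter, not something the commit states), the array name would render empty in the notification. A possible variant, offered as a sketch rather than as part of this commit:

```yaml
groups:
- name: core
  rules:
  - alert: HostRaidDiskFailure
    expr: node_md_disks{state="failed"} > 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host RAID disk failure (instance {{ $labels.instance }})
      # Assumption: the series carries a `device` label (e.g. md0), not `md_device`.
      description: "At least one device in RAID array on {{ $labels.instance }} failed. Array {{ $labels.device }} needs attention and possibly a disk swap\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```

The `LABELS = {{ $labels }}` suffix already prints every label on a firing alert, so the actual label name can be confirmed from a real notification before changing the rule.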
tasks/all.yml (+7 −0)
@@ -40,3 +40,10 @@
    name: prometheus
    state: started
    enabled: yes

- name: core alerts
  copy:
    src: core_alerts.yml
    dest: /etc/prometheus/alerts/core.yml
  tags: alerts
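The task above copies the rule file as-is, so a syntax error would only surface when Prometheus next loads its rules. A minimal sketch of a more defensive variant follows. It assumes `promtool` is installed and on the `PATH` of the target host, and that a handler named `reload prometheus` exists in the role; both are hypothetical and not part of this commit.

```yaml
- name: core alerts
  copy:
    src: core_alerts.yml
    dest: /etc/prometheus/alerts/core.yml
    # Assumption: promtool is available on the target host.
    # Ansible substitutes %s with a temporary copy of the file and
    # only installs it if the command exits successfully.
    validate: promtool check rules %s
  # Assumption: a handler with this name is defined elsewhere in the role.
  notify: reload prometheus
  tags: alerts
```

For the rules to take effect, the Prometheus configuration also needs a `rule_files` entry that matches the destination path, for example a glob such as `/etc/prometheus/alerts/*.yml`. Whether the role's `prometheus.yml` already contains one is not visible in this diff.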