Brian Brazil's Blog

December 28, 2020

Monitoring is a means, not an end

Does it really have to be perfect? While there is a certain intellectual satisfaction to be had by getting a system just right, the purpose of monitoring is to help you run whatever it is that you're monitoring. That could be a website used directly by your customers, a backend only used internally, or even […]
Published on December 28, 2020 01:15

December 21, 2020

Policy is for configuration, not metric names

Metric names are part of a time series's identity, so they shouldn't include information unrelated to identity. I previously looked at some of the considerations for choosing target labels. Similar considerations apply to the names of metrics: you want something descriptive but no longer than it needs to be. In this context, I'd like to talk about […]
Published on December 21, 2020 02:29

December 14, 2020

Prefer without and ignoring

Which of by/without and on/ignoring should you use? When writing a given bit of PromQL you know what labels you want in the output of an aggregation, so why not put them in the by? The same goes for on when doing vector matching. So why ever use without and ignoring? Application architectures and deployment […]
Published on December 14, 2020 01:40

December 7, 2020

Choosing your pushgateway grouping key

What does and doesn't make a good grouping key? Pushgateway grouping keys are fundamentally target labels, and similar considerations apply. They should be minimal and they should be constant. What the latter means may be a little non-obvious for batch jobs though. The purpose of the Pushgateway is to hold metrics from the end of […]
Published on December 07, 2020 03:15

November 30, 2020

New Features in Prometheus 2.23.0

Prometheus 2.23.0 is now out, following on from 2.22.0 with many fixes and improvements. There's been a variety of performance improvements to the TSDB. Compaction will now be faster, there will only be one checkpoint made after Prometheus restarts after not running for a good while, and the series API no longer loads any chunks. Remote write can now […]
Published on November 30, 2020 02:25

November 23, 2020

OpenMetrics is released

Coming soon to a standards body near you. OpenMetrics has been regularly worked on since June 2017. The goal is to take the Prometheus exposition text format, which is a de facto standard, and make it into a cleaner vendor-neutral standard. As of earlier this month, the standard is now available! The next steps are to bring […]
Published on November 23, 2020 04:12

November 16, 2020

ARP cache metrics from the node exporter

The node exporter has metrics about the ARP table. The ARP cache is part of how computers figure out which IPv4 addresses match up with which MAC or hardware addresses, so that packets can be efficiently switched on local network segments. The contents of this cache can be seen in /proc/net/arp, or in a more […]
Published on November 16, 2020 03:01

November 9, 2020

Why is my SNMP string showing as hexadecimal?

Why is my name showing as 0x79206e616d65? Strings in Prometheus are full UTF-8, however not all byte sequences are valid UTF-8. For example the byte 0xff cannot appear in a valid UTF-8 string. What does this have to do with SNMP? If we get arbitrary bytes that could in principle be anything, then we have […]
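
The point about UTF-8 validity can be checked directly. The following is a minimal Python sketch, not the SNMP exporter's own logic: the helper names `is_valid_utf8` and `decode_hex_string` are hypothetical and exist only for this illustration. It shows that the byte 0xff is rejected by a UTF-8 decoder, and how a 0x-prefixed hex rendering can be turned back into text when its bytes happen to be valid UTF-8.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the byte sequence decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def decode_hex_string(value: str) -> str:
    """Turn a 0x-prefixed hex rendering back into text.

    Raises UnicodeDecodeError if the bytes are not valid UTF-8.
    (Hypothetical helper for illustration only.)
    """
    hex_digits = value[2:] if value.startswith("0x") else value
    return bytes.fromhex(hex_digits).decode("utf-8")


if __name__ == "__main__":
    # The byte 0xff can never appear in valid UTF-8.
    print(is_valid_utf8(b"\xff"))
    # Plain ASCII text is always valid UTF-8.
    print(is_valid_utf8(b"name"))
    # Recover the text behind the hex value quoted in the excerpt.
    print(repr(decode_hex_string("0x79206e616d65")))
```

If the bytes behind a hex value are not valid UTF-8, `decode_hex_string` raises rather than guessing, which mirrors why arbitrary bytes cannot simply be passed through as a string.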
Published on November 09, 2020 02:28

November 2, 2020

Checking for specific HTTP status codes with the blackbox exporter

How would you check that an HTTP endpoint is returning a 204? We've previously looked at using the blackbox exporter to check that a 2xx response code is being returned; you might, however, consider only particular 2xx codes as okay. Your first thought might be to use an alert based on the probe_http_status_code metric, however […]
Published on November 02, 2020 01:51

October 26, 2020

Show multiple expressions for an instance in a Grafana table

Have you ever wanted to have a table showing multiple metrics across all of your instances? I'm going to show you how to show CPU usage and RSS for all of your instances. Here I'm using Grafana 7.1.1, and the end result will look something like: First create a Table panel, and define your […]
Published on October 26, 2020 01:49