A complete observability stack is essential when supporting code in production.
This guide introduces some basic observability concepts and demonstrates how to get the essential tooling up and running. We’ll set up structured logging and Prometheus metrics, export both to Grafana Cloud and visualise them in Grafana, and alert on frontend errors with Sentry.
Tools like these are your eyes and ears. Here’s how to set them up.
Our stack
The backend we’ll be monitoring is written in Go and hosted on AWS Elastic Beanstalk (EB), AWS’s managed application platform built on top of services such as Elastic Container Service (ECS). It allows one versioned “Application” to be deployed to, and scaled horizontally across, many environments.
An EB Application can run a plain binary or a collection of Docker containers orchestrated via Docker Compose. By running several containers side by side on each instance, we get flexibility similar to that of a Kubernetes Pod, easily adding sidecars as we please.
On the frontend we have a NextJS app deployed via Vercel.
For observability tooling, we’ll need a free Grafana Cloud account. This lets us store logs in Loki, store metrics in Prometheus, and query both via Grafana.
We’ll also be signing up for Sentry on the free “Developer” plan.
Higher paid tiers are required for advanced features such as collaboration and extended storage retention.
Logging
Used correctly, logs are an effective tool for detecting problems in your application.
Signals and Noise
The frequency of logs and the level they are emitted at determine the signal to noise ratio. Noisy logs lead to alert fatigue, and genuinely serious issues can go unnoticed.
We enforce the following rules to maintain a high signal to noise ratio:
- Error: Something is unexpectedly and seriously wrong. It needs attention.
- Warn: Something is off. It deserves attention.
- Info: Announce something without alerting anyone. Useful information.
  - Typically should not scale with the number of requests. Use metrics instead.
- Debug: For paths which may need regular local testing.
  - Typically these are only enabled in testing environments, meaning use-cases are few.
It’s worth auditing the quality of your logs before going into production to ensure you aren’t swamped in noise at the first sight of traffic.
Structured logs
These enable log messages to be accompanied by key-value pairs, or labels. For example, a structured log line might look something like this (the fields here are purely illustrative):
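{"level":"info","ts":"2023-06-01T12:00:00Z","msg":"event processed","type":"page_view","org_id":"org_123","env":"prod"}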
Labels allow logs to be indexed and filtered easily and efficiently.
This is particularly important when using Grafana Loki, which indexes only the labels rather than the whole message. Alternative solutions such as Elasticsearch (used with Kibana) allow complex searches across the entire message, at the cost of larger and more expensive indexes.
We use zap to produce structured logs.
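A minimal sketch of emitting such a log with zap (our services call it through a thin internal wrapper, and the message and fields here are illustrative):

package main

import "go.uber.org/zap"

func main() {
    // NewProduction returns a JSON-encoding, production-ready logger.
    logger, err := zap.NewProduction()
    if err != nil {
        panic(err)
    }
    defer logger.Sync()

    // The message is accompanied by typed key-value labels.
    logger.Info("event processed",
        zap.String("type", "page_view"),
        zap.String("org_id", "org_123"),
    )
}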
Scraping logs
Here’s what happens:
- The Go binary writes its logs to stdout
- Docker captures that output to files on disk; a Promtail container tails those files and pushes new entries to Grafana Loki
- The Grafana UI queries Loki, allowing for filtering, searching, and dashboard building.
Alongside our app, we attach a Promtail container to the Docker Compose stack.
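A rough sketch of what that sidecar can look like in docker-compose.yaml (the image tag, config path and mounts are illustrative):

# ./docker-compose.yaml (sketch)
services:
  backend:
    ...
    labels:
      logging: promtail # mark this container's logs for scraping (see below)
  promtail:
    image: grafana/promtail:2.7.1
    volumes:
      - ./promtail:/etc/promtail
      - /var/run/docker.sock:/var/run/docker.sock
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
    command: -config.file=/etc/promtail/promtail.yaml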
We then configure Promtail in its own nested directory; a sketch of that config follows the list below. See the Promtail docs for more details.
Here we
- Serve requests on random ports. (There’s also a disable flag)
- Define a local positions file to save scraping progress
- Define the connection information for our Loki instance in Grafana Cloud
- Define where to read logs from, and which logs to read. We find our logs in /var/lib/docker/containers and only scrape logs from containers with label logging: promtail.
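Pieced together, a Promtail config along those lines can look roughly like this. It is a sketch rather than our exact file: the Docker service-discovery filter shown here is one way to do the label-based selection, and the placeholders, paths and label names will differ for your setup.

# ./promtail/promtail.yaml (sketch)
server:
  http_listen_port: 0 # 0 = pick a random free port (there is also a disable flag)
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml # where scraping progress is saved

clients:
  - url: https://<your-loki-host>.grafana.net/loki/api/v1/push
    basic_auth:
      username: <LOKI_USER>
      password: <LOKI_PASSWORD>
    external_labels:
      env: <ENV_NAME> # attach the environment to every log line

scrape_configs:
  - job_name: docker
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 5s
        filters:
          - name: label
            values: ["logging=promtail"] # only containers carrying this label
    relabel_configs:
      - source_labels: ["__meta_docker_container_name"]
        regex: "/(.*)"
        target_label: "container"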
Observe!
Export the necessary environment variables and test it out with docker compose up. Then trigger some logs and check out Grafana.
Your logs should be visible in the “Explore” section, with your Loki instance selected as the data source.
Modify and try the following queries:
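For instance, something along these lines (the container label comes from the Promtail sketch above and the values are illustrative, so adjust them to your own label set). The first query returns every log line from the backend container; the second keeps only lines containing "error":

{container="backend"}
{container="backend"} |= "error"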
Alerting
Next, it would be wise to make some serious noise if any Error logs are produced.
In Grafana, under the “Alerts” section, we can create an alert rule on the presence of error level logs.
Note that this alert rule will have the label env available.
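As a rough illustration, the alert rule’s query could count error-level lines per environment over a window, something like this (it assumes JSON-encoded logs and the env and container labels from the earlier sketches):

sum by (env) (count_over_time({container="backend"} | json | level="error" [5m]))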
We then need to configure somewhere to send the alert. This is found under “Alerting” > “Contact Points”. Add a simple email contact point to begin.
Each triggered alert rule is routed to a Contact Point using “Notification policies”. These policies branch on the labels available on the alert rule, for example env.
We can use Notification Policies to route all alerts with env=prod to a production support mailing list, and all alerts with env=staging to a staging list.
Metrics
Adding Prometheus metrics to our application unlocks a huge amount of potential. We can track request latencies, observe throughput and queue backlogs, or simply see how much traffic our app is handling.
Metrics can be accumulated in a range of data structures, for example
- Counters, for continually increasing metrics
- Gauges, for increasing and decreasing metrics
- Histograms, for bucketed observations such as request durations
Each data type also has a “vector” equivalent, allowing it to be segmented via labels.
⚠️ Label keys and values should be chosen from a constant and limited set, in order to reduce strain on Prometheus’s indexes. In other words, don’t create a new label value for every request.
Recording metrics
Our app needs to serve a /metrics endpoint.
package prometheus

import (
    "context"
    "net/http"

    "github.com/prometheus/client_golang/prometheus/promhttp"

    "userled.io/common/log"
)

func ListenAndServe(ctx context.Context) {
    log.Info(ctx, "Serving prometheus metrics at /metrics")
    http.Handle("/metrics", promhttp.Handler())
    if err := http.ListenAndServe(":2112", nil); err != nil {
        log.Error(ctx, "error serving metrics", "error", err)
    }
}
Imagine a generic processor which consumes events from a queue. We can maintain a counter of how many events this processor has seen.
package processor

import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var eventsCounter = promauto.NewCounterVec(
    prometheus.CounterOpts{
        Namespace: "userled",
        Name:      "events_total",
        Help:      "The total number of events received by X processor",
    },
    []string{"type", "org_id"},
)

func recordEventMetrics(eventType, orgId string) {
    eventsCounter.With(prometheus.Labels{
        "type":   eventType,
        "org_id": orgId,
    }).Inc()
}
Then we simply call recordEventMetrics each time we see an event.
The produced metric can be viewed in full or filtered by its labels. It can be plotted as an ever-increasing counter over time, or queried via its rate of increase.
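For example, a PromQL query for the per-type event rate over the last five minutes, using the metric defined above:

sum by (type) (rate(userled_events_total[5m]))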
Add counters, gauges and histograms to your heart’s desire. But get hooked up to Grafana first …
Scraping metrics
Does this look familiar?
Let’s add a prometheus container to our app.
# ./docker-compose.yaml
version: "3.8"
services:
  backend:
    ...
    ports:
      - 2112:2112 # ensure you expose your metrics port
    depends_on:
      - promtail
      - prometheus
  promtail:
    ...
  prometheus:
    image: prom/prometheus:v2.41.0
    ports:
      - 9000:9090
    volumes:
      - ./prometheus:/etc/prometheus
    command: |
      --config.file=/etc/prometheus/prometheus.yaml

# ./prometheus/prometheus.yaml
global:
  scrape_interval: 15s
  scrape_timeout: 10s
  external_labels:
    env: <ENV_NAME>

remote_write:
  - url: https://prometheus-your-url.grafana.net/api/prom/push
    basic_auth:
      username: <PROMETHEUS_USER>
      password: <PROMETHEUS_PASSWORD>

scrape_configs:
  - job_name: scrape
    static_configs:
      - targets:
          - prometheus:9090
          - backend:2112
Here we
- Scrape metrics every 15s, with a timeout of 10s
- Inject the environment name and auth credentials using sed, as Prometheus does not support environment variables in its config 🙃. This is handled by our CircleCI config (a sketch of the substitution follows this list)
- Define the remote prometheus store to push metrics to
- Define two containers to scrape: prometheus and backend
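As a sketch, the substitution step could be as simple as a few sed calls in the CI job (the environment variable names here are illustrative):

sed -i "s|<ENV_NAME>|${ENV_NAME}|g" prometheus/prometheus.yaml
sed -i "s|<PROMETHEUS_USER>|${PROMETHEUS_USER}|g" prometheus/prometheus.yaml
sed -i "s|<PROMETHEUS_PASSWORD>|${PROMETHEUS_PASSWORD}|g" prometheus/prometheus.yaml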
Profit 📈
Beautify in Grafana. It’s that simple.
Frontend Alerting
Let’s not forget about our frontend: it’s the face of our app! If it errors, we need to know.
Ideally we would push key frontend logs to Loki and visualise them through Grafana, keeping all our observability tools in one place; however, that dream did not play well with our deployment setup.
Fortunately, we are huge Sentry fans. Sentry rounds up errors from client browsers, edge services, and your frontend deployment, and sends alerts to Slack when any errors are logged.
I won’t attempt to improve their installation process because it’s already handled flawlessly by them. The NextJS integration even involves an installation wizard ✨.
Closing Words
This post only scratches the surface of observability; we completely omit tracing!
Let us know if you’d like to see a part 2 on this.
For further reading, Google’s Site Reliability Engineering textbook is a great place to start. Note that it’s not just for SREs: everyone involved in creating and maintaining production software needs to understand these best practices in order to do their job well.