A complete observability stack is essential when supporting code in production.
This guide introduces some basic observability concepts and demonstrates how to get the essential tooling up and running. We’ll set up structured logging and Prometheus metrics, export both to Grafana Cloud and visualise them in Grafana, and alert on frontend errors with Sentry.
Tools like these are your eyes and ears. Here’s how to set them up.
Our stack
The backend we’ll be monitoring is written in Go and hosted on AWS Elastic Beanstalk (EB), AWS’s managed application platform built on top of services such as Elastic Container Service (ECS). It allows one versioned “Application” to be deployed to, and scaled horizontally across, many environments.
An EB Application can run a plain binary or a collection of Docker containers orchestrated via Docker Compose. By running several containers side by side on each instance, we get flexibility similar to that of a Kubernetes Pod, easily adding sidecars as we please.
On the frontend we have a NextJS app deployed via Vercel.
For observability tooling, we’ll need a free Grafana Cloud account. This lets us store logs in Loki, store metrics in Prometheus, and query both via Grafana.
We’ll also be signing up for Sentry on the free “Developer” plan.
Higher paid tiers are required for advanced features such as collaboration and extended storage retention.
Logging
Used correctly, logs are an effective tool for detecting problems in your application.
Signals and Noise
The frequency of logs and the level they are emitted at determine the signal to noise ratio. Noisy logs lead to alert fatigue, and genuinely serious issues can go unnoticed.
We enforce the following rules to maintain a high signal to noise ratio:
- Error: Something is unexpectedly and seriously wrong. It needs attention.
- Warn: Something is off. It deserves attention.
- Info: Announce something without alerting anyone. Useful information.
  - Typically should not scale with the number of requests. Use metrics instead.
- Debug: For paths which may need regular local testing.
  - Typically these are only enabled in testing environments, meaning use-cases are few.
It’s worth auditing the quality of your logs before going into production to ensure you aren’t swamped in noise at the first sight of traffic.
Structured logs
These enable log messages to be accompanied by key-value pairs, or labels. For example, a structured log line might look something like this (the fields here are purely illustrative):
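{"level":"info","ts":"2023-06-01T12:00:00Z","msg":"event processed","type":"page_view","org_id":"org_123","env":"prod"}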
Labels allow logs to be indexed and filtered easily and efficiently.
This is particularly important when using Grafana Loki, which indexes only the labels rather than the whole message. Alternative solutions such as Elasticsearch (used with Kibana) allow complex searches across the entire message, at the cost of larger and more expensive indexes.
We use zap to produce structured logs.
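A minimal sketch of emitting such a log with zap (our services call it through a thin internal wrapper, and the message and fields here are illustrative):

package main

import "go.uber.org/zap"

func main() {
    // NewProduction returns a JSON-encoding, production-ready logger.
    logger, err := zap.NewProduction()
    if err != nil {
        panic(err)
    }
    defer logger.Sync()

    // The message is accompanied by typed key-value labels.
    logger.Info("event processed",
        zap.String("type", "page_view"),
        zap.String("org_id", "org_123"),
    )
}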
Scraping logs
Here’s what happens:
- The Go binary writes its logs to stdout
- Docker captures that output to files on disk; a Promtail container tails those files and pushes new entries to Grafana Loki
- The Grafana UI queries Loki, allowing for filtering, searching, and dashboard building.
Alongside our app, we attach a Promtail container to the Docker Compose stack.
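A rough sketch of what that sidecar can look like in docker-compose.yaml (the image tag, config path and mounts are illustrative):

# ./docker-compose.yaml (sketch)
services:
  backend:
    ...
    labels:
      logging: promtail # mark this container's logs for scraping (see below)
  promtail:
    image: grafana/promtail:2.7.1
    volumes:
      - ./promtail:/etc/promtail
      - /var/run/docker.sock:/var/run/docker.sock
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
    command: -config.file=/etc/promtail/promtail.yaml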
We then configure Promtail in its own nested directory; a sketch of that config follows the list below. See the Promtail docs for more details.
Here we
- Serve requests on random ports. (There’s also a disable flag)
- Define a local positions file to save scraping progress
- Define the connection information for our Loki instance in Grafana Cloud
- Define where to read logs from, and which logs to read. We find our logs in /var/lib/docker/containers and only scrape logs from containers with label logging: promtail.
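Pieced together, a Promtail config along those lines can look roughly like this. It is a sketch rather than our exact file: the Docker service-discovery filter shown here is one way to do the label-based selection, and the placeholders, paths and label names will differ for your setup.

# ./promtail/promtail.yaml (sketch)
server:
  http_listen_port: 0 # 0 = pick a random free port (there is also a disable flag)
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml # where scraping progress is saved

clients:
  - url: https://<your-loki-host>.grafana.net/loki/api/v1/push
    basic_auth:
      username: <LOKI_USER>
      password: <LOKI_PASSWORD>
    external_labels:
      env: <ENV_NAME> # attach the environment to every log line

scrape_configs:
  - job_name: docker
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 5s
        filters:
          - name: label
            values: ["logging=promtail"] # only containers carrying this label
    relabel_configs:
      - source_labels: ["__meta_docker_container_name"]
        regex: "/(.*)"
        target_label: "container"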
Observe!
Export the necessary environment variables and test it out with docker compose up. Then trigger some logs and check out Grafana.
Your logs should be visible in the “Explore” section, with your Loki instance selected as the data source.
Modify and try the following queries:
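For instance, something along these lines (the container label comes from the Promtail sketch above and the values are illustrative, so adjust them to your own label set). The first query returns every log line from the backend container; the second keeps only lines containing "error":

{container="backend"}
{container="backend"} |= "error"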
Alerting
Next, it would be wise to make some serious noise if any Error logs are produced.
In Grafana, under the “Alerts” section, we can create an alert rule on the presence of error level logs.
Note that this alert rule will have the label env available.
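As a rough illustration, the alert rule’s query could count error-level lines per environment over a window, something like this (it assumes JSON-encoded logs and the env and container labels from the earlier sketches):

sum by (env) (count_over_time({container="backend"} | json | level="error" [5m]))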
We then need to configure somewhere to send the alert. This is found under “Alerting” > “Contact Points”. Add a simple email contact point to begin.
Each triggered alert rule is routed to a Contact Point using “Notification policies”. These policies branch on the labels available on the alert rule, for example env.
We can use Notification Policies to route all alerts with env=prod to a production support mailing list, and all alerts with env=staging to a staging list.
Metrics
Adding Prometheus metrics to our application unlocks a huge amount of potential. We can track request latencies, observe throughput and queue backlogs, or simply see how much traffic our app is handling.
Metrics can be accumulated in a range of data structures, for example
- Counters, for continually increasing metrics
- Gauges, for increasing and decreasing metrics
- Histograms, for bucketed observations such as request durations
Each data type also has a “vector” equivalent, allowing it to be segmented via labels.
⚠️ Label keys and values should be chosen from a constant and limited set, in order to reduce strain on Prometheus’s indexes. In other words, don’t create a new label value for every request.
Recording metrics
Our app needs to serve a /metrics endpoint.
package prometheus

import (
    "context"
    "net/http"

    "github.com/prometheus/client_golang/prometheus/promhttp"

    "userled.io/common/log"
)

func ListenAndServe(ctx context.Context) {
    log.Info(ctx, "Serving prometheus metrics at /metrics")
    http.Handle("/metrics", promhttp.Handler())
    if err := http.ListenAndServe(":2112", nil); err != nil {
        log.Error(ctx, "error serving metrics", "error", err)
    }
}
Imagine a generic processor which consumes events from a queue. We can maintain a counter of how many events this processor has seen.
package processor

import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var eventsCounter = promauto.NewCounterVec(
    prometheus.CounterOpts{
        Namespace: "userled",
        Name:      "events_total",
        Help:      "The total number of events received by X processor",
    },
    []string{"type", "org_id"},
)

func recordEventMetrics(eventType, orgId string) {
    eventsCounter.With(prometheus.Labels{
        "type":   eventType,
        "org_id": orgId,
    }).Inc()
}
Then we simply call recordEventMetrics each time we see an event.
The produced metric can be viewed in full or filtered by its labels. It can be plotted as an ever-increasing counter over time, or queried via its rate of increase.
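For example, a PromQL query for the per-type event rate over the last five minutes, using the metric defined above:

sum by (type) (rate(userled_events_total[5m]))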
Add counters, gauges and histograms to your heart’s desire. But get hooked up to Grafana first …
Scraping metrics
Does this look familiar?
Let’s add a prometheus container to our app.
# ./docker-compose.yaml
version: "3.8"
services:
  backend:
    ...
    ports:
      - 2112:2112 # ensure you expose your metrics port
    depends_on:
      - promtail
      - prometheus
  promtail:
    ...
  prometheus:
    image: prom/prometheus:v2.41.0
    ports:
      - 9000:9090
    volumes:
      - ./prometheus:/etc/prometheus
    command: |
      --config.file=/etc/prometheus/prometheus.yaml

# ./prometheus/prometheus.yaml
global:
  scrape_interval: 15s
  scrape_timeout: 10s
  external_labels:
    env: <ENV_NAME>

remote_write:
  - url: https://prometheus-your-url.grafana.net/api/prom/push
    basic_auth:
      username: <PROMETHEUS_USER>
      password: <PROMETHEUS_PASSWORD>

scrape_configs:
  - job_name: scrape
    static_configs:
      - targets:
          - prometheus:9090
          - backend:2112
Here we
- Scrape metrics every 15s, with a timeout of 10s
- Inject the environment name and auth credentials using sed, as Prometheus does not support environment variables in its config 🙃. This is handled by our CircleCI config (a sketch of the substitution follows this list)
- Define the remote prometheus store to push metrics to
- Define two containers to scrape: prometheus and backend
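As a sketch, the substitution step could be as simple as a few sed calls in the CI job (the environment variable names here are illustrative):

sed -i "s|<ENV_NAME>|${ENV_NAME}|g" prometheus/prometheus.yaml
sed -i "s|<PROMETHEUS_USER>|${PROMETHEUS_USER}|g" prometheus/prometheus.yaml
sed -i "s|<PROMETHEUS_PASSWORD>|${PROMETHEUS_PASSWORD}|g" prometheus/prometheus.yaml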
Profit 📈
Beautify in Grafana. It’s that simple.
Frontend Alerting
Let’s not forget about our frontend: it’s the face of our app! If it errors, we need to know.
Ideally we would push key frontend logs to Loki and visualise them through Grafana, keeping all our observability tools in one place; however, that dream did not play well with our deployment setup.
Fortunately, we are huge Sentry fans. Sentry rounds up errors from client browsers, edge services, and your frontend deployment, and sends alerts to Slack when any errors are logged.
I won’t attempt to improve their installation process because it’s already handled flawlessly by them. The NextJS integration even involves an installation wizard ✨.
Closing Words
This post only scratches the surface of observability; we completely omit tracing!
Let us know if you’d like to see a part 2 on this.
For further reading, Google’s Site Reliability Engineering textbook is a great place to start. Note that it’s not just for SREs: everyone involved in creating and maintaining production software needs to understand these best practices in order to do their job well.