
Build Your First Observability Stack Easily


A complete observability stack is essential when supporting code in production.

This guide introduces some basic observability concepts and demonstrates how to get some essential tooling up and running. We’ll be setting up structured logging and Prometheus metrics, exporting both and visualising them via Grafana, as well as alerting on frontend errors with Sentry.

Tools like these are your eyes and ears. Here’s how to set them up.

Our stack

How are we building Userled?

The backend we’ll be monitoring is written in Golang and hosted on AWS Elastic Beanstalk (EB). This is a managed layer over Elastic Container Service (ECS), allowing one versioned “Application” to be deployed to and scaled horizontally across many environments.

An EB Application can run a plain binary or a collection of Docker images orchestrated via Docker Compose. By running multiple containers side by side we achieve flexibility similar to that of a Kubernetes Pod, easily adding sidecars as we please.

On the frontend we have a Next.js app deployed via Vercel.

For observability tooling, we’ll need a free Grafana Cloud account. This lets us store logs in Loki, store metrics in Prometheus, and query both via Grafana.

We’ll also be signing up for Sentry on the free “Developer” plan.

Higher paid tiers are required for advanced features such as collaboration and extended storage retention.

Logging

Used correctly, logs are an effective tool for detecting problems in your application.

Signals and Noise

The frequency of logs and their specified level determine the signal to noise ratio. Noisy logs result in fatigue, and potentially serious issues may go unnoticed.

We enforce the following rules to maintain a high signal to noise ratio:

  • Error: Something is unexpectedly and seriously wrong. It needs attention.
  • Warn: Something is off. It deserves attention.
  • Info: Announce something without alerting anyone. Useful information. Typically this should not scale with the number of requests; use metrics instead.
  • Debug: For paths which may need regular local testing. Typically these are only enabled in testing environments, meaning use-cases are few.

It’s worth auditing the quality of your logs before going into production to ensure you aren’t swamped in noise at the first sight of traffic.

Structured logs

These enable log messages to be accompanied by key-value pairs, or labels. For example:
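(The exact fields below are made up for illustration.)

```json
{
  "level": "info",
  "ts": 1675864340.123,
  "msg": "event processed",
  "org_id": "org_123",
  "event_type": "page_view",
  "duration_ms": 42
}
```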

Labels allow logs to be easily and efficiently indexed and filtered.

This is particularly important if using Grafana Loki, which only indexes the labels rather than the whole message. Alternative solutions such as Elasticsearch (used with Kibana) allow complex searches on the entire message, requiring more complex and expensive indexes.

We use zap to produce structured logs.
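A minimal sketch of what that looks like (in practice we wrap zap behind our own log package, which appears in later snippets):

```go
package main

import "go.uber.org/zap"

func main() {
	// NewProduction returns a JSON-encoded logger with sensible defaults.
	logger, _ := zap.NewProduction()
	defer logger.Sync()

	// Each field becomes a key-value pair in the JSON output,
	// which Loki can then index and filter on.
	logger.Info("event processed",
		zap.String("org_id", "org_123"),
		zap.String("event_type", "page_view"),
	)
}
```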

Scraping logs

Here’s what happens:

  • The Golang binary writes logs to stdout by default
  • A Promtail container periodically reads this log output from disk and pushes it to Grafana Loki
  • The Grafana UI queries Loki, allowing for filtering, searching, and dashboard building.

Alongside our app, we attach a Promtail container …

… and configure it in a nested directory. See the Promtail docs for more details.

Here we

  1. Serve requests on random ports. (There’s also a disable flag)
  2. Define a local positions file to save scraping progress
  3. Define the connection information for our Loki instance in Grafana Cloud
  4. Define where to read logs from, and which logs to read. We find our logs in /var/lib/docker/containers and only scrape logs from containers with label logging: promtail.
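For reference, a stripped-down promtail.yaml covering those four points might look like the sketch below. The URL, credentials and job name are placeholders, and the label-based filtering from step 4 is omitted; consult the Promtail docs for the exact fields.

```yaml
# ./promtail/promtail.yaml -- illustrative only
server:
  http_listen_port: 0   # 0 picks a random port (step 1)
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml   # scraping progress is saved here (step 2)

clients:
  - url: https://<LOKI_USER>:<LOKI_API_KEY>@<your-loki-host>/loki/api/v1/push   # step 3

scrape_configs:
  - job_name: containers        # step 4
    static_configs:
      - targets:
          - localhost
        labels:
          job: docker
          __path__: /var/lib/docker/containers/*/*.log
```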

Observe!

Export the necessary environment variables and test it out with docker compose up. Then trigger some logs and check out Grafana.

Your logs should be visible in the “Explore” section, with your Loki instance selected as the data source.

Modify and try the following queries:
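For instance (the container label below is an assumption from our compose setup):

```logql
# All logs from the backend container
{container="backend"}

# Only error-level logs, parsing the JSON fields produced by zap
{container="backend"} | json | level="error"

# Log volume per minute, to gauge how noisy we are
count_over_time({container="backend"}[1m])
```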

Alerting

Next, it would be wise to make some serious noise if any Error logs are produced.

In Grafana, under the “Alerts” section, we can create an alert rule on the presence of error level logs.

[Screenshots: creating the alert rule in Grafana]

Note that this alert rule will have the label env available.
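As a rough example, the rule’s query could be a LogQL expression like the one below, firing whenever it returns a value above 0 (the label and field names are assumptions from our setup):

```logql
sum(count_over_time({env="prod"} | json | level="error" [5m]))
```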

We then need to configure somewhere to send the alert. This is found under “Alerting” > “Contact Points”. Add a simple email contact point to begin.

Each triggered alert rule is routed to a Contact Point using “Notification policies”. These policies branch on the labels available on the alert rule, for example env.

We can use Notification Policies to route all alerts with env=prod to a production support mailing list, and all alerts with env=staging to a staging list.

[Screenshot: notification policy routing in Grafana]

Metrics

Adding Prometheus metrics to our application unlocks a huge amount of potential. We can track request latencies, observe throughput and queue backlogs, or simply see how much traffic our app is handling.

Metrics can be accumulated in a range of data structures, for example

  • Counters, for continually increasing metrics
  • Gauges, for increasing and decreasing metrics
  • Histograms, for bucketed observations such as request durations

Each data type also has a “vector” equivalent, allowing it to be segmented via labels.

⚠️ Label keys and values should be chosen from a constant and limited set, in order to reduce strain on indexes in Prometheus. I.e. don’t create a new label value for every request.
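For instance, a request-latency histogram segmented by route and status might look like the sketch below (the names are hypothetical; the pattern mirrors the counter we define later):

```go
package metrics // hypothetical package, for illustration

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// requestDuration buckets request latencies, segmented by route and status.
// Both labels are drawn from small, fixed sets -- never per-request values.
var requestDuration = promauto.NewHistogramVec(
	prometheus.HistogramOpts{
		Namespace: "userled",
		Name:      "http_request_duration_seconds",
		Help:      "Time taken to serve HTTP requests",
		Buckets:   prometheus.DefBuckets,
	},
	[]string{"route", "status"},
)

func observeRequest(route, status string, took time.Duration) {
	requestDuration.WithLabelValues(route, status).Observe(took.Seconds())
}
```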

Recording metrics

Our app needs to serve a /metrics endpoint.

```go
package prometheus

import (
	"context"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"

	"userled.io/common/log"
)

func ListenAndServe(ctx context.Context) {
	log.Info(ctx, "Serving prometheus metrics at /metrics")
	http.Handle("/metrics", promhttp.Handler())
	if err := http.ListenAndServe(":2112", nil); err != nil {
		log.Error(ctx, "error serving metrics", "error", err)
	}
}
```

Imagine a generic processor which consumes events from a queue. We can maintain a counter of how many events this processor has seen.

```go
package processor

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var eventsCounter = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Namespace: "userled",
		Name:      "events_total",
		Help:      "The total number of events received by X processor",
	},
	[]string{"type", "org_id"},
)

func recordEventMetrics(eventType, orgId string) {
	eventsCounter.With(prometheus.Labels{
		"type":   eventType,
		"org_id": orgId,
	}).Inc()
}
```

Then we simply call recordEventMetrics each time we see an event.

The produced metric can be viewed in full, or filtered on by its labels. It can be seen as an incrementing counter over time, or it can be judged by its rate of increase.
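For example, the counter above is exposed as userled_events_total, so queries like these are possible (the org id is a placeholder):

```promql
# Per-second rate of events over the last 5 minutes, split by event type
sum by (type) (rate(userled_events_total[5m]))

# Total events seen for a single organisation
sum(userled_events_total{org_id="org_123"})
```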

Add counters, gauges and histograms to your heart’s desire. But get hooked up to Grafana first …

Scraping metrics

[Diagram: the metrics scraping setup, mirroring the logging pipeline]

Does this look familiar?

Let’s add a Prometheus container to our app.

```yaml
# ./docker-compose.yaml
version: "3.8"
services:
  backend:
    ...
    ports:
      - 2112:2112 # ensure you expose your metrics port
    depends_on:
      - promtail
      - prometheus
  promtail:
    ...
  prometheus:
    image: prom/prometheus:v2.41.0
    ports:
      - 9000:9090
    volumes:
      - ./prometheus:/etc/prometheus
    command: |
      --config.file=/etc/prometheus/prometheus.yaml
```

```yaml
# ./prometheus/prometheus.yaml
global:
  scrape_interval: 15s
  scrape_timeout: 10s
  external_labels:
    env: <ENV_NAME>
remote_write:
  - url: https://prometheus-your-url.grafana.net/api/prom/push
    basic_auth:
      username: <PROMETHEUS_USER>
      password: <PROMETHEUS_PASSWORD>
scrape_configs:
  - job_name: scrape
    static_configs:
      - targets:
          - prometheus:9090
          - backend:2112
```

Here we

  • Scrape metrics every 15s, with a timeout of 10s
  • Inject the environment name and auth credentials using sed, as Prometheus does not support environment variables 🙃. This is handled by our CircleCI config
  • Define the remote prometheus store to push metrics to
  • Define two containers to scrape: prometheus and backend

Profit 📈

Beautify in Grafana. It’s that simple.

[Screenshot: a Grafana dashboard built from these metrics]

Frontend Alerting

Let’s not forget about our frontend - it’s the face of our app! If it errors, we need to know.

Ideally we could push key frontend logs to Loki and visualise them through Grafana, keeping all observability tools in one place; however, this dream did not play well with our deployment setup.

Fortunately, we are huge Sentry fans. Sentry rounds up logs from client browsers and edge services as well as your frontend deployment, and sends alerts to Slack when any errors are logged.

I won’t attempt to improve their installation process because it’s already handled flawlessly by them. The Next.js integration even involves an installation wizard ✨.

Closing Words

This post only scratches the surface of observability - we completely omit tracing!

Let us know if you’d like to see a part 2 on this.

For further reading, Google’s Site Reliability Engineering textbook is a great place to start. Note that it’s not just for SREs - everyone involved in the creation and maintenance of production software needs to understand the best practices in order to do their job well.
