Observability at Scale: How the right stack can help

Observability is a critical component of any successful engineering organization because it allows you to identify and resolve problems quickly and efficiently. However, it becomes increasingly challenging as your systems grow in size and complexity.

Observability at scale at Recurly refers to the ability to collect, analyze, monitor, and understand vast amounts of data from distributed and dynamic systems to maintain the platform’s overall health and performance. Understanding and extrapolating from these vast amounts of data is key to Recurly’s ability to scale and support our customers as they grow.

Key components of observability

When discussing observability, there are several key types of data that we need to consider:

Logs: SaaS platforms generate logs that record events, activities, and errors occurring within the system. These logs contain valuable information that helps in understanding how different components of the cloud infrastructure and applications are performing.
Metrics: Modern computer systems expose performance metrics such as CPU utilization, memory consumption, network traffic, request latency, and thread saturation. Monitoring these metrics allows teams to detect anomalies, identify performance bottlenecks, and ensure optimal resource utilization.
Traces: Distributed tracing provides a way to follow the path of a request as it flows through various components of a cloud application. It helps identify performance issues, bottlenecks, and dependencies between different services.

Recurly observability at scale

System and application logs are essential for obtaining full context and details about specific events. They can also help identify and troubleshoot problems, track user activity, and comply with regulatory requirements.

On an average day at Recurly, we generate between 18 MB/s and 60 MB/s of logs in our production environment, which translates to approximately 70 TB per month. To gain insight into the health and performance of our platform from log volume at a terabyte scale, Recurly leverages Graylog as well as Google's Cloud Logging and BigQuery, depending on the use case.
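As a rough check on how a per-second log rate relates to a monthly volume, the conversion can be sketched as below. It assumes a 30-day month and decimal units (1 MB = 10^6 bytes, 1 TB = 10^12 bytes); the function names are illustrative only.

```python
SECONDS_PER_DAY = 86_400
DAYS_PER_MONTH = 30  # assumption: a 30-day month


def monthly_terabytes(megabytes_per_second: float) -> float:
    """Convert a sustained rate in MB/s to TB per month (decimal units)."""
    bytes_per_month = megabytes_per_second * 1e6 * SECONDS_PER_DAY * DAYS_PER_MONTH
    return bytes_per_month / 1e12


def average_rate_mb_per_s(terabytes_per_month: float) -> float:
    """Convert a monthly volume in TB to the average rate in MB/s."""
    bytes_per_second = terabytes_per_month * 1e12 / (SECONDS_PER_DAY * DAYS_PER_MONTH)
    return bytes_per_second / 1e6


if __name__ == "__main__":
    for rate in (18, 60):
        print(f"{rate} MB/s sustained -> {monthly_terabytes(rate):.1f} TB/month")
    print(f"70 TB/month -> {average_rate_mb_per_s(70):.1f} MB/s average")
```

Under these assumptions, a sustained 18 MB/s would produce about 46.7 TB per month and a sustained 60 MB/s about 155.5 TB, so a monthly total of roughly 70 TB corresponds to an average of about 27 MB/s, toward the lower end of the range.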

At Recurly, we collect and monitor metrics about our platform's resource utilization, response times, error rates, latency, and more, all of which help us evaluate the overall health and performance of the Recurly platform.

Our platform makes heavy use of Kubernetes. As a result, we rely on the Prometheus stack (Prometheus, Thanos, Alertmanager, and Grafana) for most of our metrics needs. Prometheus scrapes and stores in the neighborhood of 4M time series, which translates to about 1.5 TB of data per month. These metrics are evaluated in real time, and when certain conditions are met, alerts fire to notify the appropriate parties so that the necessary remediation action is taken.
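To make the idea of alerts firing "when certain conditions are met" concrete, here is a minimal Python sketch of a threshold alert that fires only after the condition has held for a set duration, similar in spirit to the `for` clause of a Prometheus alerting rule. It is a simplified illustration, not Prometheus's implementation and not Recurly's actual rules; the class name, threshold, and timings are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ThresholdAlert:
    """Fires once a value has stayed above `threshold` for `hold_seconds`."""

    threshold: float
    hold_seconds: float
    pending_since: Optional[float] = None  # time the condition first became true

    def evaluate(self, value: float, now: float) -> str:
        """Return 'inactive', 'pending', or 'firing' for one evaluation cycle."""
        if value <= self.threshold:
            # Condition no longer holds: reset the pending timer.
            self.pending_since = None
            return "inactive"
        if self.pending_since is None:
            self.pending_since = now
        if now - self.pending_since >= self.hold_seconds:
            return "firing"
        return "pending"


if __name__ == "__main__":
    # Hypothetical rule: error rate above 5% for at least 120 seconds.
    alert = ThresholdAlert(threshold=0.05, hold_seconds=120)
    samples = [(0, 0.02), (60, 0.08), (120, 0.09), (180, 0.10)]
    for timestamp, error_rate in samples:
        print(timestamp, alert.evaluate(error_rate, now=timestamp))
```

Requiring the condition to hold for a period before firing is a common way to avoid paging people for brief spikes.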

In addition to alerting, we employ auto-scaling functionality, which dynamically provisions capacity at various parts of the platform to prevent potential performance degradation. We also regularly review historical data to identify trends, recurring hotspots, or general areas for improvement.
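The article does not say which auto-scaling mechanism is used. As one illustration, the Kubernetes Horizontal Pod Autoscaler documents a proportional rule: the desired replica count is the current count scaled by the ratio of the observed metric to its target, rounded up. The Python sketch below applies that rule with min/max bounds; it leaves out the tolerance and stabilization behavior of the real autoscaler, and all numbers are hypothetical.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int,
    max_replicas: int,
) -> int:
    """Proportional scaling: ceil(current * observed / target), then clamp."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * (current_metric / target_metric))
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # Hypothetical: 4 replicas averaging 90% CPU against a 60% target.
    print(desired_replicas(4, 90, 60, min_replicas=1, max_replicas=10))
```

In this example the observed load is 1.5 times the target, so the sketch asks for 6 replicas instead of 4. When load drops below the target, the same rule scales back in.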

The final piece of our observability toolkit is our tracing implementation. We rely on a popular third-party SaaS solution for our tracing. Our applications are instrumented to send telemetry data about the requests flowing through the system to a central server. From there, we can visualize a request's path through our platform, which applications were involved, and how they performed.

Tracing allows us to get a very deep understanding of a request lifecycle, identifying individual elements that may be underperforming, such as methods, DB queries, cache lookups, or external calls.
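As a small illustration of how trace data points to an underperforming element, the Python sketch below models a trace as a list of spans and computes each span's "self time" (its duration minus the time spent in its direct children). The span names, durations, and data model are hypothetical and do not reflect any particular tracing vendor's format; the calculation also assumes child spans run one after another rather than in parallel.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Span:
    span_id: str
    parent_id: Optional[str]  # None for the root span
    name: str
    duration_ms: float


def self_times(spans: List[Span]) -> Dict[str, float]:
    """Map span_id -> duration minus the summed duration of direct children."""
    child_total: Dict[str, float] = {}
    for span in spans:
        if span.parent_id is not None:
            child_total[span.parent_id] = (
                child_total.get(span.parent_id, 0.0) + span.duration_ms
            )
    return {s.span_id: s.duration_ms - child_total.get(s.span_id, 0.0) for s in spans}


# Hypothetical trace for a single request.
TRACE = [
    Span("a", None, "POST /invoices", 120.0),
    Span("b", "a", "db.query invoices", 70.0),
    Span("c", "a", "cache.get account", 5.0),
    Span("d", "a", "http.call tax-service", 30.0),
]

if __name__ == "__main__":
    times = self_times(TRACE)
    names = {s.span_id: s.name for s in TRACE}
    for span_id, ms in sorted(times.items(), key=lambda item: item[1], reverse=True):
        print(f"{names[span_id]}: {ms:.0f} ms self time")
```

Sorting by self time puts the database query at the top of this made-up trace, which is the kind of signal that tells a team where to look first.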

Final observations

Observability is a critical component of any modern computer system. Our observability stack provides numerous benefits, including:

Reduced downtime: By providing early warning of potential problems, we can reduce downtime and improve our overall system reliability.
Improved performance: Reviewing historical and real-time data allows us to identify and optimize performance bottlenecks, significantly improving our system's overall performance.
Improved customer experience: By having clear visibility into the platform's health, we’re able to identify and resolve customer issues quickly and efficiently, leading to a better overall customer experience.
Increased security: An effective observability solution allows us to identify anomalies and mitigate security threats.

The Recurly platform is built from the ground up on a highly scalable, elastic infrastructure with resources being provisioned and decommissioned all the time. In addition to the sheer size, this state of constant flux presents added observability challenges.

“Iterate everything” is one of Recurly’s core values. With that spirit in mind, we’re always evolving and looking for improvement. This includes exploring new tools to improve our observability posture. The tools we leverage have changed over the years, and we plan to continue adopting and implementing new ones as we continue to help our customers scale.

Want to learn more? Check out Recurly's smart reporting and analytics solutions.