December 1, 2012

Second Data Center and Disaster Recovery

Dan Burkhart

Member of the Recurly board

TOPICS

Product updates & news

We’ve made a big investment in the future of Recurly.

Recurly is closing out the year having made a tremendous amount of progress. Not all progress has been evident externally, as our entire Engineering organization has been focused on making critical changes to our application architecture and preparing for our service to run across multiple data centers. These changes provide our customers with the following enhancements to our service architecture:

◦ Greatly improved fault isolation

◦ Data Backup and Data Resiliency

◦ Enterprise-class Disaster Recovery capabilities

Recurly’s architecture is now set up to scale horizontally and to distribute data across multiple data centers located in multiple geographies. These improvements provide the kind of risk mitigation that our customers expect from us.

We’re very proud of this milestone achievement and know that these ‘invisible’ capabilities provide the most critically important ‘feature’ of our entire service. These capabilities provide the foundation for trust and confidence that our customers place in our service.

Fault Isolation:

We’ve worked with many experts during this buildout, and one of the core concepts we’ve embraced in our design is the ability to ‘plan for failures’ and to fail elegantly when they do happen. In the event of a hardware failure, it is our responsibility to ensure that we have mitigated the risk and isolated the resulting impact through a combination of hardware and software design principles. (For example, if an encryption failure occurs in any one of our instances, only a percentage of transactions would be impacted, and because the card data is also replicated in other locations, the time to recover from this situation is greatly reduced.)

Data Backup and Data Resiliency:

Critical data in Recurly is either Site data (account, subscription, and contact data) or Card data. Each data type is sharded across multiple environments to accomplish fault isolation (described above). In addition, automatic backups are scheduled daily so that both Site data and Card data are backed up and validated every day. When credit card data is submitted to Recurly, it is replicated to machines across two geographic locations.
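To make the fault-isolation idea concrete, here is a minimal sketch of hash-based sharding. It is an illustration only, not a description of Recurly's actual implementation: the shard names, the `shard_for` function, and the `impacted_fraction` helper are all hypothetical. It shows why the failure of a single shard touches only a portion of accounts rather than all of them.

```python
import hashlib

# Hypothetical shard names for illustration; not Recurly's real topology.
SHARDS = ["shard-a", "shard-b", "shard-c", "shard-d"]


def shard_for(account_id: str, shards=SHARDS) -> str:
    """Map an account to one shard using a stable hash of its ID."""
    digest = hashlib.sha256(account_id.encode("utf-8")).hexdigest()
    return shards[int(digest, 16) % len(shards)]


def impacted_fraction(account_ids, failed_shard: str) -> float:
    """Return the fraction of accounts that live on the failed shard."""
    ids = list(account_ids)
    if not ids:
        return 0.0
    hit = sum(1 for account_id in ids if shard_for(account_id) == failed_shard)
    return hit / len(ids)


if __name__ == "__main__":
    accounts = [f"acct-{i}" for i in range(10_000)]
    # With four shards, roughly a quarter of accounts sit on any one shard.
    print(impacted_fraction(accounts, "shard-b"))
```

With four shards of similar size, losing one affects roughly a quarter of accounts while the others keep processing; more shards shrink that share further.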

Data resiliency has been achieved by automating not only the backup processes but also the validation checks across multiple systems, and by ensuring that data is replicated across multiple regions. These remote backups provide a ‘backstop’ against catastrophic data loss and are designed to take place on a daily basis. In the event of a catastrophic failure, this approach will minimize the time required to recover from a data incident.

Disaster Recovery:

We have also planned to ensure that a complete disaster would not put our customers’ businesses at risk. This has been accomplished by the addition of a second data center in New York City, along with modifying our application to be available for immediate deployment in Amazon Web Services (AWS) facilities. For example, if a massive earthquake were to occur in California, we would restore services to our impacted merchants by failing over to remote AWS locations. These failover environments access account and card data that has been backed up across multiple machines in our second data center across the country.

Policy Enforcement:

In addition to setting up a second data center, we have also implemented significant changes to ensure that our new architecture is providing the additional benefits it was designed to provide our customers. Data Backups and Resiliency only provide value if they are guaranteed to be working properly. Recurly has nightly backups in place, with automated verifications providing alerts if backups have not completed correctly. In addition, we run manual verifications of backup systems and isolation functionality at least once per month. This means that failovers are now routinely tested and confirmed to ensure that the system continues to function properly, as designed.
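The automated verification described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not Recurly's actual tooling: the `BackupRecord` type, its fields, the checksum comparison, and the 24-hour freshness window are all hypothetical choices made for the example.

```python
from datetime import datetime, timedelta
from typing import List, NamedTuple, Optional


class BackupRecord(NamedTuple):
    """Hypothetical summary of one nightly backup job."""
    name: str
    completed_at: Optional[datetime]  # None if the job never finished
    expected_checksum: str            # checksum of the source data
    actual_checksum: str              # checksum of the stored backup


def verify_backups(records, now: datetime,
                   max_age: timedelta = timedelta(hours=24)) -> List[str]:
    """Return one alert message per backup that fails verification."""
    alerts = []
    for record in records:
        if record.completed_at is None:
            alerts.append(f"{record.name}: backup did not complete")
        elif now - record.completed_at > max_age:
            alerts.append(f"{record.name}: last backup is older than {max_age}")
        elif record.expected_checksum != record.actual_checksum:
            alerts.append(f"{record.name}: checksum mismatch")
    return alerts
```

An empty list means every backup finished recently and matches its source; any entry would be routed to whatever alerting channel the operations team uses.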

These policies were exercised in our recent maintenance window (November 29, 2012), which validated failovers from power outages, firewall device failures, and encryption appliance failures in our Production environment.

We are pleased to be able to speak to these new capabilities. We are very proud to be able to deliver our service via an architecture that now supports truly enterprise-class fault isolation, redundant failovers, data protections and disaster recovery capabilities.

Changes to our service architecture will be occurring in stages. The initial stage is to back up all data types cross-country. This initial step has been completed. We are now creating multiple isolated instances of our entire application architecture in our West Coast data center. This is currently underway and should be completed in December 2012. Our next step is to create emergency service architectures, which will be made available using Amazon Web Services. The third step is to have multiple instances of our service application architecture available in our East Coast data center. All three of these initiatives are currently underway and are anticipated to be fully complete by January 2013.

We’re proud to be executing against these initiatives, which will directly benefit the resiliency of our service. Many thanks to our customers who have placed their trust in us to execute on these important capabilities.

- The Recurly Team
