Recurly’s customers rely on our search functionality to find accounts, transactions, and other information so they can solve their customers' problems. By March of 2014, however, Recurly’s search functionality had become an example of tech debt: it worked well enough to ignore, until it didn’t.
We’d been planning on rebuilding the search infrastructure for six months at that point, but other projects (sales tax, VAT, etc.) kept taking priority. As our old search server started getting overwhelmed, we were letting customers down. Searches started to take up to 60 seconds, and some then failed.
As the graph of support tickets shows, our customers let us know how painful this was. At this point, we made fixing search our #1 priority.
The search capability customers were depending on was powered by a two-year-old version of Elasticsearch on a single server that was quickly running out of capacity. Elasticsearch is an application used to analyze structured and unstructured data and deliver “actionable insights in real time”.
The new search system we’d started working on would replace the single server, and the old version of Elasticsearch, with a cluster of three servers running the current version of Elasticsearch. By making the move to a cluster, we could seamlessly scale out in the future just by adding servers.
As part of the changes, we wanted to use Elasticsearch to power all the main listing pages. This would allow some obvious enhancements, such as searching and sorting by more fields. It also let us update the filter counts in real time to reflect the search query and filters that had been applied. Now you can look at the subscriptions listing page, search for a plan name, and see breakdowns of renewing vs. canceled vs. expired, or paying vs. trial.
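One way to get filter counts that follow the current search is to ask Elasticsearch for terms aggregations alongside the query, since aggregations are computed over the documents the query matches. The sketch below only builds the request body and reads counts out of a response. The field names (plan_name, state, billing_status) and the sample response are assumptions made for illustration, not Recurly's actual schema or data, and the post does not say which Elasticsearch feature was used.

```python
from typing import Any, Dict


def subscriptions_search_body(plan_name: str) -> Dict[str, Any]:
    """Build a search body that returns hits plus per-filter counts."""
    return {
        # Full-text match on the (assumed) plan_name field.
        "query": {"match": {"plan_name": plan_name}},
        "aggs": {
            # One bucket per distinct value, counted over matching documents only.
            "by_state": {"terms": {"field": "state"}},
            "by_billing": {"terms": {"field": "billing_status"}},
        },
    }


def filter_counts(response: Dict[str, Any], agg_name: str) -> Dict[str, int]:
    """Turn a terms aggregation's buckets into a {value: count} mapping."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}


# Hand-written fragment in the shape a terms aggregation returns.
# The numbers are invented for the example.
sample_response = {
    "aggregations": {
        "by_state": {
            "buckets": [
                {"key": "renewing", "doc_count": 3},
                {"key": "canceled", "doc_count": 1},
            ]
        }
    }
}
```

Because the counts come back in the same response as the hits, the listing page can redraw its filters without a second round trip.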
Since search was so critical to the way most merchants use Recurly, we decided that instead of the hard cutover we’d initially planned, we’d take a more gradual approach. As we started making changes to each page, we’d create a new URL, e.g. /accounts_v2, where people could try out the new search functionality. Once we were sure that the new results were as good as or better than the original, we’d move the old page to /accounts_v1 and promote /accounts_v2 to /accounts. Then we’d repeat the process on the next page.
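The promotion step can be pictured as a small change to a routing table. This is a minimal sketch in Python rather than our production code: the RouteTable class and the two handler functions are hypothetical names invented for the example.

```python
from typing import Callable, Dict

Handler = Callable[[str], str]


def old_accounts(query: str) -> str:
    # Stand-in for the original listing page.
    return f"old search results for {query!r}"


def new_accounts(query: str) -> str:
    # Stand-in for the page backed by the new search cluster.
    return f"new search results for {query!r}"


class RouteTable:
    def __init__(self) -> None:
        self._routes: Dict[str, Handler] = {}

    def add(self, path: str, handler: Handler) -> None:
        self._routes[path] = handler

    def dispatch(self, path: str, query: str) -> str:
        return self._routes[path](query)

    def promote(self, page: str) -> None:
        """Keep the old page at /<page>_v1 and serve the v2 handler at /<page>."""
        current = f"/{page}"
        self._routes[f"/{page}_v1"] = self._routes[current]
        self._routes[current] = self._routes[f"/{page}_v2"]


routes = RouteTable()
routes.add("/accounts", old_accounts)     # what everyone sees today
routes.add("/accounts_v2", new_accounts)  # opt-in preview of the new search
```

Calling routes.promote("accounts") leaves all three URLs live: the old page stays reachable at /accounts_v1 as a fallback, and /accounts_v2 remains available for testing later changes.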
This strategy allowed us to offer both the old search and the new search to customers at the same time; to do comparison testing between the new search and the old search on production data; and to keep using V2 to beta test small changes even after the production code had absorbed the original set of upgrades.
As we started rolling the new search out to merchants, we kept hearing that, while the new search was much faster, queries that worked against the old system no longer worked. We found that our customers were searching for things we never expected. We created automated test cases from those queries and their expected results to ensure that we matched the old behavior and kept it working as we made other improvements.
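Those test cases amount to a table of queries paired with the results the old system returned, replayed against whatever search implementation is under test. The sketch below assumes a toy in-memory search function and invented account data. SearchCase, failing_cases, and toy_search are hypothetical names, not part of our codebase.

```python
from typing import Callable, List, NamedTuple, Tuple


class SearchCase(NamedTuple):
    query: str
    expected_ids: List[str]


SearchFn = Callable[[str], List[str]]


def failing_cases(
    search: SearchFn, cases: List[SearchCase]
) -> List[Tuple[SearchCase, List[str]]]:
    """Run every recorded query and return the cases whose results differ."""
    failures = []
    for case in cases:
        actual = search(case.query)
        if actual != case.expected_ids:
            failures.append((case, actual))
    return failures


# Invented data standing in for an index of accounts.
ACCOUNTS = {
    "acct_1": "Ada Lovelace ada@example.com",
    "acct_2": "Alan Turing alan@example.com",
}


def toy_search(query: str) -> List[str]:
    # Case-insensitive substring match, ignoring surrounding whitespace.
    needle = query.strip().lower()
    return sorted(
        account_id
        for account_id, text in ACCOUNTS.items()
        if needle in text.lower()
    )


# Queries of the kind merchants reported, with the results they expect.
CASES = [
    SearchCase("ada", ["acct_1"]),
    SearchCase("  TURING ", ["acct_2"]),
    SearchCase("@example.com", ["acct_1", "acct_2"]),
]
```

An empty list from failing_cases(toy_search, CASES) means the new behavior still matches every recorded expectation. Each newly reported query becomes one more row in CASES.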
All of this made for a much smoother release for the new search capability on the other listing pages. Customers were happy, support tickets dropped sharply, and we could move on to other challenges.
We took a few lessons away from the experience:
Stay ahead of tech debt: The old server should have been moved to a cluster and updated continuously, rather than being allowed to get stuck on an old, hard-to-upgrade version.
Avoid big cutovers to new functionality: By having a period of time where the old and new code live side-by-side, we’re able to identify and fix issues with a minimum of disruption for our customers.
Demo changes to a limited audience: Putting new capabilities behind a feature flag lets us show them to a subset of users and see how those users take advantage of them. We can then use what we learn to build test cases around real data.
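For the last lesson, one common way to pick the subset of users is a deterministic percentage rollout, where a hash of the flag name and user ID decides who sees the feature. This is a generic sketch of that technique, not a description of our feature flag system. The function name and the flag and user identifiers are made up.

```python
import hashlib


def in_rollout(flag: str, user_id: str, percent: int) -> bool:
    """Return True if this user falls inside the first `percent` of 100 buckets."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    # The first four bytes of the hash give a stable bucket from 0 to 99.
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent


def visible_users(flag: str, user_ids: list, percent: int) -> list:
    """Filter a list of users down to those who should see the flagged feature."""
    return [user_id for user_id in user_ids if in_rollout(flag, user_id, percent)]
```

Because the bucket depends only on the flag and the user ID, a given user gets the same answer on every request, and raising the percentage only ever adds users to the audience.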
We continue to use these lessons as we engineer new improvements to Recurly’s software.
If you’re interested, we’ll have a future blog post dealing with the more technical details of Elasticsearch.
Do you have your own experiences with Elasticsearch, or with making search work better? Questions? Concerns? Share them in the Comments below, or send them directly to the author at firstname.lastname@example.org.