Identifying the Problems

Agoda’s Price Delivery System serves hotel room prices for our web, mobile and affiliate APIs, handling over a million requests per second in five data centers across the globe.

As traffic from these edge clients grew over time, we noticed the response times felt by our customers increasing. We set out to find the cause of this trend and, in the end, were able to improve PDS response time by 20%. In this article we’ll show you how.

First, we identified the components that could be improved and categorized them into two main flows, which teams broke down to tackle separately. This article focuses on the metadata retrieval flow.

When retrieving metadata, PDS instances used local in-memory caches. To measure the computation and I/O characteristics of this flow, we added timing and cache-efficiency metrics.
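As a rough illustration of that instrumentation (not the actual PDS code), here is a minimal Java sketch of hit/miss counters and a latency timer wrapped around a retrieval call. The class and method names are hypothetical, and a real service would report to a metrics backend rather than printing.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

// Hypothetical sketch of the kind of metrics attached to the metadata lookup path:
// a latency timer around the whole retrieval plus hit/miss counters for the cache.
public class MetadataCacheMetrics {
    private final AtomicLong hits = new AtomicLong();
    private final AtomicLong misses = new AtomicLong();

    public void recordHit()  { hits.incrementAndGet(); }
    public void recordMiss() { misses.incrementAndGet(); }

    public double hitRatio() {
        long h = hits.get(), m = misses.get();
        return (h + m) == 0 ? 0.0 : (double) h / (h + m);
    }

    // Wrap a retrieval call and report its duration (stubbed here as a print).
    public <T> T timed(String name, Supplier<T> retrieval) {
        long start = System.nanoTime();
        try {
            return retrieval.get();
        } finally {
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.printf("%s took %d ms%n", name, elapsedMs);
        }
    }
}
```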

Figure: flow components and latency of the system before the improvement

Once we started collecting measurements in production, we observed higher metadata retrieval times than we expected. The 99th percentile of total API response time was 460 milliseconds, of which metadata retrieval took 110 milliseconds. The cache hit ratio was only around 50%, so we knew that improving it would reduce the 99P of retrieval time.

Additionally, application cold starts caused cache misses that could result in stampedes on the database, increasing its latency and compounding the problem.

We recognized that we could reduce the 99P of metadata retrieval time by improving the cache hit ratio, and that is what we set out to do with the following steps.

Further analysis showed that our cache hit ratio was being hurt by frequent cache evictions. Traffic from affiliates, which includes broad availability and price searches, contributed to high variation in requests to the PDS. We were using a Least Recently Used (LRU) strategy to evict data when the caches filled, and the variation caused evictions to occur at a high rate.
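To make the eviction behavior concrete, here is a minimal LRU cache sketch built on the JDK’s LinkedHashMap. It is not our production cache, just an illustration of how highly varied keys under a fixed capacity lead to constant evictions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU cache in the spirit of a local in-memory cache: when capacity is
// reached, the least recently used entry is evicted. Highly varied request keys
// (as with broad affiliate searches) keep pushing fresh entries in and useful ones out.
public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public LruCache(int capacity) {
        super(16, 0.75f, true); // accessOrder = true gives LRU ordering
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity; // evict once the cache is full
    }
}
```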

To reduce evictions, it was evident the system required more cache capacity. According to calculations for the complete set of hotels, metadata size was about 5 GB, and we anticipated it would increase with new business features related to pricing.

In summary, our 99P response time suffered from a low cache hit ratio, due to a high eviction rate, caused by request variability and insufficient cache capacity.

This problem could be reduced by picking a cache topology better suited to our scenario.

There are several caching topologies that address these problems. Let’s go through the options before we explain our decision.

Local in-memory cache

This is the simplest kind of cache topology, and it is the one in our existing solution as described in the problem section. Each API instance holds its own cache in memory on its host server.

Distributed client-server cache

In this topology, the cache is external, and the service uses a client library to access data from the centralized cache over a network protocol. This provides data consistency, but it adds latency, since data has to be fetched from a remote host.

Replicated cache

The cache is in memory, but the cache data in different instances of the API is synchronized through a client library behind the scenes, so a change in one cache is propagated to the other instances. This gives good responsiveness and better fault tolerance than a centralized cache, but it introduces data inconsistency between instances, assuming replication is asynchronous.

Near cache hybrid

In this type, two cache layers are maintained: a local one and a remote one. The local cache is small, and whenever data is not found there, the lookup falls back to the remote cache.
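A near-cache lookup can be sketched roughly as below; the LocalCache map and RemoteCache interface are hypothetical stand-ins rather than a specific library.

```java
import java.util.Map;
import java.util.Optional;

// Sketch of a near-cache lookup: try the small local cache first, then the remote
// cache, and copy remote hits into the local layer on the way back.
public class NearCache<K, V> {
    private final Map<K, V> local;           // small in-process cache (e.g. an LRU map)
    private final RemoteCache<K, V> remote;  // centralized cache accessed over the network

    public interface RemoteCache<K, V> { Optional<V> get(K key); }

    public NearCache(Map<K, V> local, RemoteCache<K, V> remote) {
        this.local = local;
        this.remote = remote;
    }

    public Optional<V> get(K key) {
        V cached = local.get(key);
        if (cached != null) return Optional.of(cached);

        Optional<V> fromRemote = remote.get(key);
        fromRemote.ifPresent(value -> local.put(key, value)); // warm the local layer
        return fromRemote;
    }
}
```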

Different caching topologies have their own strengths and drawbacks. After diagnosing the PDS issue, we needed a fast, high-capacity cache; the other requirements were data consistency and sustaining a high update rate. Considering these constraints, we chose the distributed client-server cache architecture because it fits our purpose. However, this topology has low fault tolerance on its own, so we needed a cache technology that minimizes latency, optimizes performance, and improves fault tolerance.

Couchbase fit our purpose. It is a fast in-memory cache that supports key-value storage and easily handles a high volume of concurrent requests. Couchbase provides fault tolerance by replicating data into multiple clusters, and its masterless architecture makes scaling easy. Another reason we chose Couchbase is that we already had support for it inside Agoda, which shortened the adoption and learning curve.
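As an illustration of how a service could read and write metadata documents in such a cache, here is a minimal sketch using the Couchbase Java SDK 3.x key-value API. The connection string, credentials, bucket name and the property:: key format are placeholders, not our actual configuration.

```java
import com.couchbase.client.core.error.DocumentNotFoundException;
import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.Cluster;
import com.couchbase.client.java.Collection;
import com.couchbase.client.java.json.JsonObject;
import java.util.Optional;

// Hypothetical key-value wrapper around a Couchbase collection holding hotel metadata.
public class MetadataCacheClient {
    private final Collection collection;

    public MetadataCacheClient(String connectionString, String user, String password, String bucketName) {
        Cluster cluster = Cluster.connect(connectionString, user, password);
        Bucket bucket = cluster.bucket(bucketName);
        this.collection = bucket.defaultCollection();
    }

    // Store one hotel's metadata document under a property-id key.
    public void put(long propertyId, JsonObject metadata) {
        collection.upsert("property::" + propertyId, metadata);
    }

    // Fetch metadata; an absent key is a normal cache miss, not an error.
    public Optional<JsonObject> get(long propertyId) {
        try {
            return Optional.of(collection.get("property::" + propertyId).contentAsObject());
        } catch (DocumentNotFoundException e) {
            return Optional.empty();
        }
    }
}
```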

The existing local cache architecture used a write-behind strategy to fill the cold cache. Application cold starts hurt the cache hit rate and put pressure on the source database, increasing latency. With this strategy, the API starts with an empty cache; on a cache miss, the request falls back to the database, and the metadata is written into the cache asynchronously.
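A rough sketch of that miss-handling path, with hypothetical Metadata and MetadataRepository types, might look like this: the request is served from the database, and the cache write happens asynchronously so the caller does not wait for it.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the original miss-handling path: serve the request from the database
// and push the result into the cache asynchronously.
public class WriteBehindMetadataCache {
    public interface MetadataRepository { Metadata loadMetadata(long propertyId); }
    public record Metadata(long propertyId, String payload) {}

    private final ConcurrentHashMap<Long, Metadata> cache = new ConcurrentHashMap<>();
    private final MetadataRepository database;

    public WriteBehindMetadataCache(MetadataRepository database) {
        this.database = database;
    }

    public Metadata get(long propertyId) {
        Metadata cached = cache.get(propertyId);
        if (cached != null) return cached;                   // cache hit

        Metadata fromDb = database.loadMetadata(propertyId); // cold start: fall back to the DB
        CompletableFuture.runAsync(() -> cache.put(propertyId, fromDb)); // asynchronous cache fill
        return fromDb;
    }
}
```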

Even if we could continue to fit the data set in memory, we also needed to assess whether there was a better strategy than write-behind for improving the hit ratio in our scenario.

There are several caching strategies for populating a cache store, whichever topology we choose.

In this type of caching strategy, the request is first served from the cache; if the data is not found there, it is loaded from the underlying data source.
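One common way to express this pattern in Java is a small get-or-load helper; this is a generic illustration, not our implementation.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Read-through sketch: serve from the cache when possible, otherwise load from the
// underlying source and store the result before returning it.
public class ReadThroughCache<K, V> {
    private final Map<K, V> cache = new ConcurrentHashMap<>();

    public V get(K key, Function<K, V> loader) {
        return cache.computeIfAbsent(key, loader); // miss -> load from source, then cache
    }
}
```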

This strategy is used when updating data: the change is written to the cache first and then pushed down to the underlying database.

As described for our initial architecture, write-behind caching is when data is written to the cache asynchronously.

In this scenario, data is loaded into the cache before the application is ready to serve any requests. If the cache is partially filled it is called a warm cache; if it is completely filled, it is called a hot cache.
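A hot-cache warm-up can be sketched as a simple loop over the complete set of property IDs; the types below are hypothetical stand-ins for the source database and the cache client, and this is a generic illustration of the strategy rather than the pre-caching system described later.

```java
import java.util.List;
import java.util.function.LongFunction;

// Sketch of building a hot cache: before serving traffic, metadata for every known
// property is loaded and written into the cache, so the first request never falls
// back to the database.
public class CacheWarmer<V> {
    public interface CacheWriter<V> { void put(long propertyId, V value); }

    private final List<Long> allPropertyIds;   // the complete set of hotels to preload
    private final LongFunction<V> loadFromDb;  // loads one property's metadata from the source DB
    private final CacheWriter<V> cache;

    public CacheWarmer(List<Long> allPropertyIds, LongFunction<V> loadFromDb, CacheWriter<V> cache) {
        this.allPropertyIds = allPropertyIds;
        this.loadFromDb = loadFromDb;
        this.cache = cache;
    }

    public void warmUp() {
        for (long propertyId : allPropertyIds) {
            cache.put(propertyId, loadFromDb.apply(propertyId)); // fill every entry up front
        }
    }
}
```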

The existing write-behind technique is efficient when requests are consistent and subsequent requests are similar, so that new requests have a high chance of finding data in the cache. But since our affiliates are among the end clients, their search requests introduce huge variation; on top of that, cold starts put a huge load on the database. We needed to maintain a hot cache to avoid the fallback calls to the database that increase latency.

Let’s get into the details of how we maintain a complete cache of the Price Delivery data centralized in Couchbase. We allocate Couchbase clusters in each data center, since cross-datacenter queries would have high latency. The Couchbase cache also needs to be hot, so we implemented a sync system that pushes data from the database into these caches.

Pre-caching

Figure: the pre-caching system

The data we populate into the cache, called metadata, needs to be updated frequently to capture changes in the primary database records and keep the cache fresh. The sync system is therefore scheduled to run every twenty minutes, and it fully syncs the source data every time. This may sound inefficient, but the cost of determining differential updates correctly was much higher than refreshing the whole data set, mainly because of the complexity of how we calculate metadata: implementing change tracking across all the related tables would have been a substantial project.

This sync system that fills up the Couchbase cache has different components, which we call a Scheduled Producer, Pre-cache Builders (consumers), and a Shared Queue.

The producer is a simple component that produces the property IDs that need to be synced within a predefined period of time and pushes them into the shared queue. Since Agoda has millions of hotels to sync, a single consumer would be a bottleneck, so multiple instances of the Pre-cache Builder consumer are spawned and share the work on the queue of property IDs, preparing the metadata needed by the pricing system and writing it into the cache.
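Here is a hedged sketch of what such a scheduled producer could look like with the Kafka Java client; the topic name, broker address and loadAllPropertyIds are placeholders, while the twenty-minute interval matches the sync schedule described above.

```java
import java.util.List;
import java.util.Properties;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.LongSerializer;
import org.apache.kafka.common.serialization.StringSerializer;

// Sketch of a scheduled producer: every sync interval it publishes the full list of
// property IDs to a shared topic for the Pre-cache Builders to consume.
public class ScheduledProducer {
    private static final String TOPIC = "precache-property-ids"; // placeholder topic name

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092"); // placeholder broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", LongSerializer.class.getName());
        KafkaProducer<String, Long> producer = new KafkaProducer<>(props);

        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            List<Long> propertyIds = loadAllPropertyIds(); // full sync: every hotel, every run
            for (long id : propertyIds) {
                producer.send(new ProducerRecord<>(TOPIC, String.valueOf(id), id));
            }
            producer.flush();
        }, 0, 20, TimeUnit.MINUTES);
    }

    private static List<Long> loadAllPropertyIds() {
        return List.of(1L, 2L, 3L); // stand-in for reading the complete hotel list
    }
}
```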

The queue also controls the rate of database requests from this application, so there is no cache stampede effect on cold application starts.

For the queue, we chose Apache Kafka for its high throughput and fault tolerance. It is a distributed streaming platform that supports publishing and subscribing to streams of records.
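A Pre-cache Builder consumer might be sketched as below with the Kafka Java client: all instances join the same consumer group so Kafka spreads the property IDs across them, while buildMetadata and writeToCache are placeholders for the metadata computation and the cache upsert.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.LongDeserializer;
import org.apache.kafka.common.serialization.StringDeserializer;

// Sketch of a Pre-cache Builder: instances share one consumer group, so the queue
// of property IDs is split across them; each builds metadata and writes it to the cache.
public class PrecacheBuilder {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092");        // placeholder broker address
        props.put("group.id", "precache-builders");          // shared group = work sharing across instances
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", LongDeserializer.class.getName());

        try (KafkaConsumer<String, Long> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("precache-property-ids"));
            while (true) {
                ConsumerRecords<String, Long> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, Long> record : records) {
                    long propertyId = record.value();
                    Object metadata = buildMetadata(propertyId); // query the source DB, compute pricing metadata
                    writeToCache(propertyId, metadata);          // upsert into the Couchbase cache
                }
                // Polling at the consumer's own pace bounds the rate of DB queries,
                // which is what prevents a stampede on cold starts.
            }
        }
    }

    private static Object buildMetadata(long propertyId) { return "metadata-for-" + propertyId; }
    private static void writeToCache(long propertyId, Object metadata) { /* cache.put(...) */ }
}
```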

This pre-caching system improved the total response time by approximately 20% overall, as expected. This was possible only because of the fast, complete cache: there is no fallback to the database, and the latency of metadata delivery from the cache dropped to approximately 15 ms at the 99th percentile.

We identified that the cache hit ratio problem was caused by limited cache size, which forced 50% of requests to fetch data from the database. We fixed this by moving from a local cache to a distributed cache topology and adopting a pre-caching strategy. The distributed cache topology gave us a bigger cache, and pre-caching reduced cache misses to zero, giving better latency on metadata retrieval.

This is the story of how we improved system performance by adding a cache layer between the persistent datastore and the service. But we still have challenges to solve, related to scalability and to keeping the cached data up to date.

The size of the metadata is increasing as new business features are added. As it grows, the time to fetch and write data into the cache will also increase, which will reduce the performance of the Pre-cache Builders.

Cache freshness is the main requirement of the pricing system. Over time, the number of hotels will grow until a full sync takes longer than the freshness requirement allows, meaning hotel metadata will already be outdated before the Pre-cache Builder updates it. Adding consumers to the pool seems like a good idea, but the database is the bottleneck, and it cannot be scaled easily.
