Scalability Story

2018 began with a huge performance issue in my application. We had observed a few performance issues in 2017, but those were one-offs. This time it was different: the issues lasted longer and occurred more frequently. After some research on past volumes we found that the application was processing 70% more than it did two years earlier. The increase in volume was due to both organic and inorganic growth of the data that needed to be processed. At some point the issue escalated to such a level that we started doubting the sustainability of the application with the existing design. This made me and my team look at all possible avenues to scale.

Such issues take a toll on the development and support teams, consuming 20% to 30% of their time just to keep the application clear of them.

During this time, I came across 2 excellent books:

  • “Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems” by Martin Kleppmann
    • This is a long read, with hidden gems scattered all around. While reading it I had multiple aha moments as design considerations were demystified. I was thrilled to find the author discussing in detail the merits of letting an application use a mix of unstructured data (which keeps a complete view of the information in one place) and structured data (as in relational databases) at the same time.

A few very handy Oracle features I learnt during this time were SecureFiles for BLOB & CLOB storage, the DBMS_REDEFINITION package, and tablespaces with different extent sizes to optimize database performance.

I was exploring both horizontal and vertical scaling for my application, so it was quite obvious that I ended up exploring microservices and DevOps. What I realized is that their core concepts are not new; they were around long before being formalized. It is just that now there is off-the-shelf software that lets us create a microservices architecture with ease. Similarly, DevOps is the only logical way to survive in a complex environment. In my experience, teams maintaining complex systems practice a DevOps style of working without even realizing it; we simply have a name for it now.

Our need to analyze performance forced us to extend our logging with additional timing details to identify the slow components. Some may ask why we did not just use a profiler; we tried a few profilers on our application servers, but understanding their output proved difficult. I guess it takes years of practice to interpret and act on the output of these profilers. So we extended our logs to record granular performance figures. With this additional data came the problem of analyzing it, which forced us to build tools that analyze the logs. Python, sed & awk came in very handy for building such utilities.
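
To give a flavour of these utilities, here is a minimal sketch of the kind of script we ended up writing. The log format used here (timestamp|component|duration_ms) is made up for the example; our real format differs.

```python
#!/usr/bin/env python3
"""Summarize per-component timings from application logs.

Assumes each line looks like:  2019-07-01T10:15:02|FEED_LOADER|37
i.e. timestamp|component|duration_ms -- an illustrative format only.
"""
import sys
from collections import defaultdict

def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted list."""
    if not sorted_values:
        return 0.0
    k = max(0, int(round(pct / 100.0 * len(sorted_values))) - 1)
    return sorted_values[k]

def main(path):
    timings = defaultdict(list)
    with open(path) as fh:
        for line in fh:
            parts = line.strip().split("|")
            if len(parts) != 3:
                continue  # skip lines that are not timing records
            _, component, duration = parts
            try:
                timings[component].append(float(duration))
            except ValueError:
                continue
    for component, values in sorted(timings.items()):
        values.sort()
        print("%-20s count=%-8d p50=%-8.1f p95=%-8.1f max=%.1f" % (
            component, len(values),
            percentile(values, 50), percentile(values, 95), values[-1]))

if __name__ == "__main__":
    main(sys.argv[1])
```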

We learnt the power of service duplication, especially when CPU & RAM were constant. If the underlying component can withstand more volume and our processing is the only bottleneck, then duplicate the service: split one large feed into multiple small feeds and process them simultaneously, or split the export of data and run it through two processes instead of one. Splitting the data to be processed needs some thinking, especially when the sequence in which the data must be processed matters, as sketched below.
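
A rough sketch of the splitting idea, assuming a line-based feed where the first field is the business key (an assumption for illustration): a round-robin split would break per-key ordering, so we route by key instead, which keeps records for the same key in the same output file and in their original order.

```python
"""Split one large feed file into N smaller feeds while preserving per-key order."""

def split_feed(input_path, n_parts):
    outputs = [open("feed_part_%d.dat" % i, "w") for i in range(n_parts)]
    try:
        with open(input_path) as feed:
            for line in feed:
                key = line.split(",", 1)[0]   # assumed: key is the first field
                part = hash(key) % n_parts    # same key -> same output file
                outputs[part].write(line)     # order within a key is preserved
    finally:
        for out in outputs:
            out.close()

if __name__ == "__main__":
    split_feed("large_feed.dat", 4)
```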

Luckily, we were also in the middle of an infrastructure migration project, which gave us the chance to redesign the application with a much more scalable design. The first thing we noticed during the migration was that our servers, written in C++, perform much better on Linux than on Solaris. However, I cannot tell you by how much, as other things changed at the same time: we also moved from a regular Oracle database to Oracle Exadata. Our 50th-percentile transaction processing time dropped by 50%.

Previously, our complete application used to run on a single server. Since the migration was already changing too many things, we tried to stick to a similar architecture, simply because we did not want to impact the project timeline. However, we all realized that running everything from a single server would limit scalability; distributing the application over multiple servers or virtual machines (VMs) is the right thing to do. Just to get a hint of the future architecture, we built an active/active BCM solution, with only one component running on the otherwise passive site and the rest running on the active site.

Creating the right BCM solution is a challenge when every component, such as the database, the application server and the middleware, has its own BCM solution (meaning each can toggle between sites on its own). Everything can fail over separately, which increases availability. This is perfect as long as every component really can fail over; but if one of them cannot for some reason, it can potentially halt the complete application even though we are running it from multiple sites.

In the first quarter of 2019 we completed the migration. The new application worked faster than before, but the load increased again. This time the servers were able to keep up with it; however, our export interfaces became the bottleneck. We observed that CPU and memory were maxing out, and strangely more than 50% of the available RAM was occupied by the OS cache. We had no option but to move a few application servers to the second site to balance the load. Luckily, we had already proved that the application could run simultaneously from both sites. By July 2019, we were running actively on both sites.

It would be nice to scale the application horizontally using modulus splits and running small components on VMs. This will increase the need for queues. The application uses different types of queues: file queues, MQ and DB queues. File queues can be added dynamically, but adding MQs takes time, which limits our ability to scale horizontally.

We are currently exploring Kafka as a way to consolidate the different queue types into a single one. I really liked the official Kafka documentation available at https://kafka.apache.org/. Now it's time for us to do a proof of concept.
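
For the proof of concept, something as small as the sketch below is probably enough to start with. It uses the kafka-python client, a local broker on localhost:9092 and a made-up topic name; all three are assumptions on my part, not decisions we have made.

```python
"""Tiny Kafka round trip for the PoC (kafka-python client: pip install kafka-python)."""
from kafka import KafkaProducer, KafkaConsumer

BROKER = "localhost:9092"   # assumed local test broker
TOPIC = "trade-events"      # hypothetical topic name

# Produce a handful of test messages, keyed so related messages share a partition.
producer = KafkaProducer(bootstrap_servers=BROKER)
for i in range(10):
    producer.send(TOPIC, key=str(i % 2).encode(), value=("event-%d" % i).encode())
producer.flush()

# Consume them back from the beginning of the topic.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BROKER,
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,   # stop iterating when no message arrives for 5s
)
for msg in consumer:
    print(msg.partition, msg.offset, msg.key, msg.value)
```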

Now that we are running the application from multiple sites, we need a better orchestration mechanism. Initially we used information on a shared filesystem that tells each site what should run and what needs to be monitored there. A more robust system should read that data from the DB and keep a backup in the local filesystem in case the DB is unreachable. (Just a thought: why not use ZooKeeper? Does it really need to be that robust?)
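
Roughly, the pattern I have in mind looks like this. The function read_orchestration_config_from_db() is a hypothetical placeholder for whatever DB query we end up with, and the cache path is illustrative.

```python
"""Orchestration config: DB as the source of truth, local file as the fallback."""
import json
import os

CACHE_FILE = "/var/tmp/orchestration_cache.json"   # assumed local cache path

def read_orchestration_config_from_db():
    """Hypothetical placeholder: fetch the {component: site} mapping from the DB."""
    raise NotImplementedError("wire this to the real DB query")

def load_orchestration_config():
    try:
        config = read_orchestration_config_from_db()
        # Keep a local backup for the next time the DB is unreachable.
        with open(CACHE_FILE, "w") as fh:
            json.dump(config, fh)
        return config
    except Exception:
        # DB unreachable (or query failed): fall back to the last known good copy.
        if os.path.exists(CACHE_FILE):
            with open(CACHE_FILE) as fh:
                return json.load(fh)
        raise
```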

Another learning: never use NFS (Network File System) in production. We used it to make data visible from both sites, but it is super slow compared to a local disk or SAN. NFS can considerably impact the processing of file-intensive applications (especially the ones writing files).

The story goes like this: one of our fast services, with a transaction time of around 20 ms to 30 ms, receives data from all the data-processing services. Its work is simple: take a message from one queue and write it to different file queues (it just writes the data into multiple files). If a queue grows over a certain threshold, it applies back pressure and stops all the producer services until the queue clears. The introduction of NFS slowed this service down by 3 to 10 times and put a cap on the application's processing capability; all the performance gain went for a toss. We quickly switched back to a local drive and finally to SAN drives.
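
In rough Python, the service behaves like the sketch below. The directory names, threshold and producer signalling are illustrative only, not the real values or mechanism.

```python
"""Fan-out service: read from one input queue, write each message to several file queues.

If any file queue backlog crosses THRESHOLD, pause the producers until it drains.
"""
import os
import time

THRESHOLD = 10000                       # illustrative backlog limit
FILE_QUEUE_DIRS = ["/data/q_export", "/data/q_archive", "/data/q_downstream"]

def backlog(dir_path):
    """Backlog of a file queue = number of files waiting in its directory."""
    return len(os.listdir(dir_path))

def pause_producers():
    print("back pressure: telling producers to pause")    # real impl signals upstream services

def resume_producers():
    print("back pressure released: producers may resume")

def fan_out(input_queue):
    paused = False
    while True:
        msg = input_queue.get()                            # blocking read from the input queue
        for i, dir_path in enumerate(FILE_QUEUE_DIRS):
            out_path = os.path.join(dir_path, "%d_%d.msg" % (time.time_ns(), i))
            with open(out_path, "w") as fh:                # one file per message per file queue
                fh.write(msg)
        # Apply / release back pressure based on the deepest file queue.
        deepest = max(backlog(d) for d in FILE_QUEUE_DIRS)
        if deepest > THRESHOLD and not paused:
            pause_producers()
            paused = True
        elif deepest <= THRESHOLD and paused:
            resume_producers()
            paused = False
```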

Finally, with all the focus on the application's performance, we made it fast, but we dropped the ball on the data publishing part. The faster application started generating more and more data in a short time, the publishing part became the slow one, and backlogs started piling up on disk since it uses file queues.

There was no quick solution, and we could not slow down the processing, so we added disk to keep the files for later processing. From past experience we knew that splitting the data and processing it through multiple services would help, so we implemented a sharding technique based on our numeric primary key: a modulus operation splits the data over multiple queues so they can be processed simultaneously. The cost of this operation is low, and to start with we used modulus 2 (i.e. key%2=0 & key%2=1), so the data is divided equally. The beauty of this design is that, if needed, the modulus can be increased to n to scale horizontally. Finally we managed to clear the backlog.
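
A stripped-down version of the idea, using Python's multiprocessing purely for illustration (our real setup is separate publishing services fed by separate queues, and process_record() stands in for the actual publishing work):

```python
"""Shard work by numeric primary key using a modulus, one worker per shard.

With MODULUS = 2 this reproduces the key%2=0 / key%2=1 split; raising MODULUS
to n adds more parallel workers without changing the routing logic.
"""
from multiprocessing import Process, Queue

MODULUS = 2   # start with 2 shards; can be raised to n later

def process_record(record):
    """Illustrative stand-in for the real publishing work."""
    print("published", record)

def worker(shard_id, q):
    while True:
        record = q.get()
        if record is None:          # sentinel: no more work for this shard
            break
        process_record(record)

def run(records):
    queues = [Queue() for _ in range(MODULUS)]
    workers = [Process(target=worker, args=(i, queues[i])) for i in range(MODULUS)]
    for w in workers:
        w.start()
    for key, payload in records:
        queues[key % MODULUS].put((key, payload))   # same key always hits the same shard
    for q in queues:
        q.put(None)                                 # tell each worker to stop
    for w in workers:
        w.join()

if __name__ == "__main__":
    run([(k, "payload-%d" % k) for k in range(10)])
```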

To be continued…

That is the big story, and it will continue slowly. The slowness is because we don't have a dedicated team to scale the application; we do it by bringing in small internal changes with every release (like an internal improvement project).

While the big changes were going on, we were also tuning the application code for performance. Historically, there were crashes. These crashes did not hurt us before because they were infrequent, but with the volume we are now processing, their frequency increased. Every time a service crashes, it takes at least 5 to 10 minutes to detect and restart it. So we invested our time in fixing these crashes, and we also reduced the recovery time to 5 minutes at most. That increases availability, and increased availability means more processing: a small but crucial win.

I coined a phrase, “what is rare is not so rare in high volume environments”. I guess it’s another version of Murphy’s law: “whatever can go wrong, will go wrong.” 
