Once, one of our applications suddenly became slow. It was a simple application responsible for reading data files generated by another application, transforming them, and putting the transformed messages on a message queue. The producer application started a new file after every 100 messages. The slow consumer created a huge backlog: downstream applications were not getting data in a timely manner, and the number of unprocessed files kept growing, so we were heading towards a full disk. It was a ticking time bomb. A nightmare!!
Increasing the number of consumers, and the number of queues to consume from, was the only option. It was important to maintain the order within each queue: the data had to be consumed in the same order in which it was produced. The data can contain insert, modify or delete operations, so their sequence cannot be changed. It is easy to create another consumer, but we could not just partition the data arbitrarily and hand it out. We scripted some logic purely with shell scripts, sed and awk to partition the data using a modulus split and feed each partition to a different consumer. That gave us some breathing space. Luck was on our side: we had numeric primary keys, so we just used an operation like whichSplitFile = KEY % 2 ? FileA : FileB. As our keys were fairly random, the remainders were spread roughly evenly and the splits came out almost equal. By changing the denominator one can split the data into more partitions if needed.
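A minimal sketch of that split, assuming '|'-delimited records with the numeric primary key in the first column (the delimiter, the column position, and the file names are assumptions for illustration):

```sh
# Send each record to FileA or FileB based on the parity of the key in column 1.
# Records with the same key always land in the same file, so per-key order is preserved.
awk -F'|' '{ print $0 > ($1 % 2 ? "FileA" : "FileB") }' input.dat
```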
The team working to find the reason for the slowness was nowhere close. Can't blame them; this was a hard problem. We suspected something had changed in the data and was causing the slowness: the application was not slow in the dev environment, but it was slow in test and prod, where we feed the same input data. Since we were not able to improve the performance, a few of us started investigating how we could leverage horizontal scaling.
We built on top of our modulus split idea: we made the producer write the data into different folders based on a modulus operation on the primary key. As an example, here are some primary keys; dividing each by 3 leaves a remainder of 0, 1 or 2:
PK    | Operation | Result
------+-----------+-------
65646 | 65646 % 3 |   0
76481 | 76481 % 3 |   2
67539 | 67539 % 3 |   0
34696 | 34696 % 3 |   1
Based on the result, we write the data to folder_0 if the remainder is 0, folder_1 if it is 1, and folder_2 if it is 2. Data with the same key always goes to the same folder. As long as the keys are reasonably random, the distribution of data across these three folders will be roughly uniform.
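A sketch of that producer-side routing, with the same assumptions as before ('|'-delimited records, numeric key in column 1); keeping the original file name inside each folder is also an assumption for illustration:

```sh
# Create the target folders, then route each record to folder_<KEY % 3>,
# preserving the input file's name inside the chosen folder.
mkdir -p folder_0 folder_1 folder_2
awk -F'|' '{ print $0 > ("folder_" ($1 % 3) "/" FILENAME) }' input.dat
```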
Now, instead of a single consumer, we could run three consumers. We kept the denominator configurable through a configuration file, so we could increase or decrease the parallelism. This helped us scale the application and bought us time to investigate the real issue. Once we fixed the slowness, we went back to the single-consumer configuration; but now the system can be scaled horizontally using the modulus split any time we need.
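The wiring might look roughly like this; the consumer.conf file, its splits= key, and the consumer.sh wrapper are illustrative assumptions, not our actual setup:

```sh
# Read the partition count from the config file (a line such as "splits=3")
# and start one consumer per folder.
N=$(awk -F'=' '$1 == "splits" { print $2 }' consumer.conf)
for i in $(seq 0 $((N - 1))); do
    ./consumer.sh "folder_$i" &   # hypothetical consumer wrapper, one per partition
done
wait
```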
Fun fact: every time I think of this approach, I picture it as a machine gun firing continuously, each bullet hitting a different target.
Later, I found this idea in a book called Scalability Rules: 50 Principles for Scaling Web Sites. The book reads like a thriller and is full of wonderful insights, many years of learning packed into a small volume. It also showed that we were not the only ones to use the modulus split for scaling; it's pretty much a standard technique.