Saturday 7 April 2018

Scale Is A Big Word - Akka To The Rescue

I think "scale" has recently become a buzzword people reach for when things just aren't working, yet in my opinion the word has a much bigger meaning. Scale can be used when the software is simply not scalable: for example, if 1,000 users access a website and the website becomes unusable, the website is not scalable. And yet, people use it for a lot of other things. Let's take the JavaScript language as an example. In my experience JavaScript is not scalable in terms of developer numbers. JavaScript usually works well when a single developer works on each component and no integration is needed, mostly because of the non-strict behaviour of the language. This kind of behaviour caused languages like TypeScript to market themselves as "JavaScript that scales".

Now we see that the word "scale" can be interpreted differently than what people would assume. It goes even further: some say, for example, that a system that cannot be debugged easily is not scalable, and so on.

This brings us to this post's topic: how do you write systems that scale in 2018?
Until my current company I did not really handle much scale. The code I wrote usually served quite a few users, was small and easy to maintain, and a small server could usually handle the load. At my new company we started to face some scale issues, and that made me investigate how the world handles the problems we faced.

We humans like to think in a certain way, and that way leads us to write monolithic applications. This was not an issue until recently, as demand on a single application was not that high, and expanding on the "X" axis (simply running more copies) was usually quite easy. One of the areas that required us humans to step up our game was web development, as demand for high-traffic web applications increased. Web applications might still be considered monolithic: each web request is independent of the others, and they usually share some kind of database. This postponed the need to write non-monolithic applications, because we could still meet the requirements of the market.

So what is so bad about monolithic applications? Well... CPUs, servers, and datacenters are not monolithic. A few years back, Moore's law took a hit, and single-thread performance basically stayed the same in new processors. To compensate, processor companies started putting more cores in a single processor, making it much more power efficient and lowering the cost of computers and hardware in general. This means that if I write a simple monolithic application, my performance is limited by single-thread performance; I can't enjoy the low cost of running 50 cheap servers at a few bucks a month each, and I will usually have to buy one huge server and hope it will do the work.
Also, some applications are not web based and can't benefit from solutions like the Play framework or serverless solutions like Zappa. At this point I want to express my respect for the web world: these days a single developer can easily deploy his website, and it will support millions of users without the developer having to deal with any of the DevOps problems he had in the past.

But what about non-web applications? Or even non-standard web applications, where the separation between requests is not obvious?
There are a lot of tools that try to help developers mitigate this, but those tools usually still endorse the monolithic way. They consist of threads and locks, which cause plenty of problems of their own. Threads are expensive performance-wise, and locks are prone to errors like not putting a lock in the right place, or causing a deadlock. Also, debugging a multi-threaded concurrent application is a nightmare: what is the flow, and am I debugging the flow I am interested in or something else? Analysing logs suffers from the same issues. And what happens when I run out of cores on the machine? How can I know which part of the application to move? How will the different threads communicate? My point is, comparing this approach to Zappa makes me think we have made no progress at all for anything that is not a standard web application; we are still years behind.
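To make the deadlock point concrete, here is a minimal Scala sketch of the classic lock-ordering mistake (the locks and threads are made up for illustration): two threads take the same two locks in opposite order, and each ends up waiting forever for the lock the other one holds:

    object DeadlockDemo extends App {
      val lockA = new Object
      val lockB = new Object

      // Thread 1 takes lockA then lockB; thread 2 takes them in the opposite
      // order. If each grabs its first lock before the other releases, both
      // block forever waiting on the second lock.
      val t1 = new Thread(() => lockA.synchronized {
        Thread.sleep(100)
        lockB.synchronized { println("t1 done") }
      })
      val t2 = new Thread(() => lockB.synchronized {
        Thread.sleep(100)
        lockA.synchronized { println("t2 done") }
      })
      t1.start(); t2.start()
    }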

I started reading other success stories that involve large-scale deployments, like WhatsApp's Erlang stack and Netflix's download service, which led me to a wide range of new concepts that were clearly important to understand, as those companies created really good products at the end of the day.

Let's start:

1) Actor Model - this model was invented a few decades ago and was really ahead of its time. The idea is that each piece of mutable state in the system should be encapsulated behind a message queue; this unit is called an actor. An actor processes one message at a time. When you come to think of it, this is exactly like a lock - a lock already contains a queue inside it to keep track of which requests run after the lock is freed. This forces the developer to think in terms of states instead of locks. Actors can also create new actors and supervise them. Some implementations of the actor model are Erlang and Akka, which we talk about further in this post. A minimal sketch of an actor appears right after this list.

2) Event Sourcing - the idea here is that every change enters the system as a command, which, once validated, is converted into an event that alters the state of the actor. Events are facts, which makes them immutable. Events can help us better understand what happens in the system. They also allow an actor, in case of a crash or other problem, to restore its state to the last working one by replaying all the events up to the crashing one, making the system much easier to debug and much more crash-tolerant (just let it crash...).

3) CQRS - this separates the reads from the writes. This way we can create read-side mirrors of the data we need to query, which brings many advantages, like not slowing down the write side, reducing merge conflicts, and more.
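To make the actor idea concrete, here is a minimal sketch in classic Akka (Scala); the Counter actor and its string messages are made up for illustration:

    import akka.actor.{Actor, ActorSystem, Props}

    // All mutable state lives inside the actor and is touched only while
    // processing one message at a time, so no locks are needed.
    class Counter extends Actor {
      private var count = 0

      def receive: Receive = {
        case "increment" => count += 1
        case "get"       => sender() ! count
      }
    }

    object Main extends App {
      val system  = ActorSystem("demo")
      val counter = system.actorOf(Props[Counter], "counter")
      counter ! "increment" // messages are queued in the mailbox
      counter ! "increment" // and processed strictly one at a time
    }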


I focused on Akka because, back then, I thought it was the most mature and easiest to use of the various frameworks out there. Akka is an actor model implementation. It consists of Akka Actor, Akka Persistence, Akka HTTP, and more.

An Akka actor contains a mailbox, which is a message queue. Akka also exposes APIs to create new actors, send messages, and do all the other things actors can do. To handle concurrency, Akka creates a thread pool sized to the machine's CPU core count, and a dispatcher checks the actors' mailboxes; when it sees that an actor's mailbox has a message, it gives that actor CPU time. This means that only actors with messages to handle get CPU and, most importantly, an execution context. That matters because an execution context also costs something - less than a thread, but it costs. Say we are trying to handle 2,000,000 outgoing web requests. In regular programming we would "await" each request, taking up an execution slot. In Akka, while we wait for a message to arrive, the dispatcher does not "care" about us, and we take no execution context at all.
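Here is a small sketch of that idea (the Fetcher actor and its fake request are made up): the actor kicks off an asynchronous call and pipes the result back to itself as a plain message, so it holds no thread while the call is in flight:

    import akka.actor.Actor
    import akka.pattern.pipe
    import scala.concurrent.Future

    case class Fetch(url: String)
    case class Response(body: String)

    class Fetcher extends Actor {
      import context.dispatcher // ExecutionContext for the Future and pipe

      def receive: Receive = {
        case Fetch(url) =>
          // Stand-in for a real non-blocking HTTP call; when the Future
          // completes, its value lands in our mailbox like any other message.
          Future(Response(s"body of $url")).pipeTo(self)
        case Response(body) =>
          println(s"got: $body")
      }
    }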
Akka Persistence is an API that lets an actor persist the events it produces and optionally save snapshots of its state - an implementation of event sourcing. The database behind the persistence is pluggable: it can be Cassandra, LevelDB, MongoDB, and more. The developer does not need to care about it, as it is transparent.
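A rough sketch of what this looks like in classic Akka Persistence (the counter, its command, and its event are made up for illustration):

    import akka.persistence.PersistentActor

    case class Increment(by: Int)   // command coming in
    case class Incremented(by: Int) // event: an immutable fact in the journal

    class PersistentCounter extends PersistentActor {
      override def persistenceId: String = "counter-1"

      private var count = 0

      def receiveCommand: Receive = {
        case Increment(by) =>
          // state changes only after the event is safely stored
          persist(Incremented(by)) { event => count += event.by }
      }

      // on restart, the journal is replayed through here to rebuild the state
      def receiveRecover: Receive = {
        case Incremented(by) => count += by
      }
    }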

Now let's talk about a simple use case for Akka: a simple ticketing system. Say each ticket is an actor whose state is the ticket description, title, comments, and so on. Using Akka Persistence, I can tell the actor something like "if you do not receive any message within 10 minutes, just kill yourself". With Akka Persistence, killing actors is fine, because I can always recover them. This way I keep only the most-used tickets in memory, for the best response time. This is called passivation. Without any further work I can audit every change to the actor, as everything is just an event that can be viewed in the event journal. Now, that's just a simple example. Say we have to load some extra data for the ticket: we can create actors that fetch it, those actors can run concurrently, and now we can fan out to compute the things we want. And as said before, because of the way actors work: no locks! Every actor encapsulates a very small piece of state that only the actor itself can change, so there is no need for locks.
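A sketch of the "kill yourself after 10 minutes" part (the Ticket actor is made up; in a real system it would also be a persistent actor so its state survives):

    import akka.actor.{Actor, ReceiveTimeout}
    import scala.concurrent.duration._

    class Ticket extends Actor {
      // ask Akka to send us ReceiveTimeout after 10 minutes of silence
      context.setReceiveTimeout(10.minutes)

      def receive: Receive = {
        case ReceiveTimeout =>
          context.stop(self) // free the memory; the journal keeps the events
        case _ =>
          // handle ticket commands (add comment, change title, ...) here
      }
    }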

With this knowledge, I took one of the heavy algorithms we had, converted it to Akka actors, and tested it in a real scenario. For starters, the performance of Akka is something out of this world: the ability to handle millions of states, with everything persisted, and without much extra work on the infrastructure side. Everything can be distributed almost out of the box. Amazing!
Because everything is queue-based, throttling the load on every component was quite easy. For example, if I saw that one of my "sinks" was doing too much work, I could tell the queue to back off a little, hold the data in its buffer, and continue slowly without impacting the rest of the system.

But that was not enough for me. I wanted to test how far I could go, as my real scenario was pretty small. I built a simulation of all the inputs and outputs I had and tried to reach the limit, to also test the distributed features. Right from the start I hit some big memory issues: there were just too many states trying to occupy memory. Akka suggests a few forms of back-pressure:

1) Akka Streams - built on top of Akka actors, Akka Streams is based on "Source", "Flow", and "Sink". A Source is the input stream - it can be a file buffer, for example. A Flow is what we do with each chunk in the stream; for example, a flow can split the stream into multiple streams and then merge them back, doing everything concurrently. A Sink is where the processed data ends up; it can, for example, save the output to a file. Akka Streams supports backpressure at its core, so while the stream is still busy handling data, the source does not move on to the next element. The problem with Akka Streams is that it does not fully support distribution. Also, Akka Streams does not support persistence at all. (A small sketch appears right after this list.)

2) Work-pulling pattern - in this pattern the developer implements a back-pressure mechanism that sends a message to the source actor telling it when to stop. The problem I had with this approach is that it is a pain for developers. I don't want developers messing with infrastructure patterns during development; they should develop logic, not infra.

3) Custom mailbox - with this, I can override the default mailbox and tell it what a "Work" message is and what a "Work Done" message is. The mailbox can then keep track of active "work" and, above a certain threshold, tell the dispatcher "I don't have any messages waiting"; the dispatcher will then not hand the actor any new work until things settle down. With this approach I can decide how many states to keep in memory for each type of actor. For example, if an actor encapsulates a boolean state, I don't mind keeping millions of them, but if an actor holds a user, I don't want to keep many in memory.
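To illustrate option 1, here is a minimal Akka Streams sketch (the numbers and rates are made up): a source of a million elements, a trivial flow, and a throttle stage; backpressure propagates upstream automatically, so the source never emits faster than the throttle allows:

    import akka.actor.ActorSystem
    import akka.stream.{ActorMaterializer, ThrottleMode}
    import akka.stream.scaladsl.{Flow, Sink, Source}
    import scala.concurrent.duration._

    object StreamDemo extends App {
      implicit val system: ActorSystem = ActorSystem("streams")
      implicit val mat: ActorMaterializer = ActorMaterializer()

      Source(1 to 1000000)                                  // Source: the input
        .via(Flow[Int].map(_ * 2))                          // Flow: per-element work
        .throttle(100, 1.second, 100, ThrottleMode.Shaping) // at most 100/sec
        .runWith(Sink.foreach(println))                     // Sink: the output
    }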

The custom mailbox solved my memory problems altogether. With Akka I could run a system that handles millions of IOs in a short timeframe, with everything persisted, using a very small amount of CPU and memory, while easily expanding to more nodes as necessary - truly scalable!
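For reference, here is a very rough sketch of what such a mailbox could look like (the Work/WorkDone messages and the threshold are made up, and exactly when the dispatcher re-checks the mailbox depends on Akka internals):

    import akka.actor.{ActorRef, ActorSystem}
    import akka.dispatch.{Envelope, MailboxType, MessageQueue, ProducesMessageQueue}
    import com.typesafe.config.Config
    import java.util.concurrent.ConcurrentLinkedQueue
    import java.util.concurrent.atomic.AtomicInteger

    case object Work     // a message that starts expensive work
    case object WorkDone // a message that signals the work has finished

    class ThrottlingMailbox(settings: ActorSystem.Settings, config: Config)
        extends MailboxType with ProducesMessageQueue[ThrottlingMessageQueue] {
      override def create(owner: Option[ActorRef], system: Option[ActorSystem]): MessageQueue =
        new ThrottlingMessageQueue(maxInFlight = 1000)
    }

    class ThrottlingMessageQueue(maxInFlight: Int) extends MessageQueue {
      private val queue    = new ConcurrentLinkedQueue[Envelope]()
      private val inFlight = new AtomicInteger(0)

      def enqueue(receiver: ActorRef, handle: Envelope): Unit = {
        handle.message match {
          case Work     => inFlight.incrementAndGet()
          case WorkDone => inFlight.decrementAndGet()
          case _        =>
        }
        queue.offer(handle)
      }

      def dequeue(): Envelope   = queue.poll()
      def numberOfMessages: Int = queue.size

      // Report "empty" to the dispatcher while too much work is in flight,
      // so the actor is not scheduled again until the count settles down.
      def hasMessages: Boolean = !queue.isEmpty && inFlight.get < maxInFlight

      def cleanUp(owner: ActorRef, deadLetters: MessageQueue): Unit =
        while (!queue.isEmpty) deadLetters.enqueue(owner, dequeue())
    }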

In the future I want to implement something like re-balancing the actors based on mailbox sender statistics. That means I could have my system spread across many regions of the globe, and Akka would try to cluster the relevant actors at each site automatically, balancing the load across the globe while taking latency cost into account (a "metric").

And yet there is much more to learn. Debuggability, for example: with event sourcing it is not easy. By default Akka uses Java serialisation, which is not really readable. I am still reading about akka-kryo, which might be a much better serialisation engine and could help me better understand what happens in the system and keep track of the timeline.

Also, monitoring: Datadog looks amazing. Akka itself has a feature to save statistics for every actor, so custom monitoring tools can be built as well.

To summarize: Akka requires a different way of thinking. Some say that if you can avoid using Akka, don't use it. But if you must get the performance and scalability, Akka can deliver, and in some cases it can reduce redundant infra code, as it implements RPC, message handling, pub-sub, and more.
