Does using stateful web servers make sense?
I am working on a web application, which historically was built on a PHP/MySQL stack.
One of the key operations of the application had to do some heavy calculations that required iterating over every row of an entire DB table. Needless to say, this was a serious bottleneck, so a decision was made to rewrite the whole process in Java.
This gave us two benefits. The first was that the Java code ran the calculation much faster than the PHP process did. The second was that we could keep the entire data set in the Java application server's memory. So now we can do the calculation-heavy operations in memory, and everything happens much faster.
This worked for a while, until we realized we needed to scale, so we now need more web servers.
The problem is that, by the current design, they all must maintain the exact same state. They all query the DB, process the data, and keep it in memory. But what happens when you need to change this data? How do all the servers stay consistent?
This architecture seems flawed to me. The performance benefit from holding all the data in memory is obvious, but this seriously hampers scalability.
What are the options from here? Switch to an in-memory key-value data store? Should we give up holding state inside the web servers entirely?
Now switch to Erlang :-)
Yeah, that's a joke, but there's a grain of truth. The issue is: you originally had your state in an external, shared repository: the DB. Now you have it (partially) precalculated in an internal, non-shared repository: Java RAM objects. The obvious way forward is to keep it precalculated, but in an external, shared repository, the faster the better.
One easy answer is memcached.
Another is to build your own 'calc server', which centralizes both the calculation task and the (partial) results. The web frontend processes just access this server. In Erlang it would be the natural way to do it. In other languages you can still do it; it's just more work. Check ZeroMQ for inspiration, even if you don't use it in the end (but it's a damn good implementation).
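To make the 'calc server' idea a bit more concrete, here is a minimal in-process sketch in Java. It is only an illustration under assumptions of mine: the class name `CalcServer`, the methods `addRow` and `total`, and the "sum of all rows" calculation are hypothetical stand-ins, and a real version would sit behind a network interface (a socket, HTTP, ZeroMQ, etc.) rather than being called directly. The point it shows is that one single-threaded worker owns both the data and the cached result, so there is exactly one copy of the state to keep consistent.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical calc server: one worker thread owns the data and the cached result. */
public class CalcServer implements AutoCloseable {
    // All state below is read and written only on this single worker thread,
    // so updates and calculations are serialized without explicit locks.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();
    private final List<Long> rows = new ArrayList<>();
    private Long cachedTotal = null; // null means "needs recalculation"

    /** A frontend submits a data change; the cached result is invalidated. */
    public CompletableFuture<Void> addRow(long value) {
        return CompletableFuture.runAsync(() -> {
            rows.add(value);
            cachedTotal = null;
        }, worker);
    }

    /** A frontend asks for the result; the full pass runs only if the cache is empty. */
    public CompletableFuture<Long> total() {
        return CompletableFuture.supplyAsync(() -> {
            if (cachedTotal == null) {
                long sum = 0;
                for (long r : rows) {
                    sum += r; // stand-in for the real heavy calculation
                }
                cachedTotal = sum;
            }
            return cachedTotal;
        }, worker);
    }

    @Override
    public void close() {
        worker.shutdown();
    }
}
```

The web frontends would then hold no state of their own: they send changes and requests to this one owner and wait on the returned future.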
This may be a cliche, but data always expands to fill the space you put it in. Your data might all fit in memory today, but I guarantee you it won't at some point in the future. However far away that point is, that's the time frame you have to figure out a better architecture. The statefulness of your application is just a symptom of this bigger problem.
Does everyone do different calculations on the entire dataset? Is this something you can do in a batch overnight and have folks access during the day? How time-sensitive is it?
I think these are the questions you need to answer, because at some point you won't be able to buy enough memory to store the data you need. That might sound silly given where you are now, but you should plan on it being true. Many developers I've talked to don't think about what success looks like and what impact it has on their designs.
I agree with you - this sounds flawed, but I'd need more detail to know for sure.
You mention a large data set and heavy calculations, but you don't talk about how the data is updated, when the calculations are done, whether it's a day's worth of data or the entire data set, etc. It sounds a lot like a batch job that could be done daily off-line.
If that's the case, I'm not sure where the web ties into it. Are your web users just doing custom queries after the crunching is done? Is the data read-only or read-mostly for users? Or are they changing the data continuously on the fly?
I wonder if the persistence technology you've chosen affects things? Perhaps a NoSQL alternative could be better for your problem - like a distributed MongoDB cluster.
This is a data-engine question, I believe, as much as it is a web-server-distribution question. Why can't your (central) database engine do the calculation (quickly enough)?
You could store precalculated values which are flagged as stale when the underlying data are changed, requiring a recalc. There's no getting around the need to recalc when data change. You just need to manage when and how the change occurs as it will affect consumers of the data.
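As a rough sketch of the stale-flag idea in Java: the class below keeps precalculated values, marks them stale when told the underlying data changed, and recalculates lazily on the next read. The names `StaleFlagCache`, `markStale`, and `get`, and the `recalc` function passed to the constructor, are all hypothetical and only there for illustration. This version lives inside one process; to share it between several web servers, the same logic would have to sit in a shared store or a central service.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical cache of precalculated values with a per-key stale flag. */
public class StaleFlagCache<K, V> {
    private static final class Entry<V> {
        V value;
        boolean stale = true; // nothing has been calculated yet
    }

    private final Map<K, Entry<V>> entries = new ConcurrentHashMap<>();
    private final Function<K, V> recalc; // the expensive calculation (assumed)

    public StaleFlagCache(Function<K, V> recalc) {
        this.recalc = recalc;
    }

    /** Call this after the underlying data for the key has changed. */
    public void markStale(K key) {
        Entry<V> e = entries.get(key);
        if (e != null) {
            synchronized (e) {
                e.stale = true;
            }
        }
    }

    /** Returns the stored value, recalculating first if it is flagged stale. */
    public V get(K key) {
        Entry<V> e = entries.computeIfAbsent(key, k -> new Entry<>());
        synchronized (e) {
            if (e.stale) {
                e.value = recalc.apply(key);
                e.stale = false;
            }
            return e.value;
        }
    }
}
```

Recalculating lazily on read is just one policy. You could equally recalculate eagerly when the flag is set, or in a background job, depending on how consumers should experience the change.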