Devver started with a good idea (web-based tools for Ruby hackers) and a working prototype that couldn't scale. We set out to make Devver scalable and decided to go with Amazon's EC2. Unfortunately, we quickly learned that Rinda, the library we had built our messaging system on, couldn't connect between multiple EC2 instances.
No worries, we thought: Amazon has its own queuing system, SQS, which can be used for messaging. Since Devver was on Amazon's EC2, it made sense to switch to Amazon's queues. We first refactored our code to isolate the old messaging from the rest of the system. Then we switched over to SQS, which we noticed was slower. Since Devver takes developer tools into the cloud, we have to run our tools against various projects to verify correct functionality and performance. Running Devver against our test and toy projects, SQS was slower, but it was working out OK.
After building a stats collection system that let us chart our progress against real metrics, we started running larger projects on Devver. We quickly learned that Devver couldn't handle any real projects. Using Devver's metrics to guide us, we isolated the problem to messaging.
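To make the idea of metrics-guided isolation concrete, here is a minimal sketch of per-phase timing in Ruby. It is not Devver's actual stats system; the `PhaseTimer` class and the `:messaging` and `:work` phase names are hypothetical, and the `sleep` calls merely stand in for real work.

```ruby
# Hypothetical sketch: accumulate wall-clock time per named phase so the
# most expensive phase stands out. Not Devver's real stats code.
class PhaseTimer
  attr_reader :totals

  def initialize
    @totals = Hash.new(0.0)
  end

  # Runs the block, adds its elapsed time to the named phase,
  # and returns the block's own value.
  def measure(phase)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
  ensure
    @totals[phase] += Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  end

  # Phases sorted from most to least total time spent.
  def worst_first
    @totals.sort_by { |_phase, seconds| -seconds }
  end
end

timer = PhaseTimer.new
timer.measure(:messaging) { sleep 0.02 }  # stand-in for a slow queue round trip
timer.measure(:work)      { sleep 0.002 } # stand-in for the actual job
timer.worst_first.each do |phase, seconds|
  puts format("%-10s %.3fs", phase, seconds)
end
```

Wrapping each phase in one `measure` call keeps the instrumentation out of the code being measured, which makes it cheap to leave in place while the system changes.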
We found that Devver was spending approximately 60 seconds messaging our distributed machines for every 6 seconds of work they did. The round trip of messaging through Devver made the entire project useless. Knowing this wasn't a problem in our Rinda-based prototype, we decided Devver just needed a better messaging system. Since we had just refactored the code, the messaging part was well isolated from the rest of the system, and it was easy to drop in a new solution.
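The kind of isolation that makes such a swap easy can be sketched roughly as follows. This is an illustration under assumptions, not Devver's code: `InMemoryMessaging`, `Dispatcher`, and the `put`/`take` interface are hypothetical names, and the in-process backend only stands in for a real messaging system such as Rinda or SQS.

```ruby
# Hypothetical in-process backend. Any object that responds to
# put(queue_name, message) and take(queue_name) could replace it.
class InMemoryMessaging
  def initialize
    # One FIFO queue per name, created on first use.
    @queues = Hash.new { |hash, name| hash[name] = Queue.new }
  end

  def put(queue_name, message)
    @queues[queue_name] << message
  end

  # Blocks until a message is available on the named queue.
  def take(queue_name)
    @queues[queue_name].pop
  end
end

# The rest of the system talks only to this class, never to a
# specific messaging library.
class Dispatcher
  def initialize(messaging)
    @messaging = messaging
  end

  def submit(job)
    @messaging.put(:jobs, job)
  end

  def next_job
    @messaging.take(:jobs)
  end
end

dispatcher = Dispatcher.new(InMemoryMessaging.new)
dispatcher.submit("run tests for project A")
puts dispatcher.next_job
```

Because `Dispatcher` receives its backend as a constructor argument, replacing the messaging system means writing one new class with the same two methods and changing a single line where the dispatcher is built.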
While Amazon SQS is robust and reliable, fast it is not: putting or taking a message took about 0.4 seconds remotely and about 0.04 seconds from within AWS (on EC2). We did some research, found many different messaging systems for Ruby, and held a Ruby messaging shootout. This post isn't about which system we eventually went with, because I am sure many of the options would have been fine. I mostly want to share a story of using metrics to pinpoint the worst offender, and of isolating the parts of your application so that in a single afternoon you can replace a highly critical part of a project. Changing out one heavily relied-on component, especially one with a flaw or a bottleneck, can be the difference between success and failure.
A quick example of the performance changes we saw from swapping out our messaging:
Devver processing Mocha with SQS: 63.8 to 80.8 seconds
Devver processing Mocha with new messaging: 3.9 seconds
That is it for this post, but look forward to some stats and our detailed review of Ruby messaging systems in the future.
