Monday, September 30, 2013

The Smallest Distributed System

I attended the "The Smallest Distributed System" talk by Mathias Meyer at this year's GOTO Aarhus, and I quite liked it. The speaker got his message across to the audience and touched upon many of the important points to remember when building a distributed system. The points were nicely exemplified by his experiences with the Travis-CI system and the problems it faced growing from 700 builds per day in 2010 to currently 70,000 builds per day. The main lessons he advocated were the following:

Visibility


You need to be able to monitor your system and tell in real time what is broken. Gaining this visibility leads to a need for responsibility, because you have to react to the problems you see in production. The visibility obtained in the Travis-CI system led to a bunch of restructuring and refactoring. Timeouts against external APIs (e.g. the GitHub API) were changed from 10 minutes to 20 seconds, and a retry mechanism was introduced, enabling the system to fail fast and to ignore failures of non-important requests.
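To make the fail-fast idea concrete, here is a minimal sketch of my own (not Travis-CI's actual code) of calling an external API with a short timeout and a bounded retry, where non-important requests are allowed to fail silently; the function and parameter names are assumptions for illustration:

```python
# Sketch: fail fast against an external API with a short timeout and a few
# retries; swallow the failure if the request is not critical to the build.
import requests

def fetch_with_retry(url, timeout=20, retries=3, critical=True):
    """Try the request a few times with a short timeout instead of hanging."""
    last_error = None
    for attempt in range(retries):
        try:
            response = requests.get(url, timeout=timeout)
            response.raise_for_status()
            return response
        except requests.RequestException as error:
            last_error = error  # remember the failure and retry
    if critical:
        raise last_error  # important requests must surface the failure
    return None  # non-important requests: ignore the failure and move on
```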

Uncertainty ... Modularity


You need to deal with the uncertainty associated with running a distributed system in production. There will always be something failing or performing badly somewhere in a complex distributed system. To be able to deal with this, the codebase needs to be well-structured, with simple and small dependencies between the different modules in the system. In the Travis-CI codebase they had (and probably still have :-) a big ball of mud, where everything heavily depends on the travis-core module. So, in short, to deal with complexity and uncertainty you need modularity.

Simplicity


One of the problems in the Travis-CI architecture was the logging module and the need to order log events so they could be displayed and persisted correctly. The first implementation had to synchronise log events to ensure the ordering. Changing the solution slightly, by having each log entry know its own position within the full list of events in a build job, simplified this radically and made it possible to scale out the logging functionality. So simplicity made it easier to scale. Simplicity also made it easier to reason about, for instance, what went wrong during a breakdown in production.
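A small sketch of the idea, with names of my own choosing rather than Travis-CI's actual model: each log chunk carries its position within its build job, so many workers can append concurrently and the full log is reassembled by sorting on read instead of synchronising writers.

```python
# Sketch: log parts know their own order, so ordering is reconstructed lazily.
from dataclasses import dataclass, field

@dataclass
class LogPart:
    job_id: int
    number: int      # position of this chunk within the job's log
    content: str

@dataclass
class JobLog:
    parts: list = field(default_factory=list)

    def append(self, part: LogPart):
        # Parts may arrive out of order from many workers; just store them.
        self.parts.append(part)

    def full_log(self) -> str:
        # Order is recovered from the per-part counter when the log is read.
        return "".join(p.content for p in sorted(self.parts, key=lambda p: p.number))
```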

Mathias concluded by mentioning that the Travis-CI system still has a scalability problem because the relational Postgres database is a bottleneck.

Q&A


The most interesting part of the session was actually the Q&A. Here is my take on the questions and answers:

Q) What book can you recommend on this topic?

A) None that I know of. I read papers instead. You can find reading lists on various blog posts. Leslie Lamport (http://research.microsoft.com/en-us/um/people/lamport/pubs/pubs.html) has written numerous papers on the topic, which can be found on the Microsoft Research web page. Steve Vinoski (the track host) recommended the work published by the Distributed Systems Research Group at MIT, and Michael Nygard's Release It! book was also mentioned.

Q) Are you using circuit breakers?

A) No, but we are considering it and will probably implement it in the future. Mathias explained the concept, which you can read about in Michael Nygard's work. In short, it is a pattern that opens and closes the connection to a service based on a heuristic about how many erroneous responses the service has returned.
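For readers unfamiliar with the pattern, here is a minimal circuit breaker sketch of my own (not Travis-CI's implementation, and the thresholds are made up): after too many consecutive failures the breaker "opens" and calls fail immediately, and after a cool-down period a trial call is allowed through again.

```python
# Sketch of a circuit breaker: fail fast while a dependency is misbehaving.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()  # too many errors: open the circuit
            raise
        self.failures = 0  # a success closes the circuit again
        return result
```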

Q) How do you share a temporal clock between the modules in the distributed system?

A) We don't. We have isolated the responsibility of incrementing counters for ordering log events within one module.

Q) (my question, so the app actually works :-): Wouldn't it be obvious to use a NoSQL database [like for instance Riak] (added by track host Steve Vinoski) for persisting the log events?

A) I'm actually quite fond of Postgres, but yes, it could be a good idea to add a NoSQL database like, for instance, Riak. Because of the isolation of the different services in the current architecture, it would be quite easy for us to exchange the current database with a NoSQL variant. But introducing a distributed NoSQL database also introduces an extra distributed system, and thereby extra complexity and sources of failure.

Q) What is the Gatekeeper doing? (see slides for more details on the architecture)

A) The Gatekeeper transforms a commit into an executable build job.

Q) How do you collect metrics?

A) We use Librato Metrics with a custom collector, which we need to improve. We have assigned a unique UUID to each build job request, thereby making it possible to track its log entries through the system.
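A tiny sketch of the tagging idea, assuming a hypothetical request handler and function names of my own: generate one UUID per build job request and attach it to every log entry so the request can be followed across services.

```python
# Sketch: tag each build job request with a UUID for end-to-end tracing.
import uuid
import logging

logger = logging.getLogger("build")

def handle_build_request(payload):
    request_id = str(uuid.uuid4())          # one id per build job request
    logger.info("accepted build", extra={"request_id": request_id})
    process_build(payload, request_id)      # downstream log lines carry the id

def process_build(payload, request_id):
    logger.info("running build", extra={"request_id": request_id})
```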

Q) Does all of Travis-CI only run on AWS?

A) No, it actually also runs on EC2 and Heroku, and we also have dedicated hardware for the system.
