You need to be able to monitor your system and at real-time be able to tell what is broken. When you get this visibility, it will lead to the need of responsibility because you need to react to the problems you see in production. The visibility obtained in the Travis-CI system led to a bunch of restructuring and refactoring of the system. Time-outs against external APIs (e.g. github api) were changed from 10 minutes to 20 seconds, and a retry mechanisme was introduced enabling the system to fail fast and also to ignore when non-important requests failed.
Uncertainty ... Modularity
You need to deal with the uncertainty associated with running a distributed system in production. There will always be something failing or performing badly somewhere in a complex distributed system. To be able to deal with this, the codebase need to be well-structured with simple and small dependencies between the different modules in the system. In the Travis-CI codebase, they had (and probably still have :-) a big ball of mud, where everything heavily depend on the travis-core module. So in short to deal with complexity and uncertainty you need modularity.
One of the problems in the Travis-CI architecture was the logging module and the need for ordering logs to be able to display and persist them in the correct order. The first implementation had to synchronise log events to ensure the ordering. Changing the solution slightly by having each log entry know its order, within the full list of events in the a build job, simplified this radically. Having the log entries know their order, made it possible to scale-out the logging functionality. So simplicity made it easier to scale. Simplicity also made it easier to reason about what for instance went wrong when having a breakdown i production.
Mathias concluded by mentioning that in the Travis-CI system, they still have a problem with scalability because of the relational Postgres database being a bottle-neck.
The most interesting part of the session actually was the Q&A part. Here is my take on the questions and answers:
Q) What book can you recommend on this topic?
A) None that I know of. I read papers instead. You can find reading lists on various blog posts. Leslie Lamport (http://research.microsoft.com/en-us/um/people/lamport/pubs/pubs.html) has a written numerous papers on the topic which can be found on the Microsoft Research web page. Steve Vinoski (the Track Host) recommended the work published by the Distributed Systems Research Group at MIT and also Michael Nygaards Release IT book was mentioned.
Q) Are you using circuit breakers?
A) No but we are considering it and will probably implement it in the future. Matthias explained the concept, you can read about in Michael Nygaards work. In short it is a pattern that when implemented open and closes the connection to a service based upon a heuristic about how many erroneous responses the service has returned.
Q) Have do you share a temporal clock between the modules in the distributed system?
A) We don't. We have isolated the responsibility of incrementing counters for ordering log events within one module.
Q) (my question so the app actually works :-): Wouldn't it be obvious to use a NoSql database [like for instance Riak] (added by trackhost Steve Vinoski) for persisting the log-events?
A) I'm actually quite fond of Postgres, but yes - it could be a good idea to add a NoSQL database like for
instance Riak. Because of the isolation of different services in the current architecture it would be quite easy for us to exchange the current database with a NoSQL variant. But introducing a distributed NoSQL database also introduce an extra distributed system and thereby it also extra complexity a sources for failure.
Q) What is the Gatekeeper doing? (see slides for more details on the architecture)
A) The Gatekeeper transforms a commit into an executable build job
Q) How do you collect metrics?
A) We use Librata metrics with a custom collector. We need to improve it. We have assigned a unique UUID to each build job request and thereby making it possible to track its log entries through the system.
Q) Does all of Travis-CI only run on AWS?
A) No it actually also runs on EC2 and Heroku and we also have dedicated hardware for the system. L