Monday, 30 September 2013

The Smallest Distributed System

I attended the talk "The Smallest Distributed System" by Mathias Meyer at this year's GOTO Aarhus, and I quite liked it. The speaker got his message through to the audience and touched upon many of the important points to remember when building a distributed system. The points were nicely exemplified by his experiences with the Travis-CI system and the problems it faced growing from 700 builds per day in 2010 to currently 70,000 builds per day. The main lessons he advocated were the following:

Visibility


You need to be able to monitor your system and tell in real time what is broken. Once you have this visibility, it brings responsibility, because you need to react to the problems you see in production. The visibility obtained in the Travis-CI system led to a bunch of restructuring and refactoring of the system. Time-outs against external APIs (e.g. the GitHub API) were reduced from 10 minutes to 20 seconds, and a retry mechanism was introduced, enabling the system to fail fast and to ignore failures of non-important requests.
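To make the time-out and retry idea concrete, here is a minimal sketch in Python. It is not Travis-CI's code: the names `call_with_retry`, `UpstreamTimeout` and the `important` flag are my own assumptions, and only the 20-second time-out comes from the talk. The `request` argument stands for any function that performs one external call with the given time-out.

```python
import time


class UpstreamTimeout(Exception):
    """Raised by a request that did not answer within its time-out (hypothetical)."""


def call_with_retry(request, timeout_s=20.0, attempts=3, backoff_s=0.0, important=True):
    """Call request(timeout_s) up to `attempts` times.

    A short time-out makes each attempt fail fast. If every attempt times out,
    an important request re-raises the last error, while a non-important one
    is ignored and None is returned.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return request(timeout_s)
        except UpstreamTimeout as error:
            last_error = error
            if attempt < attempts - 1:
                # Wait a little longer before each new attempt.
                time.sleep(backoff_s * (attempt + 1))
    if important:
        raise last_error
    return None
```

The point of the design is that a slow external API can hold up a caller for at most `attempts * timeout_s` seconds plus back-off, instead of 10 minutes per call.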

Uncertainty ... Modularity


You need to deal with the uncertainty associated with running a distributed system in production. There will always be something failing or performing badly somewhere in a complex distributed system. To be able to deal with this, the codebase needs to be well-structured, with few and simple dependencies between the different modules in the system. In the Travis-CI codebase, they had (and probably still have :-) a big ball of mud, where everything depends heavily on the travis-core module. So in short: to deal with complexity and uncertainty, you need modularity.

Simplicity


One of the problems in the Travis-CI architecture was the logging module and the need to order log entries so that they can be displayed and persisted in the correct order. The first implementation had to synchronise log events to ensure the ordering. Changing the solution slightly, by having each log entry know its position within the full list of events in a build job, simplified this radically. Having the log entries know their order made it possible to scale out the logging functionality. So simplicity made it easier to scale. Simplicity also made it easier to reason about what went wrong when, for instance, there was a breakdown in production.
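The ordering trick can be sketched in a few lines of Python. This is only my illustration of the idea, not the actual Travis-CI implementation; `LogPart`, `emit_parts` and `assemble_log` are hypothetical names. Each part of a job's log is numbered once at the source, so any number of consumers can receive the parts in any order and still put them back together without synchronising with each other.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogPart:
    job_id: str
    number: int  # position of this part within the job's log
    text: str


def emit_parts(job_id, chunks):
    """Number the chunks at the source, the one place that knows the true order."""
    return [LogPart(job_id, number, text) for number, text in enumerate(chunks)]


def assemble_log(parts):
    """Rebuild the log from parts that may have arrived in any order."""
    ordered = sorted(parts, key=lambda part: part.number)
    return "".join(part.text for part in ordered)
```

For example, if the parts of a job arrive as third, first, second, `assemble_log` still returns the text in the order it was produced.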

Mathias concluded by mentioning that the Travis-CI system still has a scalability problem, because the relational Postgres database is a bottleneck.

Q&A


The most interesting part of the session was actually the Q&A. Here is my take on the questions and answers:

Q) What book can you recommend on this topic?

A) None that I know of. I read papers instead. You can find reading lists in various blog posts. Leslie Lamport (http://research.microsoft.com/en-us/um/people/lamport/pubs/pubs.html) has written numerous papers on the topic, which can be found on the Microsoft Research web page. Steve Vinoski (the track host) recommended the work published by the Distributed Systems Research Group at MIT, and Michael Nygard's book Release It! was also mentioned.

Q) Are you using circuit breakers?

A) No, but we are considering it and will probably implement it in the future. Mathias explained the concept, which you can read about in Michael Nygard's work. In short, it is a pattern that opens and closes the connection to a service based on a heuristic about how many erroneous responses the service has returned.
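As a rough illustration of the pattern (my own sketch, not something shown in the talk and not code from Nygard's book), a circuit breaker could look like this in Python. The heuristic here is assumed to be a simple count of consecutive failures, and `failure_threshold`, `reset_after_s` and `CircuitOpenError` are hypothetical names.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0      # consecutive failures seen so far
        self.opened_at = None  # time the circuit opened, None while closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpenError("circuit is open; call skipped")
            # Cool-down has elapsed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # A successful call closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers get an immediate `CircuitOpenError` instead of waiting on a service that is known to be failing, which fits nicely with the fail-fast lesson above.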

Q) How do you share a temporal clock between the modules in the distributed system?

A) We don't. We have isolated the responsibility of incrementing counters for ordering log events within one module.

Q) (my question, so the app actually works :-): Wouldn't it be obvious to use a NoSQL database [like for instance Riak] (added by track host Steve Vinoski) for persisting the log events?

A) I'm actually quite fond of Postgres, but yes, it could be a good idea to add a NoSQL database like for instance Riak. Because of the isolation of the different services in the current architecture, it would be quite easy for us to exchange the current database with a NoSQL variant. But introducing a distributed NoSQL database also introduces an extra distributed system, and thereby extra complexity and additional sources of failure.

Q) What is the Gatekeeper doing? (see slides for more details on the architecture)

A) The Gatekeeper transforms a commit into an executable build job.

Q) How do you collect metrics?

A) We use Librato Metrics with a custom collector. We need to improve it. We have assigned a UUID to each build job request, which makes it possible to track its log entries through the system.
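The UUID idea is simple enough to sketch. The following Python is only my guess at the general shape, not how Travis-CI does it; the dictionary layout and the function names `new_build_request`, `log_event` and `trace` are assumptions made for illustration.

```python
import uuid


def new_build_request(commit_sha):
    """Create a build job request carrying its own random UUID."""
    return {"request_id": str(uuid.uuid4()), "commit": commit_sha}


def log_event(request, service, message):
    """Build a log entry tagged with the request's UUID."""
    return {"request_id": request["request_id"], "service": service, "message": message}


def trace(events, request_id):
    """Pick out all entries that belong to one request, across all services."""
    return [event for event in events if event["request_id"] == request_id]
```

Because every service copies the same `request_id` into its log entries, a single filter is enough to follow one build job through the whole system.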

Q) Does all of Travis-CI run only on AWS?

A) No, it actually also runs on EC2 and Heroku, and we also have dedicated hardware for the system.

Wednesday, 4 September 2013

Constrained Innovation

Abstract

My claim is that when you constrain people it fosters creativity. Let us dive into some exciting examples.

Danish Dogme

Lars von Trier made a complete fool of himself by throwing the Dogme manifesto at participants of the Cannes festival in 1995. The idea was to constrain moviemaking with a set of rules (Dogme 95). The hope was to catch a glimpse of reality. This set of rules started the Dogme film trend. Von Trier and his three "disciples", together the Dogme brothers, made the movies Idioterne, Mifunes sidste sang, Festen and The King Is Alive. This trend made moviemakers around the world rethink moviemaking, and at one point it was almost considered inappropriate not to make handheld movies. I am a big von Trier fan, so I could babble on about his greatness for a while, but I will spare you :-)

Kashmir's constraints experiences

I'm also a big Kashmir and Kasper Eistrup fan. Kashmir's latest album E.A.R was created with a set of rules around the magic number 12. The rules were:

A. Record the album within 12 months
B. 12 tracks
C. 12 Instruments (max)
D. Release-date: 2012-12-12

At first you would assume that these constraints would bring the creative process of writing music to a halt. It actually had the exact opposite effect! The rules resulted in the band having a lot of extra tracks (rule B violated). Reviewers loved the album, which was released on 2013-03-18 (rule D violated). Actually, I reckon that the band only fulfilled two of the rules in the end, namely rules B and C. A Danish blogger describes this process beautifully - Ørevoks

Gaming

Another example of how constraining people makes them more creative is team-building games like marshmallow tower building (the Marshmallow Challenge). My experience is that these constraints make people extra creative and, if you are lucky, there is a chance that this bubbling creativity leads to ideas. Ideas which in the end, if you are even luckier, lead to innovation, and when the probability approaches the improbable, it might also lead to business value.

One of the talks I still remember from the GOTO Copenhagen 2012 conference was the "Playmaking - Transforming Work Through Play" talk by Portia Tung. The talk was built around a bunch of games, and she concluded that gaming is necessary for human beings. I could not agree more. When my life becomes dull, it is typically because the "play vs. no-play" ratio of daily activities is too low.

Boiling ideas

In conclusion

All of the above exemplifies the general thesis that when you constrain people, it fosters creativity. The illustration above is my (best effort :-) attempt to illustrate what happens when you place two professors in a large tea pot and start boiling the water. Hopefully they come up with an idea that prevents a slow and painful death.

At this year's GOTO conference in Aarhus, I am looking forward to Faruk Ateş's talk, The Beauty of Constraints. My hope is that Faruk will enlighten me further on this subject or, even better, that he will surprise me by talking about something (completely) different.