
Notes on Joyent’s outage (which also affected Twitter)

It seems that Joyent experienced some downtime (8 days) due to bugs in OpenSolaris that were triggered during an upgrade. (Read here and here)

One rather well known service was also affected by this outage as they are hosted by Joyent.

I guess this does not reflect very well on running OpenSolaris in a production environment, especially if downtime cannot be tolerated. Even with competent Solaris folk on board at Joyent, this type of thing can happen.

The stories point out that the bug that affected them had been in the OpenSolaris bug database for over a year, and they seem to place some blame on Joyent for not fixing the problem sooner. However, the reality is that upgrading systems is never a button click, the way upgrading a piece of software is on most people’s Windows machines. In a non-trivial production system, there are usually multiple intertwined dependencies that make it difficult to change the system on the fly. Usually one tries to isolate the interdependencies between the different pieces so that it IS possible to swap out one piece without affecting the others. But that is the ideal to work towards.

This is one reason why scheduled maintenance windows, along with some form of failure fire drills, can help catch these issues before they happen. However, the trick is getting the time for maintenance windows and being able to duplicate enough of the system in a test environment to run a fire drill. It sounds easy on paper but is very difficult in practice.
