Sunday, 6 November 2011

BBM and Siri outages, a failure in more ways than you think.

Morning sports fans! Yes, I've missed you too, but I'm having a super perfectionist phase and none of my posts seem good enough to publish. This should all blow over and there will be quite a few posts some time in the future. So, let's wind the clock back a smidge and remember one of the biggest fails of the year: The Great BlackBerry Outage of 2011! (Yeah, I'm expecting more to come.)

So, cast your mind back to October 10th-ish when the first reports of a RIM server crash came in. Millions of people were left without access to BBM and some Internet services, such as Facebook. Ah, the many jokes we made that they didn't see. Well, it quickly spread to North America and then other planets! (BONUS QUESTION: How many of these planets do you know?) It was somewhat fitting that BlackBerry users, who were fairly vain about BBM, had it ripped from them for a couple of days. It was a good thing.

Eventually, RIM apologised, and service and the status quo were restored. There was still the great debate of BlackBerry vs. iPhone (as explained here by Jimmy Carr and Sean Lock on 8 Out of 10 Cats), but the iPhone users had a little chip on their shoulder that said "We never have service outages." This was compounded by the fact that the release of the iPhone 4S, and with it Siri, was imminent. Just to catch you up, Siri is the voice-activated personal assistant that comes with the iPhone 4S. (For further details see this.)

Anywho, Siri is now here and people are enjoying asking it silly questions, demonstrating which accents it can't understand and showing that it's only fully functional in the USA. What I was, until recently, unaware of is that Siri runs in the cloud. I have no love for cloud computing, but will ignore that at this juncture. A couple of days ago a failure caused Siri to be unable to connect to the Apple servers and thus not work. Wait, you mean Apple has service outages as well? *le gasp*! Well of course they do! The reason is simple: they seem to have overlooked a very basic principle of computer security, namely critical infrastructure.

What is critical infrastructure, you ask? Good question! Critical infrastructure is an old-ish field which studies a setup and sees what it would take for it to stop working. The classical example is a very nice graph-theoretic problem, which is quite nicely demonstrated by the London Underground map. Assume this is your only means of transport. Pick any station and/or section of the map. The problem is: can you make a single cut and isolate that station/section from the rest of the map? There are variants, such as finding the minimum number of cuts needed to isolate a station/section, and the same questions can be asked of other things such as electricity, water and gas supply. You get the gist of it all, right?
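For the curious, here is a minimal sketch of the single-cut question in Python. The map is a made-up toy network, not the real Tube map, and the names `TRACKS`, `reachable` and `single_cuts` are mine, purely for illustration. It simply removes each section of track in turn and checks whether its two ends can still reach each other, which is the brute-force way of doing it rather than the clever one.

```python
from collections import deque

# Hypothetical toy network (NOT the real Underground map):
# stations are vertices, sections of track are edges.
TRACKS = [
    ("A", "B"), ("B", "C"), ("C", "A"),  # a loop: no single cut splits it
    ("C", "D"),                          # the only link out to D and E
    ("D", "E"),
]

def reachable(start, edges):
    """Return the set of stations reachable from start over the given edges."""
    neighbours = {}
    for u, v in edges:
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def single_cuts(edges):
    """List every edge whose removal leaves its two endpoints disconnected."""
    cuts = []
    for edge in edges:
        remaining = [e for e in edges if e != edge]
        u, v = edge
        if v not in reachable(u, remaining):
            cuts.append(edge)
    return cuts

print(single_cuts(TRACKS))  # [('C', 'D'), ('D', 'E')]
```

In this toy map the loop A-B-C survives any single cut, while cutting either C-D or D-E strands the stations at the end of the line.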

The same can be done for communication and telecommunication networks. This is normally done, but it can be a bit tricky. With wired communications, it's easy to draw up a graph-style map, with each wire as an edge and each node as a vertex. However, the same is not really true of wireless communications. To stop wired communications between points A and B, you need to sever the wire joining them. It's not as clear what the equivalent for wireless communication is. There is also the issue that, unlike wired devices, which are immobile, wireless devices are by definition mobile.
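If A and B are joined by more than one route, one snip will not do, and the question becomes how many wires you have to sever. Below is a rough sketch of one way to count that, assuming the wired network is given as a plain list of wires. It leans on Menger's theorem (the minimum number of wires to cut equals the number of routes that share no wire) and counts those routes with a bare-bones unit-capacity max-flow. The function name `min_wires_to_cut` and the toy network are my own inventions for illustration.

```python
from collections import deque

def min_wires_to_cut(wires, a, b):
    """Smallest number of wires whose removal separates a from b.

    Counts wire-disjoint routes by repeatedly finding an augmenting
    path with breadth-first search (unit-capacity max flow).
    """
    if a == b:
        raise ValueError("a and b must be different nodes")
    # Each undirected wire gives one unit of capacity in both directions.
    capacity = {}
    for u, v in wires:
        capacity.setdefault(u, {})
        capacity.setdefault(v, {})
        capacity[u][v] = capacity[u].get(v, 0) + 1
        capacity[v][u] = capacity[v].get(u, 0) + 1
    if a not in capacity or b not in capacity:
        return 0  # one of the nodes has no wires at all
    routes = 0
    while True:
        # Breadth-first search for a route with spare capacity.
        parent = {a: None}
        queue = deque([a])
        while queue and b not in parent:
            node = queue.popleft()
            for nxt, cap in capacity[node].items():
                if cap > 0 and nxt not in parent:
                    parent[nxt] = node
                    queue.append(nxt)
        if b not in parent:
            return routes
        # Use up the route, leaving residual capacity going backwards.
        node = b
        while parent[node] is not None:
            prev = parent[node]
            capacity[prev][node] -= 1
            capacity[node][prev] += 1
            node = prev
        routes += 1

# Hypothetical toy network: a loop A-B-C with a spur C-D-E.
WIRES = [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D"), ("D", "E")]
print(min_wires_to_cut(WIRES, "A", "B"))  # 2
print(min_wires_to_cut(WIRES, "A", "E"))  # 1
```

Note that this only makes sense for the wired case: it assumes every link is a fixed edge you can draw on a map.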

So, now do we consider simply the connection between the devices or do we also have to consider the location? Can we only consider one or do we have to consider both? If I go into a lift and lose wireless connectivity is that a failure of the network or the device or both or neither? If you are thinking such distinctions are a moot point, then you are pretty much correct. Yes, it's not a major issue, but it should not be completely overlooked. There are a lot more examples of this, but that would mean delving into technicalities, which I would rather not do.

And there is the issue of time. These things take time, quite often a lot of it. There are so many contingencies to consider, such as the classic: the CTO chokes on sushi, the rest of the department is killed in a meteor strike and the only other guy who knows the password gets retrograde amnesia. Yes, that is a tad far-fetched and one should probably stop when retrograde amnesia is the most likely event in your scenario. The digital market thrives on speed. You need to get the next product out there two weeks before the previous one is launched.

So, as you can see, owing to several issues, the critical infrastructure analysis is possibly not done as well as it should be, which can cause these kinds of outages. On the other hand, you can do the most thorough analysis and the worst-case scenario may still occur, thus causing an outage. So basically it's all a roll of the dice, and remember: "God doesn't play dice!"
