The most exciting moment of the week was sneakily watching the video feed from the ESA control centre in Darmstadt as the tension mounted twenty-seven or so minutes after the Philae lander was due to have grounded (if “grounded” is really an appropriate word for a spaceship landing on a 2.5 mile wide comet 300 MILLION miles away, at the end of a journey of very nearly four BILLION miles). And then, even with the sound off – hard to be sneaky, otherwise – seeing the mission staff erupt into hugs and knowing that meant they’d pulled the whole crazy plan off. An amazing, amazing moment.
And then… it wasn’t quite right. The uncertainty started. Was the lander still on the surface? Had it anchored? Would the intermittent communication stabilise or disappear? Could the solar panels get power? The speeches and celebrations were already underway, but the outcome of the mission was still in the balance.
Which got me thinking about deployment planning for a website launch or update: after weeks, months or years of work, the last thing you want to happen is for the product to go wrong just as you launch it. Yes, as a topic this is fundamentally a massive comedown from monumental space expeditions but sorry, that’s how my brain works. It’s the despair of all within conversational range of me.
The security concept of defence in depth can be applied here. Let’s call it planning in depth, because that’s just the kind of coruscating creativity my cranium can conjure on a wet November Friday. Defence in depth means having multiple layers of defence, so that even if one layer is breached, an attacker has to overcome several others before they can harm you in any way. Likewise, successful planning in depth means that even if one thing goes wrong, several other things have to go wrong before your plan is compromised. And the Rosetta mission is a great example of planning in depth done brilliantly.
So what does planning in depth consist of? There are a few core components:
Okay, so your deployment plan is unlikely to need to factor in banking off planets four separate times like a cosmically brilliant pool player (played for and got, ESA, played for and got), but unless it’s a really simple deployment, there are going to be several moving parts you need to consider. Do you need Facebook approval for permissions on an app? Do you need to modify a database schema ahead of deploying a code change? Do you need to synchronise with a set of changes happening simultaneously on another platform? If the sequence hasn’t been planned in detail ahead of time, the chances are that required steps will be missed out and you’ll be scrambling to repair your launch process in the middle of running it.
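If it helps to picture it, here’s a minimal sketch of what a sequenced runbook might look like as a script – the step names are invented for illustration, not taken from any real deployment, but the point stands: the order is written down ahead of time, and the run halts at the first failure rather than blundering on.

```shell
#!/bin/sh
# Sketch of a sequenced deployment runbook (step names are hypothetical).
# 'set -e' stops the whole run at the first failing step, so later steps
# never run against a half-prepared environment.
set -e

run_step() {
  # Announce each step before running it, so a failure is easy to locate.
  echo "STEP: $1"
  shift
  "$@"
}

# Order matters: the schema change lands before the code that depends on it.
run_step "migrate database schema" echo "...schema migrated"
run_step "deploy application code" echo "...code deployed"
run_step "invalidate caches"       echo "...caches cleared"
run_step "run smoke tests"         echo "...smoke tests green"
```

Even this trivial shape forces you to decide the ordering in advance, which is most of the battle.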
The Rosetta team had to automate Philae’s landing sequence because of the 28-minute communications delay each way. But automating what you can ahead of time is good practice anyway – running a pre-prepared script massively increases your chances of a trouble-free rollout. Your process will be quicker, you won’t be making manual errors in the heat of the moment, and best of all you get to test the script beforehand so you can verify it’s going to work before you start.
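One common way to make such a script rehearsable is a dry-run mode. A hedged sketch – the rsync/ssh commands, hostnames and paths here are made-up placeholders, not a real setup:

```shell
#!/bin/sh
# Sketch of a pre-prepared deployment script with a rehearsal mode.
# The commands, hostname and paths below are hypothetical examples.
set -e

DRY_RUN="${DRY_RUN:-1}"  # default to rehearsal; set DRY_RUN=0 for the real run

do_cmd() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "would run: $*"   # rehearsal: print the command instead of running it
  else
    "$@"                   # the real run executes exactly what was rehearsed
  fi
}

do_cmd rsync -a build/ deploy@web01:/srv/app/
do_cmd ssh deploy@web01 'systemctl restart app'
```

The pay-off is that the script you rehearse is, flag aside, exactly the script you run on the night.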
It’s important to consider your landing conditions carefully to minimise the risks. For Philae, that meant mapping out the landscape of the comet and looking for a smooth landing spot. For the rest of us, that typically means choosing a time of day when traffic is low. But you also need to think about the surrounding landscape – if you’ve decided to deploy at 1AM because it’s normally a quiet time, make sure you’ve accounted for the reporting processes that normally run at 2AM (because it’s normally a quiet time).
Most projects will still have known issues when they launch – minor defects that are planned to be fixed within subsequent maintenance releases. The deployment team need to know what these are so that they don’t get derailed by finding them during the course of the release and presuming that they’re showstoppers.
The tests for the launch need to be designed in detail in advance, including functional tests, performance tests, and regression tests for functionality that isn’t supposed to have changed. Trying to wing it on the night will lead to issues being missed, possibly until it’s too late.
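The smoke-test layer of that can itself be scripted in advance. A minimal sketch of the shape such checks might take – the descriptions and expected values are placeholders, and in a real script the actual values would come from requests against the live site:

```shell
#!/bin/sh
# Sketch of pre-designed launch checks. Descriptions and values are
# placeholders; in a real script 'actual' would be fetched live, e.g.
#   status=$(curl -s -o /dev/null -w '%{http_code}' https://example.com/)
set -e

check() {
  desc="$1"; expected="$2"; actual="$3"
  if [ "$actual" = "$expected" ]; then
    echo "PASS: $desc"
  else
    echo "FAIL: $desc (expected '$expected', got '$actual')"
    exit 1
  fi
}

check "homepage returns 200"    "200" "200"
check "health endpoint says ok" "ok"  "ok"
```

Writing the expected values down beforehand is the point: on the night you only have to read PASS or FAIL, not debate what “working” means.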
Then think about steps to deal with any issues you encounter. If you don’t preplan, at best you’ll be trying to come up with plans on the fly (when you’re probably already tired); most likely, those plans will all turn out to need resources you can’t possibly get hold of without advance notice.
ESA didn’t have any possibility for rolling back to the previous version – they only had one shot. But if you run into a showstopping issue, your best plan may be to revert to the previous known good version of your product so you can maintain service whilst fixing the problem.
Make sure you understand what it means to roll back versions at any given point, including post-live. This may involve, for example, preserving data gathered before the rollback for later rescue – if you haven’t already planned to take a copy of the relevant database, chances are you won’t suddenly think to do it whilst already under stress.
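As a sketch of that idea – with a plain file copy standing in for whatever your database’s real dump tool is (pg_dump, mysqldump, a snapshot) – the backup happens first, unconditionally, before any rollback step runs:

```shell
#!/bin/sh
# Sketch: preserve launch-period data *before* rolling back. The plain
# file copy is a stand-in for a real database dump; all paths are
# hypothetical examples.
set -e

BACKUP_DIR="${BACKUP_DIR:-/tmp/rollback-backups}"
STAMP=$(date +%Y%m%d-%H%M%S)

backup_then_rollback() {
  mkdir -p "$BACKUP_DIR"
  # Step 1: copy the data gathered since launch somewhere safe.
  cp "$1" "$BACKUP_DIR/$(basename "$1").$STAMP"
  # Step 2: only now revert to the previous known-good version.
  echo "rolling back to previous release"
}

# Demo with a stand-in data file rather than a real database:
demo_data=$(mktemp)
echo "orders taken since launch" > "$demo_data"
backup_then_rollback "$demo_data"
```

Baking the copy into the rollback procedure means nobody has to remember it at 3am.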
You probably don’t need a set of speeches ready for the world’s cameras. But you will need to communicate with project stakeholders. Know who, when and how ahead of time.
Your team leads are probably going to be the best people to handle the launch procedure. They’ll also be the best people to handle the issues found immediately after your launch. They were probably also the best people to prepare for the launch. But they can’t work through continuously, so they need to be able to hand over to relief teams, possibly over their objections. Tired people can’t come up with the best solutions.
Just before the lander deployment, the ESA team discovered that the cold gas thruster meant to hold Philae onto the surface was (probably) malfunctioning. Things can happen in the run-up to a deployment (data link to the DR site will be down? yesterday’s backup didn’t validate? a last-minute defect found in a key monitor?), and the whole team need to know the situation to be able to make the right decisions. Update the plans as thoroughly as possible, too.
Despite everything, you might not have considered every eventuality. Knowing your objectives means you can make informed decisions quickly. For ESA, it was discovering that the lander had bounced twice and the harpoons hadn’t fired – should they fire the harpoons now that the lander had settled? Knowing that many of the planned experiments could be completed without firm anchorage, but that firing the harpoons unsuccessfully could dislodge the lander from the comet, ESA were able to decide to run those experiments first rather than risk not being able to complete any of them.
Knowing what is critically important is the key to making the right decision quickly in difficult circumstances. Tracking tags working correctly is important; customers completing the transactions to be tracked is critically important – your team needs to know to prioritise the latter at the expense of the former, if necessary.
Completing the deployment is the end of one period of hard work, but also the beginning of another. Give yourselves a moment for celebration in between.
Ultimately it’s about preventing bad decisions. Bad decisions are made by tired people, by people under a tight timeframe, by people without the information or resources they need. Planning in depth is about making decisions when the conditions are right for them to be good decisions, so that when the time comes all that’s left is to act upon them. And that’s why, despite operating at the extremes of human capability, despite battling through huge obstacles, the ESA team in Darmstadt are working on getting as much data as possible from Philae instead of starting the post-mortem into why they didn’t get any data at all.