miketria's comments | Hacker News

Hi, this is Mike from Atlassian Engineering. Strongly agree with this. I'd say that if you can afford it, don't do the hard deletes on a schedule though. You never know when there's a system out there referring to soft-deleted data that fails once the data is hard deleted. Hard deletes should feel frightening because they are frightening.


I disagree for one reason: you really don't want the tooling or the process to rot. Running it automatically normalizes the scary. Otherwise you have bespoke tools in indeterminate states being run by people who are learning how to run them again. That's when I believe things get dangerous.

If it forces additional fail-safes or backups to be able to do so safely, then that's probably a good thing to have anyway, no?
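To make the disagreement concrete, here's a minimal sketch of a scheduled purge with the kind of fail-safes described above: only rows soft-deleted past a retention window are touched, the run aborts unless a backup has been verified, and an unexpectedly large batch stops the job instead of deleting. The table, column names, and thresholds are all hypothetical, not anything Atlassian actually runs.

    import sqlite3
    from datetime import datetime, timedelta

    RETENTION = timedelta(days=30)   # hypothetical retention window
    MAX_ROWS_PER_RUN = 1000          # fail-safe: abort if the batch looks too big

    def purge_soft_deleted(conn: sqlite3.Connection, backup_verified: bool) -> int:
        """Hard-delete rows that were soft-deleted more than RETENTION ago."""
        if not backup_verified:
            raise RuntimeError("refusing to purge: no verified backup for this period")

        cutoff = (datetime.utcnow() - RETENTION).isoformat()
        rows = conn.execute(
            "SELECT id FROM issues WHERE deleted_at IS NOT NULL AND deleted_at < ?",
            (cutoff,),
        ).fetchall()

        if len(rows) > MAX_ROWS_PER_RUN:
            # An unexpectedly large batch means something is off: stop and page a human.
            raise RuntimeError(f"refusing to purge {len(rows)} rows (> {MAX_ROWS_PER_RUN})")

        conn.executemany("DELETE FROM issues WHERE id = ?", rows)
        conn.commit()
        return len(rows)

Run on a schedule, the scary part stays exercised, and the guard rails are what gets tested every run rather than a bespoke tool someone has to relearn under pressure.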


Hi, this is Mike from Atlassian Engineering. Not a minor issue. Once we knew the extent and severity of the incident, we had hundreds of engineers engaged and working to restore service.


I should have clarified that I was talking about leadership's external communication on the incident, like in the article. Nobody doubted you were working around the clock, or with lots of people involved.


Hi, this is Mike from Atlassian Engineering. You are right that the checks need to improve to reduce human error, but that's only half of it. I don't see this as human error, though; it's a system error. We will be doing some work to make these kinds of hard deletes impossible in our system.


Hi, this is Mike from Atlassian Engineering. For the customers impacted by this incident covered by an SLA, we will adhere to our contractual terms. However, given the long duration of this outage, we are planning to go above and beyond for our impacted customers. We are currently focused on restoring service, but after that will be discussing how we can make it right for each impacted customer.


It looks like you are focused on Hacker News comments.


Hi, this is Mike from Atlassian Engineering. You are right that the communications from us have not lived up to our standard. We will focus on this specifically once we restore service and get the post-incident review out there. More details here: https://www.atlassian.com/engineering/april-2022-outage-upda...


Spamming HN isn't helping your cause man.


There is irony in complaining about over-communication when it's in response to criticisms of under-communication.


Key word "spamming." It wasn't communication but another dry and information-free blob of text. Communication requires something to say.


It's worse than that, they're saying communication was not up to their standards without actually communicating anything we didn't already know.

At least explain why there was such a total communication blackout company-wide. Even support staff weren't allowed to discuss it. Why?


Well, why are you writing a blog post and posting the link on HN? We’re not directly your customers. Did you apologise individually to the customers you ignored? You don’t have to apologise to anyone here.


Hi, I'm Mike and I work in Engineering at Atlassian. Here's our approach to backup and data management: https://www.atlassian.com/trust/security/data-management - we certainly have the backups and have a restore process that we keep to. However, this incident stressed our ability to do this at scale, which has led to the very long times to restore.


Hey Mike, not dumping on you personally, but the RTO is claimed to be 6 hours. I can understand that being a target, but we're at 32x that RTO target, with a communicated target date of another 12 or so days IIRC. That's nearly two orders of magnitude longer than the RTO. I don't think any rational person would take that document seriously at this point.

I'll also ask (since nobody else has answered, I may as well ask you as well):

1. Are the customers actually being restored from backups (and additionally, by a standard process)?

2. Will the recovery also include our integrations, API keys, configuration and customization?


Hi Ranteki, you're right that the RTO for this incident is far longer than any of the ones listed in the doc I linked above. That's because our RPO/RTO targets are set at the service level and not at the level of a "customer". This is part of the problem and demonstrates a gap both in what the doc is meant to express and in our automation. Both will be reviewed in the PIR. Also, the answer to (1) and (2) is yes.


A friend in Atlassian engineering said the numbers on the trust site are closer to wishful thinking than actual capabilities, and that there has been an engineering-wide disaster recovery project running because things were in such bad shape. The recovery part hasn't even started. If Atlassian could actually restore full products in under six hours, they should have been able to restore a second copy of the products exclusively for the impacted customers.


Nah. The RTO/RPO assumes that only one customer has a failure big enough to require a restore.

When the entire service is hosed, that's a totally different set of circumstances, and you have to look at what the RTO/RPO are for basically restoring the entire service for all customers. And since they have more than a thousand customers, it totally makes sense that it would take orders of magnitude longer to restore the entire service.


I think this document and incident are a decent example of common DR planning failure patterns.

It is explained here that Atlassian runs regular DR planning meetings, with the engineers spending time planning out potential scenarios, as well as quarterly tests of backups and tracking of findings from them.

So, with those two things happening, I imagine the recovery time objective of <6 hours took a typical "we deleted data from a bad script run affecting a lot of customers" scenario into account, along with the metrics from the quarterly backup tests.

That doesn't even come close to the recovery time we are seeing now, however. We're coming up on two orders of magnitude more than that.

The above doc seems pretty far out of line with what is currently happening.


How’s the atmosphere internally, Mike? Must be crazy times there. I know this isn’t your fault, so hang in there. Cheers!


Whose fault is it? Is it any one person/team’s fault? Management? Culture?

“Corporations are people too”


400 tenants doesn't seem like that much scale though...? What will happen if there's an incident affecting more than 0.18% of tenants?


It's 400 tenants scattered across all their servers, so they are most likely having to build out servers to pull the data and then put it in place. That's 10x the problem that restoring a single server would be.
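As a toy illustration of why that is so much slower than a single-server restore (shard layout and names are made up here, not how Atlassian's systems actually work): every shard that holds even one affected tenant has to be restored in full somewhere before that tenant's data can be extracted and re-imported.

    # Hypothetical layout: per-shard backups, each holding many tenants.
    shard_backups = {
        "shard-01": {"tenant-a": "rows…", "tenant-b": "rows…", "tenant-c": "rows…"},
        "shard-02": {"tenant-d": "rows…", "tenant-e": "rows…"},
    }
    affected = {"tenant-a", "tenant-e"}  # tenants hit by the bad deletion script

    for shard, backup in shard_backups.items():
        hit = affected & backup.keys()
        if not hit:
            continue
        # The expensive step: provision capacity and restore the whole shard
        # snapshot, even though only a fraction of its tenants are needed.
        restored = dict(backup)
        for tenant in hit:
            data = restored[tenant]  # extract just this tenant's data
            print(f"re-importing {len(data)} bytes for {tenant} from a full restore of {shard}")

With 400 tenants spread over many shards, that full-restore-then-extract step repeats for every shard touched, which is why it dwarfs the single-server case.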


You mean your poor practices and bad design. The only way to prevent this type of issue in the future is to admit the failures.

