I was surprised to see their canary stages are just 5 minutes. Many problems tak...

jules2689 · on Jan 25, 2021

It's actually longer than 5 minutes. There is the duration of the 2% canary deploy, during which it starts to pick up traffic, then a 5 minute wait, then a 20% "deploy", then another 5 minute wait. All in all this comes out to around 10-15ish minutes in canary. This is a stage where we can almost instantly shut off traffic to the canary deploy.

Could we reduce risk by lengthening the process? Maybe, but you also make deploys longer, which means less stuff can get through in a day. Devs respond to that with larger PRs, for example, which increases the risk profile.

So we need to balance risk against deploy duration. In my experience, with a user base at our scale, large problems typically manifest quickly; the ones that take a lot longer to detect are generally more minor.

cutemonster · on Jan 27, 2021

> around 10-15ish minutes in canary

10-15 minutes is fast, I think.

Sounds as if you can do more than 100 deployments per day? -- but I guess you don't do that many?
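As a back-of-the-envelope check on that number, here is a minimal sketch. It assumes deploys run strictly one after another and that each one costs only its canary time, which understates the real pipeline since the rest of the rollout is not counted. The function name `max_serial_deploys` is made up for illustration.

```python
MINUTES_PER_DAY = 24 * 60


def max_serial_deploys(canary_minutes: float) -> int:
    """Upper bound on back-to-back deploys per day.

    Assumption: each deploy occupies the pipeline only for its canary
    time; build, full rollout, and idle gaps are ignored.
    """
    return int(MINUTES_PER_DAY // canary_minutes)


if __name__ == "__main__":
    # The two ends of the "10-15ish minutes" range quoted above.
    for minutes in (10, 15):
        print(f"{minutes} min canary: at most {max_serial_deploys(minutes)} deploys/day")
```

Under these assumptions the ceiling only clears 100 per day toward the 10 minute end of the range; at 15 minutes it falls a little short.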

paxys · on Jan 25, 2021

The problems that don't immediately manifest could very well take hours or days or longer. There has to be a limit, and 5 minutes is as good as any.

closeparen · on Jan 25, 2021

A lot of alerts use moving averages or sustain times to squelch transient noise. You have to wait for the max sustain time to pass before you can conclude that lack of alert = lack of problem.

That time could very well be 5 minutes, but the canary wait and the alert sustain times need to be coordinated.
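A rough sketch of that coordination check, not tied to any particular alerting system: the `AlertRule` type, its field names, and the optional evaluation-interval padding are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class AlertRule:
    name: str
    # How long the condition must hold before the alert fires (hypothetical field).
    sustain_seconds: int


def min_safe_wait(rules: Sequence[AlertRule], eval_interval_seconds: int = 0) -> int:
    """Shortest wait after which 'no alert' can be read as 'no problem'.

    Takes the longest sustain time and pads it by the alert evaluation
    interval, since a rule may only be checked that often.
    """
    longest = max((r.sustain_seconds for r in rules), default=0)
    return longest + eval_interval_seconds


def wait_covers_alerts(wait_seconds: int, rules: Sequence[AlertRule],
                       eval_interval_seconds: int = 0) -> bool:
    """True if the canary wait is at least as long as the minimum safe wait."""
    return wait_seconds >= min_safe_wait(rules, eval_interval_seconds)


if __name__ == "__main__":
    rules = [AlertRule("error_rate", 120), AlertRule("p99_latency", 300)]
    print(wait_covers_alerts(5 * 60, rules))                            # sustain times only
    print(wait_covers_alerts(5 * 60, rules, eval_interval_seconds=30))  # with padding
```

With these made-up rules a 5 minute wait just covers the longest sustain time, but not once any evaluation delay is added on top, which is the kind of mismatch the check is meant to surface.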

wdb · on Jan 25, 2021

Yeah, wouldn't you need some sort of minimum amount of traffic to be able to use canary deployment?
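To put rough numbers on that, here is a small sketch of how many requests a canary would see during its wait. The traffic figures are invented for illustration and are not from the thread; the function name is hypothetical too.

```python
def expected_canary_requests(total_rps: float, canary_fraction: float,
                             wait_seconds: float) -> float:
    """Expected number of requests routed to the canary during the wait.

    Assumes traffic is steady and split uniformly at random, which real
    load balancers only approximate.
    """
    return total_rps * canary_fraction * wait_seconds


if __name__ == "__main__":
    # Hypothetical sites: same 2% canary and 5 minute wait, very different traffic.
    for rps in (1000, 10):
        seen = expected_canary_requests(rps, 0.02, 5 * 60)
        print(f"{rps} req/s overall -> about {seen:.0f} canary requests")
```

The same 2% slice and 5 minute wait give a busy site thousands of canary requests to judge by, while a quiet one gets only a few dozen, too few for a small regression in error rate to stand out from noise.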