Something I found surprising is that a change to the GitHub codebase will be run in canary, get deployed to production, and then merged. I would have expected the PR to be merged first before it gets served to the public, so even if you have to `git revert` and undeploy it, you still have a record of every version that was seen by actual users, even momentarily.
Does anyone know the pros and cons of GitHub's approach?
This is known as “GitHub Flow” (https://guides.github.com/introduction/flow/). I was pretty surprised by it when I first joined GitHub but I’ve grown to love it. It makes rolling back changes much faster than having to open up a revert branch, get it approved, and deploy it. When something goes sideways, just deploy master / main, which is meant to always be in a safe state.
Before every deploy, your branch has master merged into it. There's some clever work by Hubot that, while you're in line to deploy, creates a version of your branch containing the potential new master / main. If conflicts arise, you fix them before it's your turn to deploy.
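Not our actual tooling, but the git operations it boils down to look roughly like this (branch names are made up; the queue/merge work itself is automated by Hubot):

```sh
# Build the state that would exist if your branch were merged into the
# would-be master, *before* the deploy happens.
git fetch origin
git checkout my-feature-branch      # illustrative branch name
git merge origin/master             # conflicts surface here, not after deploy
# This merged state is what gets deployed to canary/production; only once it
# looks healthy does the PR actually get merged into master.
```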
The deploy queue usually gets to be a couple of "trains" deep, which typically includes work from 4-10 devs and represents a couple of hours. We have had issues with it taking too long, but the work I wrote about has improved that! We continue to try to improve it.
I think this method is getting more popular by the day. IMHO, previously master was the branch you merged into before the deploy process; today this is reversed.
The main benefit is that other developers can rely on the master branch even more. They will know there will not be a revert on the master branch they pulled an hour ago and have already started coding on.
A `git revert` creates a new commit. To a developer, a revert commit appearing on master has the same effect as a pull request (or ten) being merged into it. If the revert affects code you’re working on, you will need to resolve conflicts, just like you would need to if a merged PR affected the same code.
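In plain git terms (the SHA is a placeholder):

```sh
git revert abc1234        # adds a *new* commit that undoes abc1234
git log --oneline -3      # the original commit is still in history, so anyone
                          # working on the affected code may hit conflicts
```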
While what you're saying is true in general, the point of the GitHub flow is that, to a developer relying on master, problematic code never made it in.
Agreed that any code can be added or removed, but those are 100% valid changes in GitHub flow.
I've worked with this approach for the past year: we have a branch named "release"; we deploy that, run all automated tests in production, wait a couple of hours for feedback (including from customer support), then merge it to master.
This means master is a history of proven stable builds. And in an emergency you don't have to think about what to roll back to, it's by default master.
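Sketched with plain git (branch names follow our convention; `deploy` stands in for the actual pipeline):

```sh
git checkout release                   # the branch we actually deploy
git merge --no-ff my-feature-branch    # candidate work lands on "release" first
deploy release                         # placeholder for the CD pipeline
# ...after automated tests in production and a few hours of feedback:
git checkout master
git merge --no-ff release              # master only ever receives proven builds
```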
It's just a convention; it does not have any real benefits or downsides. You could do the same with another branch named stable, or with tags.
> a record of every version that was seen by actual users
This is covered by the release branches and also by the CD pipeline.
> I would have expected the PR to be merged first before it gets served to the public, so even if you have to `git revert` and undeploy it, you still have a record of every version that was seen by actual users, even momentarily.
This sounds rather terrifying to me as well. Hopefully there's some sort of system for keeping around all those individual branches that made it out to prod, however briefly, for future debugging/auditing purposes if you ever need them.
It's never fun to have to be doing "what code was running at this time" investigations, but every once in a while it's the only way to really get to the root of something.
My previous company had a similar flow, but fixed the terrifying part by merging (off the git hosting, i.e. in memory) ALL open PRs into master and deploying that straight to the staging server. Any PR can be marked as excluded from a deploy. All PRs are ALWAYS based off master. tl;dr: master-based pull requests as the source of a release.
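A rough reconstruction with plain git (their tooling did this off the git host; `list_open_prs` and `deploy` are made-up placeholders):

```sh
git checkout -B staging-candidate origin/master     # always start from master
for branch in $(list_open_prs --not-excluded); do   # placeholder helper
  git merge --no-ff "$branch" || git merge --abort  # skip PRs that don't merge cleanly
done
deploy staging-candidate                            # placeholder deploy step
```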
We also make a release branch (releases/x.y.z), tag a release candidate in it (x.y.z-RCn), build it, deploy it, wait for a bit (a canary stage of sorts), and if we did not need to roll back, we then merge it into master.
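In git terms (version numbers are just for illustration):

```sh
git checkout -b releases/1.4.0 origin/master
git tag 1.4.0-RC1                       # tag the release candidate
# build and deploy the tagged RC, wait through the canary-ish period...
git checkout master
git merge --no-ff releases/1.4.0        # merged only once we know we won't roll back
```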
What is the best chatops tool right now? I don't see a lot of popularity around chatops; it's usually some version of GitHub-based triggers.
It's funny that GitHub themselves use chatops. I think that's a very nice take, especially for early-stage startups. Anyone else use anything like it?
We're just starting beta, but my friend Phil and I both worked together at GitHub and are building what we hope to be a better Hubot at https://ab.bot right now.
It's missing some of the chatops stuff that is mentioned in the blog post, but since we support a lot more languages than Hubot, we're hoping it's only a matter of time before someone in our community builds a better replacement deployment script (or we'll do it while building out sample scripts :))
Can anyone explain why they might go with a slack based deployment system as opposed to something more robust like CircleCI or Jenkins? Is it mainly about the simplicity of it?
As a devops person myself, I am super skeptical that there is any good reason to do a chatops deploy. My guess is "new toys are cool" / "Want this on my resume"
To be clear, it's hopefully just some connector that does slack message -> triggers jenkins job.
But from a security, compliance, reliability, debuggability, auditability perspective I think it's inferior. Not to mention an inferior interface.
> My guess is "new toys are cool" / "Want this on my resume"
Whenever I read comments like this I'm always deeply suspicious of the commenter (is that how you justify trying/adopting tech?) or their employer (are they so draconian in tech/design choices that everything is frozen for good?). I'm not trying to cast aspersions on you or your employer directly... but it's fascinating to me to see such a myopic take on a problem space I hope you'd agree is very much not one-approach-fits-all. I'm surprised to hear about their flow too, but my more charitable assumption is that their teams have tried different things and settled on an evolving process that works for them. They're proud enough to boast about it from the corporate blog; it can't be entirely a lark.
Chatops deploys aren't really new toys; a place I worked at was doing them around 2013/14.
We liked it because the chat history you see is essentially a deploy history: no need to log in to some other website and check some obscure logs page to see who did what. We did end up having to debug the service that processed the chat messages maybe once, but we never ran into an issue when we had to deploy a hotfix.
What I do is have all jenkins deploys send a record to the #deploys channel (Service X, version Q deployed by person Y completed successfully in Z minutes), which comes for free with a tiny jenkins plugin.
However one of the unicorns I worked at deleted all slack messages after 3 months for legal reasons, as one example. Also, slack has periodic outages.
I think a lot of people underutilize jenkins, but once you're handy with it (and get over its god-awful ui) you never go back.
I really don't see how the chatops approach highlighted in the blog post changes anything. It seems that they're typing a command in Slack, and this triggers a pipeline. Which is something I've been wanting to introduce for our team.
Our UAT environment is deployed with fixed versions, and is only updated either after a sprint or when the business wants to test new features. Generally this is done by someone from the business asking to deploy a new version, and a developer then manually triggering the process. I see no reason why the business wouldn't be able to do this through Slack, without a developer acting as a middleman.
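A minimal version of that Slack-to-Jenkins trigger really is tiny; the bot just calls Jenkins' remote-trigger endpoint (job name, parameter, and credentials below are placeholders, and depending on your Jenkins security settings a CSRF crumb may also be needed):

```sh
# What the bot runs when it sees e.g. ".deploy uat 1.8.3" in the channel
curl -fsS -X POST \
  --user "bot-user:$JENKINS_API_TOKEN" \
  "https://jenkins.example.com/job/deploy-uat/buildWithParameters?VERSION=1.8.3"
```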
Yeah where I work product people tend to deploy UAT or QA environments. They do it in jenkins, because a stack consists of multiple services, and they may want to choose which branch to deploy. It would be cumbersome to type a combination of branch names in chat.
However, if your UAT is a manual refresh with no branch names, for example, that seems perfectly reasonable (so long as it's triggering a pipeline like you mentioned).
However if I worked where you worked and you wanted slack to deploy prod, I'd probably try to talk you out of it.
Good to get some insight into this; in our case, we always deploy our develop branch to UAT, because our releasable code is all the business cares about. We have another SIT environment that we sometimes use for feature branches.
My annoyance at the moment is that the business side will often ask "have we deployed the latest code to UAT?", to which I quickly open Jenkins, check when the latest job ran, and get back to them. I have tried just linking the Jenkins URL back to them, as if to say, "look it up yourself". But I suspect business people just don't want to touch Jenkins, because it's a "technical tool". So my idea has always been to build a simple chat bot, where they can ask when the latest deployment was, and where they'll be able to trigger a new UAT deployment.
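Even the "when was the last deploy?" half is just a Jenkins API call the bot could wrap (URL, job name, and credentials are placeholders):

```sh
# Last completed run of the UAT deploy job, as JSON the bot can summarise
curl -fsS --user "bot-user:$JENKINS_API_TOKEN" \
  "https://jenkins.example.com/job/deploy-uat/lastBuild/api/json" |
  jq '{result, timestamp, displayName}'
```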
> However if I worked where you worked and you wanted slack to deploy prod, I'd probably try to talk you out of it.
We have pretty strict deployment processes for prod, where other teams do the deployments, and it's not even allowed to automate that.
What I would consider doing in your use case is having every merge to the "develop" branch automatically trigger a deploy to UAT (another option is to set up an automatic build every night). There's a Jenkins plugin for that.
My team recently put in automation so that we use CircleCI for the staging deployment, have it wait for manual approval, then deploy to production. However, we can also give the Slack staging deployment message a +1 reaction, which will automatically approve the production deployment for CircleCI. This way, we get an easy dev UX but all the CI features of CircleCI.
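The glue for the +1 reaction is small: a handler for Slack's reaction event that calls CircleCI's approval endpoint. Something along these lines, though the IDs are placeholders and the exact endpoint shape should be checked against the CircleCI v2 API docs:

```sh
# Approve the pending "hold" job once the staging message gets a +1
curl -fsS -X POST \
  -H "Circle-Token: $CIRCLECI_TOKEN" \
  "https://circleci.com/api/v2/workflow/$WORKFLOW_ID/approve/$APPROVAL_REQUEST_ID"
```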
It's mainly the simplicity of the deployment system, as it's inline and visible, coupled with habit. In actuality, that is just what _can_ trigger the deploy; the actual deploy is based on an internal deploy application, and deploys can be triggered from there as well.
There's easy transparency amongst multiple teams, without the other teams needing accounts on CircleCI or Jenkins. This works while the deploy is in flight, provides timestamped logs if there's an incident, and can be useful for tracking history. It's also clear who kicked off the deploy.
Was some mix of decoupling / parallelizing the Actions service considered?
I believe the value of dogfooding would be immense. Not only could you become the customer (massive reduction to the deploy/measure feedback loop), but it would be a key marketing move.
On top of that, the GUI that will now require critical care and development is essentially a clone of what is offered by Github Actions.
Yes there was. The change would have been quite a lot to do at once, and we aren't ready to safely add that circular dependency yet. We aren't done on this path, and this certainly won't be the last iteration. I suspect that iteration will come at some point after we can figure out the circular dependency issue.
It also makes sense internally. Our Actions team is much larger than the team that manages the project I wrote/lead, so it also makes logistical sense IMO.
We also started before the CD product was ready on actions and we directly influenced it, so yea haha
That seems like an extremely good idea actually, since if you dogfood your own releasing service then you can't fix it anymore if you accidentally bring down the service.
You just run the previous version of the production stack in your "dogfood/operations" stack. Once you've fully rolled out production and have vetted it, you can upgrade the other one to match production.
With large codebases maintained by many people, sometimes it's difficult to "just" do things like that. It's a bit weird to think that no one within that large group of professional developers has thought of such a simple, obvious solution. It's probably not that simple.
And then you get hit with a subtle bug that utterly nukes your system when the year changes from 2020 to 2021. Now you can't deploy anything because both systems are down. Given Github's scale and number of engineers it's basically guaranteed that they'll hit some sort of bug like that eventually. Not all bugs act immediately.
I did a short stint at Wayfair, and about 1-2 months in, there was a deploy that somehow got past the test flow and, when deployed, took down their entire site. So badly that they couldn't even deploy the fix.
It sounds like a good idea, but it is a very bad one. I worked on a git deploy product and dogfooding was very appealing; it turned out to be confusing as fuck. And when things go wrong it's even more confusing: it's like a horror time-travel movie where you can't find the origin to fix the consequences.
More than likely it's because that's what they used before they got bought and they haven't been forced to migrate over yet. They also seem to have bots, which are not really a direct copy-and-paste into MS Teams, and converting over likely isn't a high priority.
My understanding of Microsoft policy is that it's easier to buy MacBooks for your developers than it is to buy Slack. Which makes sense, because they're currently going head to head with Slack for market share, while a few MacBooks don't threaten their credibility when selling Windows.
My guess is that github was using slack before they were bought and inertia is a thing. I'm sure there are people within the parent company that would like to see them transition, but I'm sure there's a ton of resistance, especially "on the ground" at github. Buyouts are a delicate thing, they don't want to ruin github by trying to force it to change too quickly.
No, but why would you use a product that costs $7 or whatever per employee (so let's say 200 employees, so $1,400 a month) when you can use a free one?
$1400 a month is less than a rounding error for a company that size. If you can get even the tiniest bit of extra developer productivity from the software then it is worth it.
And Github will definitely still have to "pay" for Teams, whether that is internal accounting or actual money being exchanged.
Speaking from experience, just because you work for a company doesn't mean you can use all of their products (or that you'll even get favorable pricing on them).
On the other hand sometimes it means you MUST use the company products.
Consulted for a sub-sub-sub-subsidiary of Toshiba. All computer equipment had to be from Toshiba - the closest place to get Toshiba laptops was two COUNTRIES over.
They even had to tape over non-Toshiba branding from external displays that would be visible.
My uncle used to work at Compaq (back before they got bought by HP). When their computers broke, his team had to pay their support staff to get them fixed. (Via internal budgeting). But the support team knew internal customers would call them anyway and it was still compaq’s money, so they charged several times more for internal support calls than normal support calls.
My uncle’s team was having none of that, so they paid an external computer repair service to fix their computers. The external repair service subcontracted to compaq’s internal people anyway, so when their computers broke they called up (and paid) external consultants. Who in turn called compaq’s internal support team, who came downstairs and fixed their computers at a competitive price.
At Microsoft if you build a product using Azure (and if you want to use the cloud you MUST use Azure, you're not going to get approval to write a check to AWS) the costs come out of your budget. And it's taken seriously, to the point where teams will very much emphasize managing costs (what will this new feature cost on our Azure bill? Can we build it more efficiently? Oh wow, that refactor saved us 100k/month in cloud costs, don't forget that when we start talking about promotions...)
That makes sense since the amount you could use is variable. I was thinking more like somebody couldn't get a free word license at a MS subsidiary or something.
When I worked at MS Azure, we had to pay for Azure servers! (I believe our team had a $5k/month Azure bill.) It's part of internal budgeting, so that people within MS don't splurge on expensive things (because it does cost MS money for each person on Teams).
Very mundane I'm afraid. Worked for a MS subsidiary, on an online game for xbox and PC. Developed on windows, using visual studio and deployed on azure, used TFS for bug tracking. All of the above costs were tracked rigorously, and charged to the project. Most frustrating was complying with the visual studio licenses across the board, with no assistance from the licensing team. We had an account manager for all of the above but my understanding is that he was more of an auditor than anything else.
Unrelated to software but the company my dad works for (motor repair) has to buy all its parts from its own distribution arm, at the marked up price. He then has to turn a profit on those parts as well as pricing the labour.
If cost price is £5 and the markup is 20%, he has to pay £6 to get the part, then charge £7.20 on the invoice to the customer. I’ll let you guess what that does to tender bids ;-)
This is generally a good flow, but something that absolutely baffles me is that GitHub changes the commit SHAs when branches are rebase-merged from PRs[0]. This totally breaks a fundamental notion in Git that the same work, based on the same commits, has the same hash. It also makes it incredibly difficult to determine which PR branches have been merged into master.
That is not something Github is doing, it's fundamental to how git works that different commits have different hashes - and rebasing creates different commits (they have different parents).
Not rebase-merging would probably suit your workflow better.
I really don't think this is a "fundamental notion in Git." They ship git-patch-id to do what you're trying to do and frequently used internal tools like git-cherry use it. It's also not a true statement for LKML-like workflows that are landing patches off of mailing lists with git-am or git-apply.
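For the "which of these branches actually landed?" problem, git ships tooling that compares patch content rather than SHAs (the SHA and branch names below are placeholders):

```sh
# Same diff content => same patch-id, even when a rebase changed the SHA
git show <commit-sha> | git patch-id --stable

# Commits on my-branch whose changes already exist in master are marked "-"
git cherry -v master my-branch
```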
a ton of internal debug logs :) We log all state machine transitions, poll the state machine to make sure it stays running, and do a ton of checks in the process. The debug button is a timeline of all of that. Basically trace-level logs
It's actually longer than 5 minutes. There is the duration of the 2% canary deploy where we start to see traffic pick up, a 5 minute wait, then a 20% "deploy", and another 5 minute wait. All in all this comes out to around 10-15ish minutes in canary. This is a stage where we can almost instantly shut off traffic to the canary deploy.
Could we reduce risk by lengthening the process? Maybe, but you also make deploys longer, which means less stuff can get through in a day. That pushes devs toward larger PRs, for example, which increases the risk profile.
So we need to balance deploy duration against risk. In my experience, with a user base at our scale, large problems typically manifest quickly; problems that take a lot longer to detect are generally more minor.
A lot of alerts use moving averages or sustain times to squelch transient noise. You have to wait for the max sustain time to pass before you can conclude that lack of alert = lack of problem.
That time could very well be 5 minutes but the two need to be coordinated.
That's pretty awesome to go from nothing to full production in 15 minutes. I would like to encourage others to bear in mind that simply adding more time wouldn't significantly decrease the risk of things going wrong.