Something I found surprising is that a change to the GitHub codebase will be run in canary, get deployed to production, and then merged. I would have expected the PR to be merged first before it gets served to the public, so even if you have to `git revert` and undeploy it, you still have a record of every version that was seen by actual users, even momentarily.
Does anyone know the pros and cons of GitHub's approach?
This is known as “GitHub Flow” (https://guides.github.com/introduction/flow/). I was pretty surprised by it when I first joined GitHub but I’ve grown to love it. It makes rolling back changes much faster than having to open up a revert branch, get it approved, and deploy it. When something goes sideways, just deploy master / main, which is meant to always be in a safe state.
Before every deploy, your branch has master merged into it. There's some clever work by Hubot that, while you're in line to deploy, creates a version of your branch containing the potential new master / main. If conflicts arise, you fix them before it's your turn to deploy.
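Not our actual tooling, but the git operations it boils down to look roughly like this (branch names are made up; the queue/merge work itself is automated by Hubot):

```sh
# Build the state that would exist if your branch were merged into the
# would-be master, *before* the deploy happens.
git fetch origin
git checkout my-feature-branch      # illustrative branch name
git merge origin/master             # conflicts surface here, not after deploy
# This merged state is what gets deployed to canary/production; only once it
# looks healthy does the PR actually get merged into master.
```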
The deploy queue usually gets to be a couple of "trains" deep, which typically includes work from 4-10 devs and represents a couple of hours. We have had issues with it taking too long, but the work I wrote about has improved that! We continue to try to improve it.
I think this method is getting more popular by the day. IMHO, previously master was the branch you merged into before the deploy process; today this is reversed.
The main benefit is that other developers can rely on the master branch even more. They will know there will not be a revert on the master branch they pulled an hour ago and have already started coding on.
A `git revert` creates a new commit. To a developer, a revert commit appearing on master has the same effect as a pull request (or ten) being merged into it. If the revert affects code you’re working on, you will need to resolve conflicts, just like you would need to if a merged PR affected the same code.
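In plain git terms (the SHA is a placeholder):

```sh
git revert abc1234        # adds a *new* commit that undoes abc1234
git log --oneline -3      # the original commit is still in history, so anyone
                          # working on the affected code may hit conflicts
```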
While what you're saying is true in general, the point of the GitHub flow is that, to a developer relying on master, problematic code never made it in.
Agreed that any code can be added or removed, but those are 100% valid changes in GitHub flow.
I've worked with this approach for the past year: we have a branch named "release"; we deploy that, run all automated tests in production, wait a couple of hours for feedback (including from customer support), then merge it to master.
This means master is a history of proven stable builds. And in an emergency you don't have to think about what to roll back to, it's by default master.
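Sketched with plain git (branch names follow our convention; `deploy` stands in for the actual pipeline):

```sh
git checkout release                   # the branch we actually deploy
git merge --no-ff my-feature-branch    # candidate work lands on "release" first
deploy release                         # placeholder for the CD pipeline
# ...after automated tests in production and a few hours of feedback:
git checkout master
git merge --no-ff release              # master only ever receives proven builds
```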
It's just a convention; it does not have any real benefits or downsides. You could do the same with another branch named stable, or with tags.
> a record of every version that was seen by actual users
This is covered by the release branches and also by the CD pipeline.
> I would have expected the PR to be merged first before it gets served to the public, so even if you have to `git revert` and undeploy it, you still have a record of every version that was seen by actual users, even momentarily.
This sounds rather terrifying to me as well. Hopefully there's some sort of system for keeping around all those individual branches that made it out to prod, however briefly, for future debugging/auditing purposes if you ever need them.
It's never fun to have to be doing "what code was running at this time" investigations, but every once in a while it's the only way to really get to the root of something.
My previous company had a similar flow, but fixed the terrifying part by merging (off the git hosting, i.e. in memory) ALL open PRs into master and deploying that straight to the staging server. Any PR can be marked as excluded from a deploy. All PRs are ALWAYS based off master. tl;dr: master-based pull requests as the source of a release.
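A rough reconstruction with plain git (their tooling did this off the git host; `list_open_prs` and `deploy` are made-up placeholders):

```sh
git checkout -B staging-candidate origin/master     # always start from master
for branch in $(list_open_prs --not-excluded); do   # placeholder helper
  git merge --no-ff "$branch" || git merge --abort  # skip PRs that don't merge cleanly
done
deploy staging-candidate                            # placeholder deploy step
```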
We also make a release branch (releases/x.y.z), tag a release candidate in it (x.y.z-RCn), build it, deploy it, wait for a bit (a canary stage of sorts), and if we did not need to roll back, we then merge it into master.
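In git terms (version numbers are just for illustration):

```sh
git checkout -b releases/1.4.0 origin/master
git tag 1.4.0-RC1                       # tag the release candidate
# build and deploy the tagged RC, wait through the canary-ish period...
git checkout master
git merge --no-ff releases/1.4.0        # merged only once we know we won't roll back
```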
What is the best chatops tool right now? I don't see a lot of popularity around chatops; it's usually some version of GitHub-based triggers.
It's funny that GitHub themselves use chatops. I think that's a very nice take, especially for early-stage startups. Anyone else use anything like it?
We're just starting beta, but my friend Phil and I both worked together at GitHub and are building what we hope to be a better Hubot at https://ab.bot right now.
It's missing some of the chatops stuff that is mentioned in the blog post, but since we support a lot more languages than Hubot, we're hoping it's only a matter of time before someone in our community builds a better replacement deployment script (or we'll do it while building out sample scripts :))
Can anyone explain why they might go with a slack based deployment system as opposed to something more robust like CircleCI or Jenkins? Is it mainly about the simplicity of it?
As a devops person myself, I am super skeptical that there is any good reason to do a chatops deploy. My guess is "new toys are cool" / "Want this on my resume"
To be clear, it's hopefully just some connector that does slack message -> triggers jenkins job.
But from a security, compliance, reliability, debuggability, auditability perspective I think it's inferior. Not to mention an inferior interface.
> My guess is "new toys are cool" / "Want this on my resume"
Whenever I read comments like this I'm always deeply suspicious of the commenter (is that how you justify trying/adopting tech?) or their employer (are they so draconian in tech/design choices that everything is frozen for good?). I'm not trying to cast aspersions on you or your employer directly... but it's fascinating to me to see such a myopic take on a problem space I hope you'd agree is very much not one-approach-fits-all. I'm surprised to hear about their flow too, but my more charitable assumption is that their teams have tried different things and settled on an evolving process that works for them. They're proud enough to boast about it from the corporate blog; it can't be entirely a lark.
Chatops deploys aren't really new toys; a place I worked at was doing them around 2013/14.
We liked it because the chat history you see is essentially a deploy history: no need to log in to some other website and check some obscure logs page to see who did what. We did end up having to debug the service that processed the chat messages maybe once, but we never ran into an issue when we had to deploy a hotfix.
What I do is have all jenkins deploys send a record to the #deploys channel (Service X, version Q deployed by person Y completed successfully in Z minutes), which comes for free with a tiny jenkins plugin.
However one of the unicorns I worked at deleted all slack messages after 3 months for legal reasons, as one example. Also, slack has periodic outages.
I think a lot of people underutilize jenkins, but once you're handy with it (and get over its god-awful ui) you never go back.
I really don't see how the chatops approach highlighted in the blog post changes anything. It seems that they're typing a command in Slack, and this triggers a pipeline. Which is something I've been wanting to introduce for our team.
Our UAT environment is deployed with fixed versions, and is only updated either after a sprint or when the business wants to test new features. Generally this is done by someone from the business asking to deploy a new version, and a developer then manually triggering the process. I see no reason why the business wouldn't be able to do this through Slack, without a developer acting as a middleman.
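A minimal version of that Slack-to-Jenkins trigger really is tiny; the bot just calls Jenkins' remote-trigger endpoint (job name, parameter, and credentials below are placeholders, and depending on your Jenkins security settings a CSRF crumb may also be needed):

```sh
# What the bot runs when it sees e.g. ".deploy uat 1.8.3" in the channel
curl -fsS -X POST \
  --user "bot-user:$JENKINS_API_TOKEN" \
  "https://jenkins.example.com/job/deploy-uat/buildWithParameters?VERSION=1.8.3"
```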
Yeah where I work product people tend to deploy UAT or QA environments. They do it in jenkins, because a stack consists of multiple services, and they may want to choose which branch to deploy. It would be cumbersome to type a combination of branch names in chat.
However, if your UAT is a manual refresh with no branch names, for example, that seems perfectly reasonable (so long as it's triggering a pipeline like you mentioned).
However if I worked where you worked and you wanted slack to deploy prod, I'd probably try to talk you out of it.
Good to get some insight into this; in our case, we always deploy our develop branch to UAT, because our releasable code is all the business cares about. We have another SIT environment that we sometimes use for feature branches.
My annoyance at the moment is that the business side will often ask "have we deployed the latest code to UAT?", to which I quickly open Jenkins, check when the latest job ran, and get back to them. I have tried just linking the Jenkins URL back to them, as if to say, "look it up yourself". But I suspect business people just don't want to touch Jenkins, because it's a "technical tool". So my idea has always been to build a simple chat bot, where they can ask when the latest deployment was, and where they'll be able to trigger a new UAT deployment.
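Even the "when was the last deploy?" half is just a Jenkins API call the bot could wrap (URL, job name, and credentials are placeholders):

```sh
# Last completed run of the UAT deploy job, as JSON the bot can summarise
curl -fsS --user "bot-user:$JENKINS_API_TOKEN" \
  "https://jenkins.example.com/job/deploy-uat/lastBuild/api/json" |
  jq '{result, timestamp, displayName}'
```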
> However if I worked where you worked and you wanted slack to deploy prod, I'd probably try to talk you out of it.
We have pretty strict deployment processes for prod, where other teams do the deployments, and it's not even allowed to automate that.
What I would consider doing in your use case is having every merge to the "develop" branch automatically trigger a deploy to UAT (another option is to set up an automatic build every night). There's a Jenkins plugin for that.
My team recently put in automation so that we use CircleCI for the staging deployment, have it wait for manual approval, then deploy to production. However, we can also give the Slack staging deployment message a +1 reaction, which will automatically approve the production deployment for CircleCI. This way, we get an easy dev UX but all the CI features of CircleCI.
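The glue for the +1 reaction is small: a handler for Slack's reaction event that calls CircleCI's approval endpoint. Something along these lines, though the IDs are placeholders and the exact endpoint shape should be checked against the CircleCI v2 API docs:

```sh
# Approve the pending "hold" job once the staging message gets a +1
curl -fsS -X POST \
  -H "Circle-Token: $CIRCLECI_TOKEN" \
  "https://circleci.com/api/v2/workflow/$WORKFLOW_ID/approve/$APPROVAL_REQUEST_ID"
```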
It's mainly the simplicity of the deployment system, as it's inline and visible, coupled with habit. In actuality, that is just what _can_ trigger the deploy; the actual deploy is based on an internal deploy application, and deploys can be triggered from there as well.
There's easy transparency amongst multiple teams, without the other teams needing accounts on CircleCI or Jenkins. This works while the deploy is in flight, provides timestamped logs if there's an incident, and can be useful for tracking history. It's also clear who kicked off the deploy.
Was some mix of decoupling / parallelizing the Actions service considered?
I believe the value of dogfooding would be immense. Not only could you become the customer (massive reduction to the deploy/measure feedback loop), but it would be a key marketing move.
On top of that, the GUI that will now require critical care and development is essentially a clone of what is offered by Github Actions.
Yes there was. The change would have been quite a lot to do at once, and we aren't ready to safely add that circular dependency yet. We aren't done on this path, and this certainly won't be the last iteration. I suspect that iteration will come at some point after we can figure out the circular dependency issue.
It also makes sense internally. Our Actions team is much larger than the team that manages the project I wrote/lead, so it also makes logistical sense IMO.
We also started before the CD product was ready on actions and we directly influenced it, so yea haha
That seems like an extremely good idea actually, since if you dogfood your own releasing service then you can't fix it anymore if you accidentally bring down the service.
You just run the previous version of the production stack in your "dogfood/operations" stack. Once you've fully rolled out production and have vetted it, you can upgrade the other one to match production.
With large codebases maintained by many people, sometimes it's difficult to "just" do things like that. It's a bit weird to think that no one within that large group of professional developers has thought of such a simple, obvious solution. It's probably not that simple.
And then you get hit with a subtle bug that utterly nukes your system when the year changes from 2020 to 2021. Now you can't deploy anything because both systems are down. Given Github's scale and number of engineers it's basically guaranteed that they'll hit some sort of bug like that eventually. Not all bugs act immediately.
I did a short stint at Wayfair, and about 1-2 months in, there was a deploy that somehow got past the test flow and, when deployed, took down their entire site. So badly that they couldn't even deploy the fix.
It sounds like a good idea, but it is a very bad one. I worked on a git deploy product and dogfooding was very appealing; it turned out to be confusing as fuck. And when things go wrong it's even more confusing: it's like a horror time-travel movie where you can't find the origin to fix the consequences.
More than likely it's because that's what they used before they got bought and they haven't been forced to migrate over yet. They also seem to have bots, which are not really a direct copy-and-paste into MS Teams, and converting over likely isn't a high priority.
My understanding of Microsoft policy is that it's easier to buy MacBooks for your developers than it is to buy Slack. Which makes sense, because they're currently going head to head with Slack for market share, while a few MacBooks don't threaten their credibility when selling Windows.
My guess is that github was using slack before they were bought and inertia is a thing. I'm sure there are people within the parent company that would like to see them transition, but I'm sure there's a ton of resistance, especially "on the ground" at github. Buyouts are a delicate thing, they don't want to ruin github by trying to force it to change too quickly.
No, but why would you use a product that costs $7 or whatever per employee (so let's say 200 employees, so $1,400 a month) when you can use a free one?
$1400 a month is less than a rounding error for a company that size. If you can get even the tiniest bit of extra developer productivity from the software then it is worth it.
And Github will definitely still have to "pay" for Teams, whether that is internal accounting or actual money being exchanged.
Speaking from experience, just because you work for a company doesn't mean you can use all of their products (or that you'll even get favorable pricing on them).
On the other hand sometimes it means you MUST use the company products.
Consulted for a sub-sub-sub-subsidiary of Toshiba. All computer equipment had to be from Toshiba - the closest place to get Toshiba laptops was two COUNTRIES over.
They even had to tape over non-Toshiba branding from external displays that would be visible.
My uncle used to work at Compaq (back before they got bought by HP). When their computers broke, his team had to pay their support staff to get them fixed. (Via internal budgeting). But the support team knew internal customers would call them anyway and it was still compaq’s money, so they charged several times more for internal support calls than normal support calls.
My uncle’s team was having none of that, so they paid an external computer repair service to fix their computers. The external repair service subcontracted to compaq’s internal people anyway, so when their computers broke they called up (and paid) external consultants. Who in turn called compaq’s internal support team, who came downstairs and fixed their computers at a competitive price.
At Microsoft if you build a product using Azure (and if you want to use the cloud you MUST use Azure, you're not going to get approval to write a check to AWS) the costs come out of your budget. And it's taken seriously, to the point where teams will very much emphasize managing costs (what will this new feature cost on our Azure bill? Can we build it more efficiently? Oh wow, that refactor saved us 100k/month in cloud costs, don't forget that when we start talking about promotions...)
That makes sense since the amount you could use is variable. I was thinking more like somebody couldn't get a free word license at a MS subsidiary or something.
When I worked at MS Azure, we had to pay for Azure servers! (I believe our team had a $5k/month Azure bill.) It's part of internal budgeting, so that people within MS don't splurge on expensive things (because it does cost MS money for each person on Teams).
Very mundane I'm afraid. Worked for a MS subsidiary, on an online game for xbox and PC. Developed on windows, using visual studio and deployed on azure, used TFS for bug tracking. All of the above costs were tracked rigorously, and charged to the project. Most frustrating was complying with the visual studio licenses across the board, with no assistance from the licensing team. We had an account manager for all of the above but my understanding is that he was more of an auditor than anything else.
Unrelated to software but the company my dad works for (motor repair) has to buy all its parts from its own distribution arm, at the marked up price. He then has to turn a profit on those parts as well as pricing the labour.
If cost price is £5 and the markup is 20%, he has to pay £6 to get the part, then charge £7.20 on the invoice to the customer. I’ll let you guess what that does to tender bids ;-)
This is generally a good flow, but something that absolutely baffles me is that GitHub changes the commit SHAs when branches are rebase-merged from PRs[0]. This totally breaks a fundamental notion in Git that the same work, based on the same commits, has the same hash. It also makes it incredibly difficult to determine which PR branches have been merged into master.
That is not something Github is doing, it's fundamental to how git works that different commits have different hashes - and rebasing creates different commits (they have different parents).
Not rebase-merging would probably suit your workflow better.
I really don't think this is a "fundamental notion in Git." They ship git-patch-id to do what you're trying to do and frequently used internal tools like git-cherry use it. It's also not a true statement for LKML-like workflows that are landing patches off of mailing lists with git-am or git-apply.
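For the "which of these branches actually landed?" problem, git ships tooling that compares patch content rather than SHAs (the SHA and branch names below are placeholders):

```sh
# Same diff content => same patch-id, even when a rebase changed the SHA
git show <commit-sha> | git patch-id --stable

# Commits on my-branch whose changes already exist in master are marked "-"
git cherry -v master my-branch
```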
a ton of internal debug logs :) We log all state machine transitions, poll the state machine to make sure it stays running, and do a ton of checks in the process. The debug button is a timeline of all of that. Basically trace-level logs
It's actually longer than 5 minutes. There is the duration of the 2% canary deploy where we start to see traffic pick up, a 5 minute wait, then a 20% "deploy", and another 5 minute wait. All in all this comes out to around 10-15ish minutes in canary. This is a stage where we can almost instantly shut off traffic to the canary deploy.
Could we reduce risk by lengthening the process? Maybe, but you also make deploys longer, which means less stuff can get through in a day. That pushes devs toward larger PRs, for example, which increases the risk profile.
So we need to balance deploy duration against risk. In my experience, with a user base at our scale, large problems typically manifest quickly; problems that take a lot longer to detect are generally more minor.
A lot of alerts use moving averages or sustain times to squelch transient noise. You have to wait for the max sustain time to pass before you can conclude that lack of alert = lack of problem.
That time could very well be 5 minutes but the two need to be coordinated.
That's pretty awesome to go from nothing to full production in 15 minutes. I would like to encourage others to bear in mind that simply adding more time wouldn't significantly decrease the risk of things going wrong.