On Dec 6th, AMD released the MI300X. This is effectively their entry into the AIA market ("AI accelerator" is the new GPU). The previous-generation MI250 cards were good, but this release leapfrogs everything. These cards offer 192 GB of memory and very strong performance, with 8 cards in a chassis. Great for training and inference.
Sure, we all know that the AMD software needs work, but I think the long bet here is on Lisa Su. It reminds me of the Mac vs. Windows days. If NVIDIA owns this much of the market, I'd want to diversify my AI business away from a single point of failure.
So, I've been working to launch a business to buy up as many of these chips as I can and make them available as bare-metal rentals, even with IPMI access. This is something that hasn't been done before: usually these AMD cards end up in supercomputers (e.g. Frontier) and/or are accessible to only a few people. Azure is catching on with their recent product, but that is a big cloud... I believe people will still want private clouds.
To start with, we're loading up the MI300X chassis with tons of RAM/NVMe, top-end AMD CPUs, and dual 400G networking. We've also got top-end management servers for Ray/Kubernetes/Slurm. We are interested in feedback on what people want and can be agile enough to customize our purchases for your needs.
My background is that I built a cluster of 150k AMD GPUs across 7 data centers. Deploying and running a lot of compute is something I've gotten very good at. Feel free to reach out.
"Needs work" is an understatement. ROCm only just came out for Windows(!), and so far everyone in the LLM and SD scene says that ROCm on Linux is a pain to get working: it requires a lot of hackery and override environment variables, is buggy, and supports only a small handful of cards, whereas CUDA runs on nearly everything NVIDIA makes, from laptop GPUs to datacenter compute cards. And there's very little support on the third-party software side at the moment.
That "no consumer hardware support" issue may sound trivial, but people getting into ML/AI, grad students, etc. want to be able to mess around and develop/prototype on local hardware they already own.
You cannot do that with ROCm, because it supports only a very, very small number of consumer-level cards, and, oddly, it's a mix of the very high end and the low end with nothing in the middle.
AMD is also massively behind hardware-wise: only the current 7xxx-series cards, which just came out, have AI-specific hardware, whereas NVIDIA has had tensor cores for three generations of cards.
But of course AMD can't do anything right, so they nerfed the GPU's graphics-processing cores when they added the AI hardware, making the cards a worse deal from a pure gaming standpoint. The 7000-series cards are at best a tiny bump over, and in many games worse than, their 6000-series equivalents. Their only advantages are better hardware video encoding and somewhat better power consumption.
All of what you're saying is true and I can't argue that.
The only difference is that previously you couldn't even rent time on these super-high-end AIAs. They were reserved for supercomputers only. Even the MI250X SKU is government/research only; even I am unable to buy that over the MI250.
So, we need to work on building that developer flywheel. My view is that if you can rent some time on one of these systems, that is a good first step. Nowhere else can you load a model into 192 GB of memory on a single top-end system.
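For a sense of scale, a rough capacity bound on what fits in 192 GB (this ignores activations and KV-cache overhead, which are workload-dependent assumptions):

```python
# Rough sketch: upper bound on model size that fits in 192 GB of
# accelerator memory, ignoring activation/KV-cache overhead.
MEM_GB = 192

def max_params_billions(mem_gb: float, bytes_per_param: int) -> float:
    """Max parameter count (in billions) that fits in mem_gb."""
    return mem_gb * 1e9 / bytes_per_param / 1e9

print(max_params_billions(MEM_GB, 2))  # fp16/bf16: 96.0 (~96B params)
print(max_params_billions(MEM_GB, 4))  # fp32: 48.0 (~48B params)
```

Roughly speaking, a ~70B-parameter model in fp16 (~140 GB of weights) fits on a single card with headroom left over, which is the point of the 192 GB figure.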
I think that PyTorch is part of the puzzle and it certainly helps that it is supported by AMD [0]. That said, there is code that needs to run closer to the metal too.
It was very difficult for me to get access to these chips and I'm building a whole business around it. It won't be possible for individuals for a long long time, if ever. Maybe in a few years on the secondary market.
Case in point... try searching ebay for the previous generation... MI250.
If you want access to these chips, you're going to have to rent them. Given that nobody is really renting them currently, at least I'm working to get you access to them at all.
Correct me if I'm wrong, but I don't think the US government can take over a business like that. They can take the land where we host the equipment, though if that happens, I guess I'd just move it somewhere else?
I did personally sign some documentation (EAR) saying that I wouldn't export the equipment to a whole list of countries and I definitely won't do that. I also won't rent to anyone in those countries either, just to cover my bases. I have high ethics and have no desire to get in any sort of trouble.
Exciting news. How's the software story on this? Currently we are on CUDA, etc. We have nice big AMD Epycs (1 GB of cache!) but use Nvidia for GPU. The MI300X at $40k would be tempting, but it's hard to justify with the porting effort required.
I would start by looking into ROCm/HIP to port the code. Some of it can be done automatically and some will need customization [0]. Then you rent time on my systems and see what the performance is. If it's good, you rent more, for longer, to bring the pricing down. Effectively, we take the capex/opex risk off your shoulders. The key point here is that we have actual availability; we aren't just reselling someone else's hardware.
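The "automatic" part of a HIP port is largely mechanical API renaming, which the real hipify-perl/hipify-clang tools perform over actual CUDA sources. A toy Python sketch of the idea (the mapping table here is a tiny illustrative subset, not the tool's real table):

```python
# Toy sketch of the textual CUDA -> HIP renaming that hipify tools automate.
# This mapping is a small illustrative subset of the real API correspondence.
CUDA_TO_HIP = {
    "cudaMalloc": "hipMalloc",
    "cudaMemcpy": "hipMemcpy",
    "cudaFree": "hipFree",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
}

def toy_hipify(source: str) -> str:
    """Rename known CUDA API calls to their HIP equivalents."""
    for cuda_name, hip_name in CUDA_TO_HIP.items():
        source = source.replace(cuda_name, hip_name)
    return source

print(toy_hipify("cudaMalloc(&ptr, n); cudaFree(ptr);"))
# -> hipMalloc(&ptr, n); hipFree(ptr);
```

In practice, hand-tuned kernels, inline PTX, and library calls (cuBLAS/cuDNN vs. rocBLAS/MIOpen) are where the manual customization work comes in.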
Thanks! Because we are just starting out, in order to stay lean, we are focused on the hardware side at this point. The capex/opex involved with deploying this level of compute is rather insane and I feel that it will be too distracting to solve both problems at the same time.
I do fully recognize that in order to build the developer flywheel, we need to solve both problems in the long run. For now, there are other great companies that are focused on solving the software side, such as NeuralFabric, EmbeddedLLM, Lamini and MK1. Our intention is to partner and be friends with all of them. There is a lot of room in the picks/shovels ecosystem for all of us.
IIRC, Intel peaked at 98% of the Data Center CPU Market. AMD released an arguably superior CPU almost 7 years ago, but Intel still has 70%.
You might be able to argue that MI300 is competitive, but it is certainly hard to argue it's superior, so we're not yet at the GPU equivalent of the 2017 release of EPYC.
Intel also charged a competitive price for CPUs; recent Nvidia chips carry a premium not seen in semiconductors for decades. I'd venture the margin on an H100 at $65k is probably ~98%. If the market for datacenter GPUs rises to ~30 million deployed GPUs and ~10 million new chips per year, that kind of margin would put gross profit in the hundreds of billions per year.
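Back-of-the-envelope, taking the price and margin above at face value (both are rough assumptions, and this deliberately ignores mix of cheaper SKUs, so treat it as an upper bound):

```python
# Back-of-the-envelope gross profit under the assumptions above.
price_per_gpu = 65_000            # assumed H100 price, USD
gross_margin = 0.98               # assumed gross margin
new_chips_per_year = 10_000_000   # assumed annual shipments

annual_gross_profit = price_per_gpu * gross_margin * new_chips_per_year
print(f"${annual_gross_profit / 1e9:.0f}B/year")  # prints "$637B/year"
```

On the order of $600B+ a year at these inputs, which is exactly the kind of pool that attracts competitors.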
I'd be skeptical that NVidia can maintain that margin, or price point. Had they owned ARM it may have been possible, as new entrants to the chip market could have been locked out.
Industry estimate is that Nvidia market share declines 75% in 2027-2028.
If the AI hype is real, then Nvidia revenue is $300 billion in 5 years. Assuming a P/E of 25, that would mean a $7.5 trillion valuation. That sounds insane.
If the hype is 1/3 true, Nvidia grows ~17% per year and is valued at $2.5 trillion in five years.
PS: Intel vs. AMD is not a completely symmetric competition, because AMD is fabless and Intel is not. While AMD gets better margins, Intel can produce volumes to match demand. AMD competes directly with Apple, Nvidia, ARM, and others for TSMC fab capacity (I know Intel also buys some manufacturing from TSMC).
But that's using a P/S multiple, not P/E. Your math makes sense if you assume Nvidia will have 25x P/S... which historically is very high, and unlikely to be the "correct" multiple after all of that growth has already occurred. Unless you expect them to continue growing at same CAGR in perpetuity.
Also, it's very unlikely they will maintain that kind of market share if they reach those valuations. A relatively simple software investment to improve integration with AMD/Intel/TPU offerings, when the prize is worth trillions, is a bit of a no-brainer. It wasn't a financial imperative until recently, so yes, all the competitor software here sucks, today.
Compute is fungible, despite what many say here. The cost differential between implementing good software support for AMD and paying Nvidia ~98% gross margin makes it kind of obvious.
A bull case could still see NVDA at a few trillion 5-10 years out; it's very unlikely it would go much beyond that. I fail to see any remotely rational, objective case for it, at least not one that warrants investing in Nvidia today, at the current valuation, versus other opportunities.
It can still be a multi-bagger in the short-term due to what looks like dotcom 2.0 enthusiasm.
Until Intel actually releases a discrete GPU on an Intel process, instead of the same TSMC processes everyone else is using, their fab advantage remains theoretical in this space. It is too bad they didn't have even a power-hungry discrete GPU built on one of their older processes available during the previous GPU shortage; they'd at least have a strong install base right now by simple virtue of being able to put out a large supply.
They are not fixing the software though, and they never do. You have this problem with every new generation, it launches broken and never gets fixed by the time it is obsolete. If you told me they fixed it, I would not even believe you.
It doesn't matter if you believe the first person who figures it out. Ten other people will try it and then there will be ten people saying they fixed it and a hundred people will try it. At some point you either believe everyone who has tried it or you lose business because your competitors have figured it out and you haven't.
Ten people saying “it works now” is not going to lead to ten times that many risking their time and money on AMD. “Works for me” is a meme for a reason.
The number of people who believe you is the average probability that someone believes you times the number of people you tell. In the Bronze Age this was really bad, because you had to travel on horseback and contacting people in the same industry in other cities was slow and burdensome. Now we have the internet, communicating with someone on the other side of the world takes 200ms, and new information spreads fast.
Meanwhile, those people know each other, so as soon as someone credible claims it works, several other credible people believe them, try it themselves, and say the same thing. It doesn't really matter if you think the multiplier is 10 or 100 or 1.2; it's exponential spread at the speed of the internet.
> Reputation is basically meaningless here. If they fix the software, people will figure that out in short order.
What matters is not just "the software is working", but also "the software will keep working" (for both new releases and new hardware). Reputation is meaningful for that.
That's not usually a problem for software, especially this kind of software. New drivers and libraries for existing hardware are typically bug fixes because otherwise people wouldn't install them instead of using the existing code. The code for new hardware is generally based on the existing code, so if the existing code is good then it's likely to remain that way. Also, if the new hardware isn't any good then you just don't buy it.
This is why anyone making long-term investments in hardware-related software is wise to write it in a vendor-agnostic way -- even if the software continues to be good you may have other reasons to want to switch hardware in the next generation.
One is: can I just buy this thing, run somebody else's existing code on it, and have it work and be faster than the previous generation? If not, you just don't buy the new hardware until that changes.
The other is, you're going to buy a thousand of them or you're making software for the general public to use and you want to optimize it for each specific generation. But in that case you're redoing the work for each generation regardless and you don't have to be concerned about your current efforts carrying forward because you already know that they won't.
This illustrates the power of building a software ecosystem that supports the hardware you are building.
Nvidia has committed to CUDA and created an ecosystem; this has been a long-term commitment. Meanwhile, Intel (new libraries/paradigms all the time) and AMD (software/drivers/libraries as an afterthought) have struggled in this arena.
Strangely, Intel manages to do this pretty well in other markets. For high-performance computing, for example, they have well-established libraries like MKL, a Fortran compiler, and a great C++ compiler. But when it comes to GPUs or accelerator cards they lack commitment, with most projects fizzling out.
Maybe some regulatory body could declare this an obvious 'monopoly' and at least force the 'language spoken' to be free for anyone to implement without patent worries?
Nvidia has a monopoly in the market, but unless they use their position illegally, that's OK. The GPU market is competitive, and Nvidia has put in the work that others like AMD or Intel never bothered to do.
While this has mostly held, there’s a big push to redefine anti-trust. That’s what the MSFT-Activision case was actually about beneath the surface narrative. The challenger jurisprudence lost, but it got pretty close and has lots of political support.
(see: Lina Khan’s management of the FTC, or “New Brandeis movement”)
The push is not to make monopoly (defined as market share) illegal.
Lina Khan and others are fighting the good fight against misuse of monopoly power, two-sided markets, anticompetitive effects of platform-based business models, and so on.
The GPU market is not a natural monopoly, nor is the market structured as a monopoly. Nvidia just has a competitive advantage.
OpenCL is only about a year younger than Cuda. Nobody stopped AMD from investing into it. But where Nvidia spent the last 16 years continuously investing into providing the best performance and development experience, AMD rarely invested into either. Instead they focused on video games, with some projects that helped them profit from the crypto hype but lacked the necessary commitment to achieve much of anything beyond that.
16 years ago AMD was on the verge of bankruptcy and was already smaller than Nvidia, despite being a company that makes both GPUs and CPUs and, at the time, had its own fabs. They had no money to invest in anything, and that didn't really change until Ryzen; you can practically pinpoint its release date on AMD's revenue chart.
It runs a bit deeper than money, though. If a company doesn't have software as part of their culture, no amount of money is going to turn that around until they make a conscious decision to do so.
AI has been the first technology to really kick them in the pants and get them to invest in that aspect. There are a ton of AI software job openings within AMD now; I'm seeing new ones posted almost daily on LinkedIn. It'll take time, but they will fix it.
I think this is the right answer: Nvidia struck a gold mine with CUDA. It's powerful enough not only for games, graphics, and creative work, but could also be easily retrofitted to other scenarios such as data center and AI/ML workloads (language and image processing). AMD's and Intel's software solutions seem pretty lacking, and it looks harder to get a full end-to-end AI/ML stack working efficiently on them.
Dominating a market is not enough for a monopoly declaration. You'd need to prove some sort of anticompetitive practice or similar abuse of that dominance.
It will take a minimum of 5-10 years to see a meaningful change, if it's not already too late.
Absolutely nothing will change about the context in two years, other than Nvidia will be far stronger at that point with $50+ billion per year in operating income (formally joining the hyper profitable US tech giants).
It's going to be extraordinarily expensive to build in the GPU data center space at scale. Nvidia will have the money to do it and most of their smaller competitors will not (that includes AMD, which doesn't generate anywhere near enough cash to outlay billions of dollars in a single high risk capital investment pointed at the GPU data center space; AMD's operating income over the last four quarters is negative, and they have a mere $3b in cash). Nvidia will use their new cash spigot to build infrastructure moats in the space. Large governments, Microsoft, Apple, Meta, Google, TSMC are the only entities capable of keeping up on spending.
There is a lot of space for competing on price, though. AMD or even Intel could release a datacenter GPU that's 3x slower than the H100 while using 2x more power and still sell it with a very decent margin. Of course, Nvidia would likely just cut prices to stay competitive and maintain 90%+ market share, but their obscene margins would have to come down.
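A quick performance-per-dollar sketch of that scenario (the H100 price and the unit cost here are illustrative assumptions, and power is ignored):

```python
# Sketch: a chip "3x slower" than an H100 can still win on perf-per-dollar
# if priced low enough. Prices are assumptions; power draw is ignored.
h100_price = 30_000   # assumed H100 price, USD
slowdown = 3          # rival delivers 1/3 the throughput

# For equal performance-per-dollar, the rival must cost at most 1/3 as much:
breakeven_price = h100_price / slowdown
print(breakeven_price)  # 10000.0

# Even at that price there can be a healthy margin if unit cost is, say, $3k:
unit_cost = 3_000       # assumed manufacturing cost
gross_margin = (breakeven_price - unit_cost) / breakeven_price
print(gross_margin)     # 0.7 -> 70% gross margin
```

That 70% is still far below an assumed ~98% margin on the incumbent part, which is the sense in which a slower, cheaper chip squeezes Nvidia's pricing.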
Investments have diminishing returns. Once you've spent a threshold amount of money, more is just incremental gains, until you're not even getting more out of it than you could by sticking the money in the S&P 500.
It's also weird that you're looking at operating income, which already has R&D expenses subtracted out of it. A company spending more of its revenue on R&D will have a lower operating income, but that hardly implies they can't spend on R&D -- because they are.
And customers hate moats. In many cases they get stuck with them, but most often this is individual consumers with no resources to do anything about it. Data center customers are large institutions, often with their own R&D budgets. Three of the companies that each have a larger market cap than Nvidia are Amazon, Google and Microsoft, the three largest cloud providers. Is any of them interested in letting Nvidia have a moat?
> It will take a minimum of 5-10 years to see a meaningful change, if it's not already too late.
It isn't too late and I'm betting it will be closer to 5 than 10.
> It's going to be extraordinarily expensive to build in the GPU data center space at scale.
It already is expensive. But it isn't just NVIDIA doing it. CoreWeave is the largest and certainly backed by NVIDIA, but they are a separate business. AMD doesn't have to do it on their own either.
Geohot is a baby Musk. He'll eventually find an area where he can put his brain to ideal use and change the world a bit. He's obviously sharp but hasn't found the niche that works for him. Musk is obviously a transportation guy. Maybe Hotz will end up in infra?
We need more anti-anticompetitive action. Why should NVidia and those who use their products be punished because their competitors have not stepped up to the plate?
Yeah, Nvidia could lose market share (likely slowly), but the datacenter GPU TAM is growing significantly, so the tide is likely lifting all boats, with Nvidia's rising fastest.
That is great, but did anyone ask the programmers about it? Or did someone else see the higher FLOPs per dollar and buy in? If your job is literally programming a supercomputer and you can get AMD to fix stuff for you, maybe it ain't so bad. That is not where most software comes from, though.