Analysts estimate Nvidia owns 98% of the data center GPU market (extremetech.com)
67 points by giuliomagnifico on Feb 1, 2024 | 83 comments


On Dec 6th, AMD released the MI300X. This is effectively their entry into the AIA market (the AI accelerator is the new GPU). The previous-generation (MI250) cards were good, but this new release leapfrogs everything. These cards offer 192 GB of RAM and pretty great performance, with 8 cards in a chassis. Great for training and inference.

Sure, we all know that AMD's software needs work, but I think the long bet here is on Lisa. It reminds me of the Mac vs. Windows days. If NVIDIA owns this much of the market, I'd want to decentralize my AI business away from a single point of failure.

So I've been working to launch a business that buys up as many of these chips as I can and makes them available as bare-metal rentals, even with IPMI access. This is something that hasn't been done before. Usually these AMD cards end up in supercomputers (e.g., Frontier) and/or are only accessible to a few people. Azure is catching on with their recent product, but that is a big cloud... I believe people will still want private clouds.

To start with, we're loading up the MI300X chassis with tons of RAM/NVMe, top-end AMD CPUs, and dual 400G networking. We've also got top-end management servers for Ray/k8s/Slurm. We are interested in feedback on what people want, and we can be agile enough to customize our purchases for your needs.

My background is that I built a cluster of 150k AMD GPUs across 7 data centers. Deploying and running a lot of compute is something I've gotten very good at. Feel free to reach out.


"needs work" is an understatement. ROCM only just came out for windows(!) and so far everyone in the LLM and SD scene says that ROCM on linux is a pain in the ass to get working, requires a lot of hackery and override environment variables, is buggy, and also only supports a small handful of cards - whereas CUDA runs on nearly everything NVIDIA makes, from laptop GPUs to datacenter compute cards. And there's very little support on the third party software side at the moment.

That "no consumer hardware support" sounds stupid, but people getting into ML/AI, grad students, etc want to be able to mess around and develop/prototype on local hardware they already own.

You cannot do that with ROCm, because they support only a very, very small number of consumer-level cards, and, very oddly, it's a mix of the very high end and the low end, with nothing in the middle.

AMD is also massively behind hardware-wise: only the current 7xxx-series cards that just came out have AI-specific hardware, whereas NVIDIA has had tensor cores for three generations of cards.

But of course AMD can't do anything right, so they nerfed the GPU's graphics processing cores when they added the AI hardware, making the cards a worse deal from a pure gaming standpoint. The 7000-series cards are at best a tiny bump over, and in many games worse than, their 6000-series equivalents. Their only advantages are better hardware video encoding and somewhat better power consumption.


All of what you're saying is true and I can't argue that.

The only difference is that previously you couldn't even rent time on these super-high-end AIAs. They were reserved for supercomputers only. Even the MI250X SKU is government/research-only, and even I am unable to buy that over the MI250.

So we need to work on building that developer flywheel. My view is that if you can rent some time on one of these systems, that is a good first step. Nowhere else can you load a model into 192 GB of RAM on a top-end system.
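
Napkin math on why 192 GB on a single card matters (the parameter count and precision here are my own illustrative assumptions):

    # Does a 70B-parameter model fit on one 192 GB card in fp16?
    params = 70e9
    bytes_per_param = 2                          # fp16
    weights_gb = params * bytes_per_param / 1e9
    print(weights_gb)                            # 140.0 GB of weights
    print(192 - weights_gb)                      # ~52 GB left for KV cache/activations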


For AI you have to support PyTorch. That works well now on the big AMD GPUs. Consumer GPU capabilities do not matter here.
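
If anyone wants to sanity-check that, the ROCm build of PyTorch exposes AMD GPUs through the familiar CUDA-style API (a quick probe; outputs will obviously vary by machine):

    import torch

    # torch.version.hip is set on ROCm builds of PyTorch (None on CUDA builds),
    # and AMD devices show up under the usual torch.cuda namespace.
    print(torch.version.hip)
    print(torch.cuda.is_available())
    if torch.cuda.is_available():
        print(torch.cuda.get_device_name(0))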


I think PyTorch is part of the puzzle, and it certainly helps that it is supported by AMD [0]. That said, there is code that needs to run closer to the metal too.

[0] https://pytorch.org/blog/experience-power-pytorch-2.0/


>> We are interested in feedback on what people want and can be agile enough to customize our purchases for your needs.

I would like you to not buy pallets of these chips, so that I and others may be able to get our hands on a couple for our own personal projects.


It was very difficult for me to get access to these chips, and I'm building a whole business around it. It won't be possible for individuals for a long, long time, if ever. Maybe in a few years on the secondary market.

Case in point: try searching eBay for the previous generation, the MI250.

If you want access to these chips, you're going to have to rent them. Given that nobody is really renting them currently, at least I'm working to get you access to them at all.


Was it difficult though because you lack the credentials of an established large corporation, or because there is very low stock at this point?


If NVIDIA has 98% of the market, I don't think stock is an issue. ;-)

The chip industry runs on relationships.


What’s to stop the government from taking your business from you for national security reasons?


Correct me if I'm wrong, but I don't think the US government can take over a business like that. They can take the land where we host the equipment, though if that happens, I guess I'd just move it somewhere else?

I did personally sign some documentation (EAR) saying that I wouldn't export the equipment to a whole list of countries and I definitely won't do that. I also won't rent to anyone in those countries either, just to cover my bases. I have high ethics and have no desire to get in any sort of trouble.


He's not in China


Very very true.


Exciting news. How's the software story on this? Currently we are on CUDA, etc. We have nice big AMD Epycs (1 GB of cache!) but use Nvidia for GPU. The MI300X at $40k would be tempting, but hard to justify with the porting effort required.


I would start by looking into ROCm/HIP to port the code. Some of it can be done automatically and some will need customization [0]. Then you rent time on my systems and see what the performance is. If it's good, you rent more, for longer, to bring the pricing down. Effectively, we are taking the capex/opex risk off your shoulders. The key point here is that we have actual availability; we aren't just reselling someone else's hardware.

Dual 9754's are indeed quite nice. ;-)

[0] https://github.com/ROCm/HIPIFY
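
As a sketch of what a first pass with the linked tool looks like (assuming hipify-perl from that repo is on PATH and your CUDA sources live under src/; real ports usually need manual cleanup afterwards):

    # Rough first-pass port: run hipify-perl over every CUDA source file.
    # hipify-perl rewrites CUDA API calls (cudaMalloc -> hipMalloc, kernel
    # launch syntax, etc.) and prints the converted source to stdout.
    import pathlib
    import subprocess

    for cu in pathlib.Path("src").rglob("*.cu"):
        out = cu.with_name(cu.stem + "_hip.cpp")
        with open(out, "w") as f:
            subprocess.run(["hipify-perl", str(cu)], stdout=f, check=True)
        print(f"{cu} -> {out}")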


Thank you for the link and advice. We're using 9684X for the cache. But dual 9754 does sound good!


Oh yeah, great chip as well. That cache certainly is expensive! Can I ask what the use case for the cache is?


We use the chip for general-purpose work, mostly HFT simulation code; it's empirically faster.

The GPUs are to train a model. It's a separate use case, but we use the same hardware for both use cases, hence the big cache.


That's awesome! What support are you giving for the rewriting that has to happen for things to run on ROCm (or whatever) instead of CUDA?


Thanks! Because we are just starting out, in order to stay lean we are focused on the hardware side at this point. The capex/opex involved in deploying this level of compute is rather insane, and I feel it would be too distracting to solve both problems at the same time.

I do fully recognize that in order to build the developer flywheel, we need to solve both problems in the long run. For now, there are other great companies that are focused on solving the software side, such as NeuralFabric, EmbeddedLLM, Lamini and MK1. Our intention is to partner and be friends with all of them. There is a lot of room in the picks/shovels ecosystem for all of us.


IIRC, Intel peaked at 98% of the data center CPU market. AMD released an arguably superior CPU almost 7 years ago, but Intel still has 70%.

You might be able to argue that MI300 is competitive, but it is certainly hard to argue it's superior, so we're not yet at the GPU equivalent of the 2017 release of EPYC.


Intel also charged a competitive price for CPUs; recent Nvidia chips are priced at a premium not seen in semiconductors for decades. I'd venture the margin on an H100 at $65k is probably ~98%. If the market for datacenter GPUs rises to ~30 million deployed GPUs and ~10 million new chips per year... that would be $192 billion in profit per year.

I'd be skeptical that Nvidia can maintain that margin, or that price point. Had they owned ARM it might have been possible, as new entrants to the chip market could have been locked out.


> I'd venture the margin on an H100 @65k is probably ~98%.

You are not far off.

Nvidia spends $3,320 to manufacture each H100 unit, according to an analysis by Raymond James Financial.
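
Working that out against the numbers in this thread (the $30k street price is my own assumption; the $65k figure is from upthread):

    # Implied gross margin on an H100, manufacturing cost only
    # (ignores R&D, software, support, etc.).
    cost = 3_320
    for price in (30_000, 65_000):
        print(price, round((price - cost) / price, 3))
    # 30000 0.889  -> ~89% at a commonly cited street price
    # 65000 0.949  -> ~95% at the $65k figure used upthread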


The industry estimate is that Nvidia's market share declines 75% in 2027-2028.

If the AI hype is real, then Nvidia's revenue is $300 billion in 5 years. Assuming a P/E of 25 would mean a $7.5 trillion valuation. That sounds insane.

If the hype is 1/3 true, Nvidia grows 17% per year and is valued at $2.5 trillion in five years.

P.S. Intel vs. AMD is not a completely symmetric competition, because AMD is fabless and Intel is not. While AMD gets better margins, Intel can produce volumes to match demand. AMD competes directly with Apple, Nvidia, ARM, etc. for TSMC fab capacity (I know Intel also buys manufacturing from TSMC).


P/E is a ratio on earnings, not sales.

25 * $300B (revenue) = $7.5T, yes.

But that's using a P/S multiple, not P/E. Your math makes sense if you assume Nvidia will have a 25x P/S... which historically is very high, and unlikely to be the "correct" multiple after all of that growth has already occurred, unless you expect them to continue growing at the same CAGR in perpetuity.
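
To make the distinction concrete (the net margin here is a placeholder assumption, not a forecast):

    # P/E multiplies earnings; P/S multiplies revenue.
    revenue = 300e9
    net_margin = 0.50                 # hypothetical
    earnings = revenue * net_margin
    print(25 * earnings / 1e12)       # P/E of 25 -> $3.75T
    print(25 * revenue / 1e12)        # P/S of 25 -> $7.5T (the parent's figure)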

Also, it is very unlikely they will maintain that kind of market share if they reach those valuations. A relatively simple software solution improving integration with AMD/Intel/TPU offerings, worth trillions, is a bit of a no-brainer. It wasn't a financial imperative until recently, so yes, all the competitor software sucks today.

Compute is fungible, despite what many here say. The cost differential between implementing good software support for AMD and paying Nvidia's ~98% gross margin makes the choice kind of obvious.

A bull case could still see NVDA at a few trillion 5-10 years out; it is very unlikely it would go much beyond that. I fail to see any remotely rational, objective case for it, at least not one that warrants investing in Nvidia today, at the current valuation, versus other opportunities.

It can still be a multi-bagger in the short term due to what looks like dotcom 2.0 enthusiasm.


Until Intel actually releases a discrete GPU on an Intel process, instead of the same TSMC capacity everyone else is using, their fab advantage remains theoretical in this space. It is too bad they didn't have even a power-sucking discrete GPU built on one of their older processes during the previous GPU shortage; they'd at least have a strong install base right now by simple virtue of being able to put out a large supply.


Intel vs. AMD was about CPUs, not GPUs. Intel still has >70% market share in server CPUs, where it once had almost 99%.


Intel never had a software moat. AMD has a terrible reputation for software that will be hard to shake.


Reputation is basically meaningless here. If they fix the software, people will figure that out in short order. They just have to actually do it.


They are not fixing the software, though, and they never do. You have this problem with every new generation: it launches broken and never gets fixed by the time it is obsolete. If you told me they fixed it, I would not even believe you.


It doesn't matter if you believe the first person who figures it out. Ten other people will try it and then there will be ten people saying they fixed it and a hundred people will try it. At some point you either believe everyone who has tried it or you lose business because your competitors have figured it out and you haven't.


Ten people saying “it works now” is not going to lead to ten times that many risking their time and money on AMD. “Works for me” is a meme for a reason.


The number of people who believe you is the average probability that someone believes you times the number of people you tell. In the Bronze Age this was really bad, because you had to travel on horseback, and contacting people in the same industry in other cities was slow and burdensome. Now we have the internet, and communicating with someone on the other side of the world takes 200ms, so new information spreads fast.

Meanwhile, those people know each other, so as soon as someone credible claims it works, several other credible people believe them, try it themselves, and say the same thing. It doesn't really matter whether you think the multiplier is 10 or 100 or 1.2; it's exponential spread at the speed of the internet.



> Reputation is basically meaningless here. If they fix the software, people will figure that out in short order.

What matters is not just "the software is working", but also "the software will keep working" (for both new releases and new hardware). Reputation is meaningful for that.


That's not usually a problem for software, especially this kind of software. New drivers and libraries for existing hardware are typically bug fixes because otherwise people wouldn't install them instead of using the existing code. The code for new hardware is generally based on the existing code, so if the existing code is good then it's likely to remain that way. Also, if the new hardware isn't any good then you just don't buy it.

This is why anyone making long-term investments in hardware-related software is wise to write it in a vendor-agnostic way -- even if the software continues to be good you may have other reasons to want to switch hardware in the next generation.
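
In PyTorch this mostly comes for free, since ROCm builds reuse the CUDA device namespace (a minimal pattern, nothing vendor-specific in it):

    import torch

    # Pick whatever accelerator is present. ROCm devices also report as
    # "cuda" in PyTorch, so this one line covers Nvidia and AMD alike.
    device = "cuda" if torch.cuda.is_available() else "cpu"

    model = torch.nn.Linear(1024, 1024).to(device)
    x = torch.randn(8, 1024, device=device)
    y = model(x)
    print(y.shape, y.device)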


I don't believe this holds for a competitive numeric software stack, which is what AI applications are.

1. Code must be hand-optimized and refactored for each architecture while the API stays the same.

2. New APIs and algorithms must be backported and optimized for older hardware.

That's an insane amount of testing and performance tuning.


So there are two things you might want here.

One is: can I just buy this thing and run somebody else's existing code on it, and it will work and be faster than the previous generation? If it isn't, you just don't buy the new hardware until that changes.

The other is, you're going to buy a thousand of them or you're making software for the general public to use and you want to optimize it for each specific generation. But in that case you're redoing the work for each generation regardless and you don't have to be concerned about your current efforts carrying forward because you already know that they won't.


Intel is very good at convincing you to stay with them, essentially subsidizing your servers so you don't get used to working with AMD.


This illustrates the power of building a software ecosystem that supports the hardware you are building.

Nvidia has committed to CUDA and created an ecosystem. This has been a long-term commitment. Meanwhile, Intel (new libraries/paradigms all the time) and AMD (software/drivers/libraries as an afterthought) have struggled in this arena.


Strangely, Intel manages to do this pretty well in other markets. For high-performance computing, for example, they have well-established libraries like MKL, a Fortran compiler, and a great C++ compiler. But when it comes to GPUs or accelerator cards they lack commitment, with most projects fizzling out.


It would be nice if...

Maybe some regulatory body could declare this an obvious 'monopoly' and at least force the 'language spoken' to be free for anyone to implement without patent worries?


Monopolies are not illegal per se.

Misusing monopoly power is illegal.

Nvidia has a monopoly in the market, but unless they use their position illegally, that's OK. The GPU market is competitive, and Nvidia has put in the work that others like AMD or Intel never bothered to do.


This is it. Having a monopoly is OK. Being anticompetitive is when you run into trouble, even potentially without a monopoly.


Likely part of the reason for the move to merge their drivers into the Linux kernel, etc.


Putting drivers in the Linux kernel isn't anticompetitive. Saying they're the only drivers allowed in the Linux kernel would be anticompetitive.


Putting them in an open-source OS also helps demonstrate that they aren't trying to be anticompetitive.


While this has mostly held, there's a big push to redefine antitrust. That's what the MSFT-Activision case was actually about beneath the surface narrative. The challenger jurisprudence lost, but it got pretty close and has lots of political support.

(see: Lina Khan’s management of the FTC, or “New Brandeis movement”)


The push is not to make monopoly (defined as market share) illegal.

Lina Khan and others are fighting the good fight against misuse of monopoly power, two-sided markets, the anticompetitive effects of platform-based business models, and so on.

The GPU market is not a natural monopoly, nor is the market structured as a monopoly. Nvidia just has a competitive advantage.


Legal? Sure.

OK? I sincerely disagree.


What about Alcoa?


OpenCL is only about a year younger than CUDA. Nobody stopped AMD from investing in it. But where Nvidia spent the last 16 years continuously investing in providing the best performance and development experience, AMD rarely invested in either. Instead they focused on video games, with some projects that helped them profit from the crypto hype but lacked the commitment necessary to achieve much of anything beyond that.


16 years ago AMD was on the verge of bankruptcy and was already smaller than Nvidia, despite being a company that makes both GPUs and CPUs and, at the time, had its own fabs. They had no money to invest in anything, and that didn't really change until Ryzen. You may be able to identify its release date from this graph:

https://www.macrotrends.net/stocks/charts/AMD/amd/market-cap

Now they have to take the money and use it to fix their software. It's pretty obvious why they didn't do that in 2008.


It runs a bit deeper than money, though. If a company doesn't have software as part of its culture, no amount of money is going to turn that around until they make a conscious decision to do so.

AI has been the first technology to really kick them in the pants and get them off their asses to invest in that aspect. There are a ton of AI software job openings within AMD now; I'm seeing new ones posted almost daily on LinkedIn. It'll take time, but they will fix it.


I think this is the right answer: Nvidia struck a gold mine with CUDA. It's powerful enough not only for games, graphics, and creative work, but could somehow also be easily retrofitted to other scenarios such as data center and AI/ML (language and image processing) workloads. AMD's and Intel's software solutions seem to be pretty lacking, and it looks harder to get a full end-to-end AI/ML stack working efficiently on them.


They offered what people were looking for: a way to program their GPU that wasn't C.

C++, Fortran, anything else with a compiler toolchain targeting PTX, graphical debugging and IDE integration, and a library ecosystem.

Khronos was sceptical that Fortran was even meaningful, and SPIR only came up after NVIDIA was approaching the finish line.

Intel and AMD only have themselves to blame.


Dominating a market is not enough for a monopoly declaration. You'd need to prove some sort of anticompetitive practice or similar abuse of that dominance.


It would be nice if AMD had gotten their act together a couple of years ago and written good software and drivers, you mean?

This lead won't last forever. Everyone in the industry has a vested interest in this collapsing, and there are ongoing assaults from all angles.

George Hotz' new company is working to undo this, and there are lots of others. All potential AMD acquisition targets, too.

There are lots of little OpenCL-like projects, and consensus is starting to form.

Then there's the growing TPU market...

Give it two years.


It will take a minimum of 5-10 years to see a meaningful change, if it's not already too late.

Absolutely nothing will change about the context in two years, other than Nvidia will be far stronger at that point with $50+ billion per year in operating income (formally joining the hyper profitable US tech giants).

It's going to be extraordinarily expensive to build in the GPU data center space at scale. Nvidia will have the money to do it and most of their smaller competitors will not (that includes AMD, which doesn't generate anywhere near enough cash to outlay billions of dollars in a single high risk capital investment pointed at the GPU data center space; AMD's operating income over the last four quarters is negative, and they have a mere $3b in cash). Nvidia will use their new cash spigot to build infrastructure moats in the space. Large governments, Microsoft, Apple, Meta, Google, TSMC are the only entities capable of keeping up on spending.


There is a lot of space for competing on price, though. AMD or even Intel could release a datacenter GPU that's 3x slower than the H100 while using 2x more power and still sell it at a very decent margin. Of course, that likely means Nvidia just cuts prices to stay competitive and maintains 90%+ market share, but their obscene margins would have to come down.
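
Napkin math on when the slower card still wins (every number below is hypothetical, just to show the shape of the argument):

    # Perf per dollar for a hypothetical 3x-slower but much cheaper card.
    h100_price, h100_perf = 30_000, 1.0      # normalize the H100 to 1.0
    rival_price, rival_perf = 8_000, 1 / 3   # "3x slower", priced near cost
    print(h100_perf / h100_price)            # ~3.3e-05 perf per dollar
    print(rival_perf / rival_price)          # ~4.2e-05, i.e. cheaper compute
    # The 2x power draw eats into this in TCO terms, but at these prices
    # electricity is a second-order cost over a 3-5 year depreciation window.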


Investments have diminishing returns. Once you've spent a threshold amount of money, more spending brings only incremental gains, until you're not even getting more out of it than you would by sticking the money in the S&P 500.

It's also weird that you're looking at operating income, which already has R&D expenses subtracted out of it. A company spending more of its revenue on R&D will have a lower operating income, but that hardly implies they can't spend on R&D -- because they are.

And customers hate moats. In many cases they get stuck with them, but most often this is individual consumers with no resources to do anything about it. Data center customers are large institutions, often with their own R&D budgets. Three of the companies that each have a larger market cap than Nvidia are Amazon, Google and Microsoft, the three largest cloud providers. Is any of them interested in letting Nvidia have a moat?


> It will take a minimum of 5-10 years to see a meaningful change, if it's not already too late.

It isn't too late and I'm betting it will be closer to 5 than 10.

> It's going to be extraordinarily expensive to build in the GPU data center space at scale.

It already is expensive. But it isn't just NVIDIA doing it. CoreWeave is the largest and certainly backed by NVIDIA, but they are a separate business. AMD doesn't have to do it on their own either.


> George Hotz' new company is working to undo this

Is that before or after George Hotz solves homelessness and drug addiction and saves democracy? (https://news.ycombinator.com/item?id=39206959)


Geohot is a baby Musk. He'll eventually find an area where he can put his brain to ideal use and change the world a bit. He's obviously sharp but hasn't found the niche that works for him. Musk is obviously a transportation guy. Maybe Hotz will end up in infra?


Reading about his 8-ish hour stint at post-Musk-acquisition Twitter was really funny.


Yeah, it was pretty clear that wasn't gonna work.


Nah, that might offshore the dominance.


So they'd give up the US market? That doesn't seem likely to happen.


Or it would be nice if the regulatory body would fuck off


We need more anti-monopoly action, not less.


We need more anti-anticompetitive action. Why should NVidia and those who use their products be punished because their competitors have not stepped up to the plate?


No one is being punished here. Where did anyone talk about prison or criminal charges?

Company profits are not the top priority; if they suffer so that actually important values are preserved, there's no problem with that at all.


Exactly what Xirgil said, Nvidia shouldn't be punished for being good at their job


When 2.5D and 3D packaging is the bottleneck, which companies produce the CoWoS materials and equipment?


As for their market share, it’s only down from here.


As for the market itself, however, it sure looks like it's going up from here.


Yeah, NVidia could lose market share (likely slowly), but the data center GPU TAM is growing significantly, so it's likely lifting all boats, just NVidia's faster than its competitors'.


AMD investors must be salivating at the prospect of taking some of that market share.


AMD should have laid the groundwork a decade ago; instead they basically gave up. Who needs GPGPU anyway, right?


> Who needs GPGPU anyway, right?

The world's largest computer, Frontier at Oak Ridge National Laboratory, runs AMD GPUs. AMD is indisputably the #2 in the GPU space.

https://www.top500.org/lists/top500/2023/11/


That is great, but did anyone ask the programmers about it? Or did someone just see the higher FLOPS per dollar and buy in? If your job is literally programming a supercomputer and you can get AMD to fix things for you, maybe it ain't so bad. That is not where most software comes from, though.


It creates demand for the software to be fixed, though, which helps with the chicken-and-egg problem.


That's why I'm building my own AMD supercomputer, for rent. Let's build the developer flywheel!



