Reasonable GPUs
72 points by frognumber on Nov 26, 2023 | 48 comments
What is the status of GPUs for general compute?

Last I looked, NVidia worked well, and AMD was horrible. Right now, it looks like the major limiting factor (if you don't care about a ≈3x difference in performance, which I don't) is RAM. More is better, and good models need >10GB, while LLMs can be up to 350GB.

* Intel Arc A770 has 16GB for <$300. I have no idea about compatibility with Hugging Face, Blender, etc.

* NVidia 4060 Ti has 16GB for <$500. 100% compatible with everything.

* Older NVidia (e.g. Pascal era) can be had with 24GB for <$300 used, without a graphics port. Not clear how CUDA compute capability lines up to what's needed for modern tools, or how well things work without a graphics port.

* Several cards may or may not work together. I'm not sure.

Is there any way to figure this stuff out, and what's reasonable / practical / easy? Something which explains CUDA compute levels, vendor compatibility, multi-card compatibility, and all that jazz. It'd be nice to have a generic enough guide to understand both pro and amateur use, e.g.:

- A770 x21, if someone got it working, could handle Facebook's OPT-175B for <$10k via Alpa. That brings it into "rich hobbyist" or "justifiable business expense" range. Not clear if that's practical.

- It would be much easier for kids to learn AI if the hardware were cheaper (e.g. A770)

- "General compute" also includes things like Blender or accelerating rendering in kdenlive, etc.

- Etc.

This stuff is getting useful to a broader and broader audience, but it's confusing.
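For a rough sense of where numbers like 350GB come from: memory for the weights is roughly parameter count times bytes per parameter. A minimal sketch, assuming fp16 weights (2 bytes each), decimal gigabytes, and ignoring activations, KV cache, and framework overhead. The function names are made up for illustration.

```python
import math

def weight_memory_gb(num_params: float, bytes_per_param: float = 2.0) -> float:
    """Approximate memory for model weights only, in decimal GB."""
    return num_params * bytes_per_param / 1e9

def cards_needed(num_params: float, vram_gb_per_card: float,
                 bytes_per_param: float = 2.0) -> int:
    """Minimum number of cards whose combined VRAM holds the weights alone."""
    return math.ceil(weight_memory_gb(num_params, bytes_per_param) / vram_gb_per_card)

# A 175B-parameter model in fp16, spread over 16GB cards.
print(weight_memory_gb(175e9), "GB of weights")
print(cards_needed(175e9, 16), "cards at 16GB each")
```

By this estimate, 21 cards at 16GB (336GB) land slightly short of fp16 weights for a 175B model, so a setup like that would likely need some quantization or offloading.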



It really depends on what you're trying to do.

This is sorta _the_ guide on GPUs for DL and has a great decision tree https://timdettmers.com/2023/01/30/which-gpu-for-deep-learni...

Personally, I'm limited to an RTX 2080 for my personal projects at the moment, and I find the constraint pretty rewarding. It forces me to find alternatives to the huge models, and you'd be surprised what you can eke out when you pour in the time to tweak models. Of course, good data is also paramount.


nvidia specific, best write up I know: https://timdettmers.com/2023/01/30/which-gpu-for-deep-learni...

Across vendors, generally, Nvidia still dominates currently. People are adding more support into ML libraries for other vendors via (second-class imo) alternate backends but expect to be patient if you're waiting for the day when there is healthy competition.

IMO, I'd say: if you can save up for it, get a 4090; if you can save up for half a 4090, get a 3090 - seen many going for 600-800 now. If you can save up for half a 3090, I'm not sure - depends on if you prefer speed or VRAM. If it were me, I'd pick more VRAM first.

re: compute capability, you can see here:

- which GPUs have what cc: https://developer.nvidia.com/cuda-gpus

- what cc comes with what features: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index....

I think the main qualitative change (beyond bigger numbers in the spec) for an enduser of machine learning libraries from 8.6 -> 8.9 (ie 3090 -> 4090) is this line:

> 4 mixed-precision Fourth-Generation Tensor Cores supporting fp8, fp16, __nv_bfloat16, tf32, sub-byte and fp64 for compute capability 8.9 (see Warp matrix functions for details)

ie new precisions will be builtin to eg pytorch with hw-level/tensor core support

edit: btw you probably ought to stick to a consumer gpu (ie not professional) if you want it to be generally versatile while also easy to use at home.
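To make the compute capability point concrete, here is a small sketch that maps a (major, minor) capability to a few precision features. The thresholds are my reading of NVIDIA's public docs (fp16 tensor cores from 7.0, bf16/tf32 from 8.0, fp8 tensor cores from 8.9), so treat them as assumptions. The helper names are hypothetical; only `torch.cuda.is_available` and `torch.cuda.get_device_capability` are real PyTorch calls, and they are used only if PyTorch is installed.

```python
def tensor_core_features(major: int, minor: int) -> dict:
    """Map a compute capability to a few precision features (assumed thresholds)."""
    cc = (major, minor)
    return {
        "fp16_tensor_cores": cc >= (7, 0),  # Volta and newer
        "bf16_tf32": cc >= (8, 0),          # Ampere and newer
        "fp8_tensor_cores": cc >= (8, 9),   # Ada (e.g. 4090) and newer
    }

def local_capability():
    """Return (major, minor) for GPU 0, or None if PyTorch/CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)

cc = local_capability()
print(cc, tensor_core_features(*cc) if cc else "no CUDA device visible")
```

Hardware support is only half of it: a library still has to ship kernels for a given precision before you can actually use it.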


What about 4060 ti 16gb? It was released after this guide, costs ~500eur and is a bit faster, newer (and a lot more efficient) than a 3060


Wasn't this the one deliberately choked by a narrow memory bus to prevent decent non-gaming workload performance?


This. If you insist on being as cheap as possible, shoot for a 12gb card but be aware that you'll be missing out on the throughput of higher-end models. The 3060 is popular for this I think, but you'll probably want a better card with more CUDA cores to max out performance.

Cards like the A770 are awesome, but barely even support raster drivers on DirectX. Your best bang-for-buck options are going to be Nvidia-only for now, with a few competing AMD cards that have fast-tracked Pytorch support.


I purchased a 3060 specifically for the 12gb of memory last November and I've been able to run llama, alpaca, and stable diffusion out of the box without ever having any memory issues. Training usually runs overnight, a stable diffusion image will render in ~5 seconds, and llama will do 20 tokens/second.

I would say start with the 3060 for 250 bucks, and if you're still loving it after a couple months, drop 10x more on a quadro.

My only word of advice is to get docker set up and install the nvidia docker toolkit to pass your gpu through to docker images -- the package management for all these python ai tools is a hell-scape, especially if you want to try a bunch of different things.


Thank you. This is super-helpful.

> re: compute capability, you can see here:

My key question is much more pragmatic:

1) If I grab a random model from Hugging Face, will it accelerate?

2) If I run Blender, kdenlive, or DaVinci Resolve, will it accelerate?

Is there a line where things break?

I definitely prefer more VRAM to more speed. As an occasional user, speed doesn't really matter. Things working does.


> If I grab a random model from Hugging Face, will it accelerate?

Probably, it depends more on how you configure the inferencing software. Most software that supports acceleration starts with CUDA or CUBLAS, so you should be good.

> If I run Blender, kdenlive, or DaVinci Resolve, will it accelerate?

Yep. If you're running Linux, some distros might be a little iffy about shipping the proprietary/accelerated versions of this software, but most are fine. The Flatpak versions should all have Nvidia acceleration working out-of-box, if you do encounter any issues.

> Is there a line where things break?

Yes, but you can avoid it by choosing smaller quantizations and giving yourself a few gigs of VRAM headroom. In my experience, it's always better to select a model smaller than you need so you're not risking an OOM crash (I've got a 3070ti).
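The headroom rule above can be sketched as a quick check. This assumes weight size is simply parameters times nominal bits per weight (real quantization formats carry some extra overhead) and a fixed 2GB headroom, which is an arbitrary placeholder. The 8GB figure is the 3070 Ti's VRAM, and the function name is made up.

```python
def fits_in_vram(num_params: float, bits_per_weight: float,
                 vram_gb: float, headroom_gb: float = 2.0):
    """Return (fits, weights_gb) for a quantized model, counting weights only."""
    weights_gb = num_params * bits_per_weight / 8 / 1e9
    return weights_gb + headroom_gb <= vram_gb, weights_gb

# A 7B-parameter model on an 8GB card at different quantization levels.
for bits in (16, 8, 4):
    ok, size = fits_in_vram(7e9, bits, vram_gb=8.0)
    print(f"7B @ {bits}-bit: {size:.1f} GB of weights, fits with headroom: {ok}")
```

Longer contexts eat more memory on top of the weights, so the headroom you need in practice depends on how you run the model.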

Lotta other great advice in this thread, though! Good luck picking something out.


For Blender, you can actually check crowd-sourced public benchmarks for basically any CPU and GPU you want to compare https://opendata.blender.org/

That site is a goldmine for perf benchmarks, I actually use that site if I want to do a rough comparison of GPU performance across models for 3D / animation / gaming uses. Even though that is Blender specific, I'm pretty confident the results apply to any usage in the same class of applications.


For Blender, you must carefully read the requirements of your software.

Unfortunately, only NN software is more or less standardized, so in many cases you can choose whatever best fits your budget. Other software, though, can be tightly coupled not just to one brand but to one model. For example, I've seen software that works on an Nvidia 960; I'm not sure about the 1060; it doesn't work on the 2060 (for some reason, developers avoid this series).


Also remember that Nvidia prohibited virtualization of its hardware on all gaming cards (only the professional lines are allowed) and even pushed virtual machine vendors to drop support for Nvidia gaming cards (for example, Xen has an official statement on this).

But AMD and Intel have not followed Nvidia in this controversy, and all their officially supported cards can work in a virtual environment.

This is not an insurmountable issue (for example, you could use old drivers or independent open-source ones), but in some cases it can be very annoying.


To be honest, the best idea for most people is probably just any GPU that you can easily afford and then rent a big iron GPU.

There is almost no way you will make back the $5k for a 40GB+ ram card, so just save yourself all the hassle and go for something that ticks all the rest of your boxes.

Non-CUDA cards may be ok if you have very simple requirements, but I'd expect many hours of debugging if you want something that's not ready to go out of the box.


I agree. Availability is a pain in the ass, which might be a dealbreaker for urgent interactive use cases, but a 48GB A6000 on LambdaLabs is $0.80/hr [1]. A newer 80GB H100 is $1.99/hr, so especially if you're trying to do batch processing and can script a bot to wait for availability, it's often a much better option.

With that aforementioned A6000 ($5k retail) you'd have to use it for at least six thousand hours to break even on the cloud cost.

[1] https://lambdalabs.com/service/gpu-cloud#pricing
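The break-even arithmetic, as a sketch you can plug your own numbers into. It uses the $5k card price and $0.80/hr rental rate from above and ignores electricity, resale value, and availability, so it is a rough comparison only. The 4 hours/day figure is just an example of part-time use.

```python
def break_even_hours(purchase_usd: float, rental_usd_per_hour: float) -> float:
    """Hours of rental that cost as much as buying the card outright."""
    return purchase_usd / rental_usd_per_hour

def break_even_months(purchase_usd: float, rental_usd_per_hour: float,
                      hours_per_day: float = 24.0) -> float:
    """Calendar months to reach break-even at a given daily usage."""
    days = break_even_hours(purchase_usd, rental_usd_per_hour) / hours_per_day
    return days / 30.44  # average days per month

hours = break_even_hours(5000, 0.80)
print(f"{hours:.0f} hours to break even")
print(f"{break_even_months(5000, 0.80):.1f} months at 24/7")
print(f"{break_even_months(5000, 0.80, hours_per_day=4) / 12:.1f} years at 4 hours/day")
```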


That seems like a lot, but that's only ~8 months of usage. If you are doing consistent work with large models, or plan to for over a year, then it makes sense to at least have some hardware.

Something people forget too is that if you have no Nvidia GPUs at all locally, you'll need to spend a significant amount of time installing a new node, copying data, and debugging in your cloud instance, each time you want to do something, while being charged for it. It's a pretty big boost in terms of my time to develop locally and then scale to the cloud once something smaller scale is working.


I agree especially with the second argument.

But most people who toy with LLMs will probably never make money out of them. Even those who do will often spend a lot of time getting their bearings, during which the GPU sits idle. Then you begin to ramp up your use, but by that time there's a new generation of GPUs out.

That's why my recommendation is to start with something lightweight.

It's also much less frustrating to start working for a few hours on a rented A100 rather than running into OOMs all the time while fine-tuning batch sizes and waiting for the nth highly quantized model to download.


8 months of 24/7 usage, so for most people it will still take years.


Fair enough - I wouldn't recommend going with a 5K GPU for home use either. 3090s or 4090s!

I have 2 4090s personally, which is perfect for pretty serious 7B fine-tuning and inference, and doing development work on smaller stuff before scaling to larger runs in the cloud.

At work anything less than 8 GPUs per run is small time stuff - we sometimes scale up to 128 or 256 GPUs for some runs.


Just to clarify, because this advice might be misleading. These LambdaLabs prices are pretty much meaningless, because there are no available instances currently, and haven't been for months. The last time I saw an available _hourly_ A6000 instance was more than 6 months ago. Forget about H100. You might be able to get a reserved instance if you're willing to commit a significant enough amount, but even that is probably impossible right now for H100 instances.


Rationally, this makes sense.

Emotionally, it doesn't. The problem is if I own something, I'll use it freely. If I rent a GPU, I'll be stressing and counting pennies. In practice, I'll use it less.

On the whole, I'd rather buy even if it costs more, because I'll use it, and in the long term, that pays dividends.

That's not everyone. That's me.


I had a similar question and after reading far too much I put together https://coinpoet.com

I'd love to have others here try it out and give me some feedback on how I could make it useful. It's only a couple weeks in but already seems valuable to me. What am I missing?


An eBay 3090 or a new 4090 at $1599 (Founders Edition, Gigabyte Windforce V2) is the best price/performance/ease of use, in my opinion.

AMD is too funky for most still. I have an Mi60 that won’t load drivers due to some PSP (platform security processor) missing firmware on the GPU…


I would really like a "reasonable" monthly price for a VPS with a GPU. Even a consumer card like a 3090.

vast.ai has the best prices I have seen, but it comes with limitations, and it's still not the best deal for an entire month.


Given the cost of the hardware and power and everything else required to run it and support it, how can you say that 20 cents an hour is unreasonable? There's very little profit margin there. At this price it would take roughly a year for them to make a profit. If you need continuous usage at the lowest price then you need to buy a GPU on ebay.


Nvidia has a datacenter tax in their driver terms of use, so you won't find consumer-card VPSes at consumer-like prices.


Most offers are fine, the A100 I rented was a scam, however. A scam in terms of: advertised as A100, performing like a 1080. I guess the seller partitioned the card or rigged the id. You can report frauds like this on their page but only while you are renting.


I would hope vast.ai would be able to detect MIG at least.

It could also be a low power cap - I had a Dell C4140 for a bit with 220V power supplies and 120V power, locking the entire thing to ~50% of the max power cap per GPU basically.


That is fine when you are using your own machine and fraud when you rent it to others.


Checked out vast.ai but you can get down to the ~$0.34/hr at Runpod depending on how much vram you need.


Is shared RAM of any use? For instance a mini PC with an AMD 7940HS chip and 64GB of DDR5 RAM costs about 800€. At less than half the price of just a GPU, I am not expecting any great results, but is it usable?


I was curious about this topic too because M3 macs have "Unified memory" which is shared amongst their CPU/GPUs. Anyone have a link or explanation of how this works?


One of the main bottlenecks for inference is memory bandwidth (esp when dealing with huge models, like SD/SDXL) and for that, nothing I know of comes close to matching memory speeds on Apple Silicon (up to 400GB/s).


You can run very large models on a Macbook M2 with 96 GB. They run 1/3 to 1/4 slower in tokens/s than the faster hardware, but they fit in memory.

(400 GB/s is a lot in the form factor but the 4090 and equivalent have 1 TB/s, and H100s several times that)

Edit: Here someone asked the same question: https://www.reddit.com/r/LocalLLaMA/comments/14319ra/rtx_409...
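A back-of-the-envelope way to see why bandwidth matters so much for LLM inference: if generating each token means streaming all the weights from memory once, tokens/s is capped at bandwidth divided by model size. A sketch under that assumption (it ignores compute, caches, batching, and software efficiency). The 13GB model size is a hypothetical stand-in for something like a 13B model at 8 bits per weight.

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on decode speed if every token reads all weights once."""
    return bandwidth_gb_s / model_size_gb

model_gb = 13.0  # hypothetical model size in GB; small enough for a 24GB card
for name, bw in [("Apple Silicon (400 GB/s)", 400.0),
                 ("4090 class (~1000 GB/s)", 1000.0)]:
    print(f"{name}: at most {max_tokens_per_second(bw, model_gb):.1f} tokens/s")
```

Bandwidth alone gives about a 2.5x gap between the two. Measured gaps can be larger or smaller, since compute and software maturity matter too.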


Thanks, I've been trying to find this information - Mac shared memory vs Nvidia VRAM performance differences - for the longest time and your answer and the Reddit link were both super helpful!


You’re welcome — it’s too late for me to edit my wording but hopefully you understood what I meant by “1/3 to 1/4 slower.”

That is ambiguous, instead I should have said that models that fit in memory on both take 3-4 times as long on the M2 as they do on the 4090.


As far as I know, only 16GB of VRAM can be allocated on the 7940HS (via BIOS)

so probably you can expect similar results:

https://old.reddit.com/r/Amd/comments/15t0lsm/i_turned_a_95_...

HN https://news.ycombinator.com/item?id=37162762


Older generation RTX 8000 / 48GB are reasonable.

One big disadvantage of the older Turing cards: no bfloat16. But if you run a quantized/mixed precision model or QLoRA, it doesn’t hurt as much.


If I just needed a GPU for learning purposes, is 2xGPUs necessary? Would a single 24GB GPU significantly bottleneck training with any publicly available datasets? Just need something faster than my laptop, but if it takes twice as long, not really an issue.


Training what on what - Resnet50 on imagenet? Yeah sure a single 4090 is fine. Will take a bit.

A 1.5B parameter LLM? That’s a few weeks with 64 V100s - on a small dataset.

Training something Llama 7b class? (Not using lora)? Weeks with the same number of A100s.

With lora? Back to a single 4090 - depending on your dataset. It still might take weeks to go through 2000 examples for finetuning with a large context size.


Try to get motherboard-based video output to handle your OS; that way all the gpu ram is available. You can also get a cheap 2nd gpu if it fits in the case or open rack setup you have in mind.

The exllamaV2 I run allows multiple gpu’s of different ram amounts.
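To illustrate the idea of spreading a model over mismatched cards, here is a hypothetical sketch that assigns transformer layers in proportion to each card's VRAM. This is not how exllamaV2 itself does it (I'm not describing its internals). It assumes equally sized layers and ignores per-card overhead.

```python
def split_layers(num_layers: int, vram_gb: list[float]) -> list[int]:
    """Assign layers to GPUs in proportion to VRAM (largest-remainder rounding)."""
    total = sum(vram_gb)
    exact = [num_layers * v / total for v in vram_gb]
    counts = [int(x) for x in exact]  # round each share down first
    # Hand the leftover layers to the GPUs with the largest fractional parts.
    leftover = num_layers - sum(counts)
    order = sorted(range(len(vram_gb)),
                   key=lambda i: exact[i] - counts[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts

# 32 layers over a 24GB card and a 12GB card.
print(split_layers(32, [24.0, 12.0]))
```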


Booting in console mode saves some GPU memory that the Ubuntu graphical interface would otherwise use, in case you are almost able to run a model. It is hacky, but it might help someone.


Raja Koduri - 2023-11-25 - https://twitter.com/RajaXg/status/1728465097243406482

"" Very encouraging to see the steady increase of viable hardware options that can handle various AI models.

At the beginning of the year, there was only one practical option - nVidia. Now we see at-least 3 vendors providing reasonable options. Apple, AMD and Intel. We have been profiling several options and I will share some of our findings here.

The good stuff

- Apple Macs were a pleasant surprise on how easy it is to get various models running

- AMD also made impressive progress with PyTorch and a lot more models run now than even 4-5 months ago on MI2XX and Radeon

- We tried both Intel Arc and Ponte Vecchio and they were able to execute everything we have thrown at them.

- Intel Gaudi has very impressive performance on the models that work on that architecture. It's our current best option for LLM inference on select models.

- Ponte Vecchio surprised us with its performance on our custom face swap model, beating everyone including the mighty H100. We suspect that our model may be fitting largely in the Rambo cache.

The wishlist

- For training and inference of large models that don't fit in memory - nVidia is still the only practical option. Wishing that there are more options in 2024 here

- While compatibility is getting better, a ton of performance is still left on the table on Apple, AMD and Intel. Wishing that software will keep getting better and increase their HW utilization. There is still room on compatibility as well, particularly with supporting various encodings and model parameter sizes on AMD.

- Intel Gaudi looks very promising performance-wise and wishing that more models seamlessly work out of the box without Intel intervention.

- Wishing that both AMD and Intel release new gaming GPUs with more memory capacity and bandwidth.

- Wishing that Intel releases a PVC kicker with more memory capacity and bandwidth. Currently it's the best option we have to bring our artists workflow with face swap training from 3-days to a few hours. It scales linearly from 1-GPU to 16-GPUs.

- Wishing Intel support for PyTorch is as frictionless as AMD and nVdia. May be Intel should consider supporting PyTorch RocM or up-stream OneAPI support under CUDA device.

really grateful to all vendors for providing access to hardware and developer support.

Looking forward to continue filling our data center with interesting mix of architectures. ""


Related question, does anyone have experience with using the AMD MI100 for deep learning? With 32GB and a second hand price of ~1100 USD, it could be a good choice.


I've been really curious about these, but my experience with an MI60 and partially my 6900XT has not endeared me towards using AMD cards - the MI60 refuses to init in Linux due to some PSP firmware issue, and the 6900XT is missing pre-compiled HIP stuff, leading to super long initial launches as it JIT-builds the kernels, at least in PyTorch.

Allegedly they perform near an A100, so raw-compute wise, memory capacity wise, and memory bandwidth wise, they rock. As is typical for anyone not Nvidia, the software is still playing catch up. To be fair, Nvidia themselves take nearly a year to build out all CUDA features for some of their cards - FP8, for example, has only recently become usable on a 4090.


if it were me right now, I'd def go with a used 4060 or ti


> Last I looked, NVidia worked well, and AMD was horrible.

are you sure ?


used 3090 is the best option imo




I’m using 4x A6000 Ada cards for EM simulation. They have 48 GB ECC and 2-slot width so can be accommodated in a server case, versus 24 GB non ECC and 3-slot width for the 4090. They are actually faster for FP32 than the A100, but really poor for FP64.



