bruce343434's comments

You wouldn't have spent that time with loved ones; you would have been doing other tasks. Just like now: even though we no longer need to wait for programs to be submitted, compiled, and run on the mainframe, we don't get that time for ourselves.


I'm sure some of that time would have been spent on other tasks & goals I have, yes. But I work for myself. This isn't a case of an employer capturing all the value AI unlocks & cramming more tasks into the same employee hours. I'm able to capture some of the value for myself, either by taking more free time or by converting the savings into extra productivity / tasks solved.

As for the mainframe analogy, that's interesting, because I spend a lot of time waiting for the AI to think and complete its work. So I'm often out mowing the lawn or doing other things while I'm waiting for AI to finish. Sometimes I'm working with a second or third AI, but sometimes the usage limits won't allow that, so I may as well use the time for myself while the AI codes.


What does it mean to be friendly to memory bandwidth, and why does C++ excel at it, over, say, Fortran or C or Rust?


Actually, C, FORTRAN and C++ are all friendly to memory bandwidth, when written correctly.

C++ is better than FORTRAN because, while FORTRAN is still being developed and is quite fast, doing things beyond what core FORTRAN is good at is hard. At the end of the day, it computes and works well with MPI; that's mostly all.

C++ is better than C because it can accommodate C code inside it, has many more convenience functions and libraries around it, and modern C++ can be written more concisely than C, with minimal or no added overhead.

Also, all three languages are studied so well that advanced programmers can look at a piece of code and say, "I can fit that into the cache; that'll work, that's fine."

"More modern" programming languages really solve no urgent problems in HPC space and current code works quite well there.

Reported from another HPC datacenter somewhere in the universe.


I suppose that most HPC problems are embarrassingly parallel™, and have very little if any mutable shared state?


I'd say that the opposite is more often the reality, which is why HPC systems tend to have high-bandwidth, low-latency networks.


High bandwidth may mean the need to consult some very large but immutable data structure. As a trivial example, multiplying two matrices requires accessing each matrix fully multiple times over, but neither of them is altered in the process, so it can safely be done in parallel. Recording the result of a (naive) matrix multiplication can also be done without programmatic coordination, because each element is only updated once, independently from others.
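For concreteness, a minimal sketch of that pattern (my example, not from this thread; it assumes flat row-major buffers and OpenMP):

    #include <vector>

    // Naive matrix multiply: A and B are only ever read, and each element
    // of C is written exactly once, so rows can be computed in parallel
    // with no locks and no coordination.
    void matmul(const std::vector<double>& A, const std::vector<double>& B,
                std::vector<double>& C, int n) {
        // Each row i of C is independent of the others.
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j) {
                double sum = 0.0;
                for (int k = 0; k < n; ++k)
                    sum += A[i * n + k] * B[k * n + j];
                C[i * n + j] = sum;  // the sole write to this element
            }
    }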

This is very unlike, say, a database engine, where mutations occur all the time and may come from multiple threads.

Rust specifically makes it hard or impossible to clobber shared mutable state, e.g. to produce a dangling pointer. But this is not a problem that our matrix-multiplication example would have, so it won't benefit from being implemented in Rust. Maybe this applies to more classes of HPC problems.


The HPC infrastructure is not like what you're used to using. It is very high bandwidth, but latency depends on where your data lives. There are a lot more layers that complicate things, and each layer has a very different I/O speed:

https://extremecomputingtraining.anl.gov/sites/atpesc/files/...

Also, how to handle the data can be very different. Just see how libraries like the one below work. They take advantage of those burst buffers and try to minimize what's being pulled from storage. Though there's a lot of memory management in the code people write to do all the complex stuff you need, so that you aren't waiting around for disks... or worse... tape:

https://adios-io.org/applications/


On the contrary. However, they tend to manually manage memory rather than outsourcing it to a language runtime or a distributed key-value store.


I'd say it's being able to structure your data however suits your problem and your hardware, then being able to look at a profile and map reads/writes back to source. Both C and C++ excel at this.

The advantage of C++ over C is that, with care, you can write zero-cost abstractions over whatever mess your data ends up as, and make the API still look intuitive. C isn't as good here.
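For example (my sketch, not the parent's code): a thin typed view over whatever flat buffer your data ends up in. operator() inlines to the same pointer arithmetic you'd write by hand in C, so the nicer API costs nothing.

    struct MatrixView {
        double* data;
        int rows, cols;
        // Inlines to data[i * cols + j]: no bounds checks, no overhead.
        double& operator()(int i, int j) { return data[i * cols + j]; }
    };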


In my experience, the "zero-cost abstractions" from C++ most of the time make the code more annoying to maintain and/or understand, especially with respect to resource management; they introduce compatibility issues at the toolchain level; and, even when they look perfect in toy benchmarks, they are often not even zero-cost (e.g. all the bloat that templates generate often hurts).


Is Fortran 90 not flexible enough in defining data layout?


The parent talks about new languages; as per the article, Fortran and C are doing fine. I speculate that the benefit of C++ over Rust is how it lets programmers give the compiler guarantees that go beyond the initial semantics of the language. See __restrict, __builtin_prefetch and __builtin_assume_aligned. The programming language is a space for conversation between compiler builders and hardware designers.
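For concreteness, roughly what those look like in use (my sketch; GCC/Clang spellings, and these are compiler extensions rather than standard C++):

    void scale(float* __restrict dst, const float* __restrict src, int n) {
        // Promise 64-byte alignment so the compiler can vectorize freely.
        src = static_cast<const float*>(__builtin_assume_aligned(src, 64));
        dst = static_cast<float*>(__builtin_assume_aligned(dst, 64));
        for (int i = 0; i < n; ++i) {
            __builtin_prefetch(src + i + 16);  // hint: fetch ahead of use
            dst[i] = 2.0f * src[i];
        }
    }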


It is just super unpleasant to write low-level software in Rust.

There is a colossal ergonomics difference if you compare using Clang vs. Rust to write a hashmap, for example.

C compilers just have everything you can think of, because everything is first implemented there.

Using anything else just seems kind of pointless. I understand new languages do have benefits, but I don't believe language matters that much, really.

The person who writes that garbage pointer soup in C writes Arc<> + multithreaded + macro garbage soup in Rust.


I believe __restrict and __builtin_prefetch/__builtin_assume are compiler extensions, not part of the C++ language as such, and different compilers implement (or don't implement) them differently.

The Rust compiler actually has similar things, but they're not available in stable builds. I suppose there are some issues of principle behind not including them in stable. E.g.: https://doc.rust-lang.org/std/intrinsics/fn.prefetch_read_da...

Maybe some time in the future good, acceptable abstractions will be conceived for them. Perhaps just using nightly builds for HPC is not that far out, though.


Rust already has __restrict; it is spelled &mut and is one of the most fundamental parts of the language. The key difference, of course, is that it's checked by the compiler, so is useful for correctness and not just performance. Also, for a long time it wasn't used for optimization, because the corresponding LLVM feature (noalias) was full of miscompilation bugs, because not that much attention was being paid to it, because hardly anyone actually uses restrict in C or __restrict in C++. But those days are finally over.

__builtin_assume is available on stable (though of course it's unsafe): https://doc.rust-lang.org/std/hint/fn.assert_unchecked.html

There's an open issue to stabilize the prefetch APIs (https://github.com/rust-lang/rust/issues/146941). As is usually the case when a minor standard-library feature remains unstable, the primary reason is that nobody has found the problem urgent enough to put in the required work to stabilize it. (There's an argument that this process is currently too inefficient, but that's a separate issue.) In the meantime, there are third-party libraries available that use inline assembly to offer this functionality, though this means they only support a couple of the most popular architectures.


BTW, Fortran implicitly behaves as "restrict" by default, which makes sense together with its intuitive "intent" system for function/subroutine arguments. This is one of the biggest reasons it's still so popular in HPC: scientists can pretty much just write down their equations, follow a few simple rules (e.g. on storage order), and out comes fairly performant machine code. Doing the same (a 'naive' first implementation) in C or C++ usually leads to something severely degraded compared to the theoretical limits of a given algorithm on given hardware.


Oh, I actually made an editing mistake; I meant to say that Rust also has restrict by default, by virtue of all references being unique xor read-only.

As I understand it, the Fortran compiler just expects your code to respect the "restrictness", it doesn't enforce it.


So that's where the intent system comes in (an argument can be in/out/inout), as well as the built-in array sizes, because they let you express what you want, and then the compiler will enforce it. In Fortran you kinda have to work hard to invade the memory of one array from another, as arrays are allocated as distinct memory regions with their own space from the beginning. Pointer math is almost never necessary. Because there is built-in support for multidimensional arrays and array lengths, arrays are internally built as flat memory regions anyway, the same way you'd lay out C arrays for good performance (i.e. cache locality), but with simple indices to address them. This then makes it unnecessary to treat memory as aliased by default.

Honestly, I still don't get why people have built up all these complex numerics frameworks in C and C++. Just use Fortran: it's built for exactly this use case, and scientists will still be able to read your code without a CS degree. In fact, they'll probably be the ones writing it in the first place.


There are good reasons to use Fortran, some having to do with the language and many to do with legacy codes. These have to be balanced with the good reasons to avoid using Fortran for new development, which also have to do with the language and its compilers.


To me it just boils down to using the right tool for each job. I definitely wouldn't use Fortran for anything that heavily uses strings. One weakness is also the lack of metaprogramming support. But for numerical code to be run on specific hardware, including GPUs, it's pretty close to perfect, especially since NVIDIA has invested in it.


I’m glad you like it.


restrict is in C99. I’m not sure why standard C++ never adopted it, but I can guess: it can be hard to reason about two restrict’d pointers in C, and it probably becomes impossible when it interacts with other C++ features.
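The hard part in a nutshell (my sketch; restrict is the C99 spelling, __restrict the common C++ extension): restrict is a promise the compiler cannot verify.

    void add(float* __restrict out, const float* __restrict in, int n) {
        for (int i = 0; i < n; ++i)
            out[i] += in[i];  // may be vectorized assuming no overlap
    }
    // add(buf, buf, n);  // undefined behavior: the pointers alias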

The rest are compiler extensions, but if you’re in the space you quickly learn that portability is valued far less than program optimization. Most of the point of your large calculations is the actual results themselves, not the code that got you there. The code needs to be correct and reproducible, but HPC folks (and grant funding agencies) don’t care if your Linux/amd64 program will run, unported, on Windows or on arm64. Or whether you’ve spent time making your kernels work with both rocm and cuda.


In my experience fiddling with compute shaders a long time ago, CUDA, ROCm, and OpenCL are way too much hassle to set up. It usually takes a few hours to get the toolkits and SDKs up and running, that is, if you CAN get them up and running at all. The dependencies are way too big as well: CUDA is 11GB??? Either way, just use Vulkan. Vulkan "just works" and doesn't lock you into NVIDIA/AMD.


Vulkan is a pain for different reasons. Easier to install sure, but you need a few hundred lines of code to set up shader compilation and resources, and you’ll need extensions to deal with GPU addresses like you can with CUDA.


Ah yes, but those hundred lines of code are basically free to produce now with LLMs...


What about the extensions? Are they widely supported?


That is always one check away: https://vulkan.gpuinfo.org/listextensions.php


VK_KHR_buffer_device_address has 91.3% support

and

VK_KHR_variable_pointers has 98.66% support

Looks good to me.


Haha. People have already said what Vulkan is in practice: it's a very convoluted low-level API, in which you have to write pretty complicated 200+ LoC just to get the simplest stuff running. Also, doing compute on NVIDIA in Vulkan is fun if you believe the specs word for word. If you don't, you switch a purely compute pipeline into a graphical mode with a window and a swapchain, and instantly get roughly +20% performance out of that. I don't know if this was a bug or intended behavior (to protect CUDA), but this is how it was a couple of years ago.


On Windows: download a 3GB exe and install

On Linux: add repository and install cuda-toolkit

Does that take a few hours?


Assemble a navy in space, then just airdrop it through the atmosphere?


Is a moral compass something you can teach someone in a short course if they have been lacking it so far in their entire lives?


You say "seems like"; can you argue/show/prove this?


I think that many off-by-one errors are caused by common situations where people can mistakenly mix up index and count. You could eliminate a (small) set of those situations with 1-based indexing: accessing items from the ends of arrays/lists.
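A minimal illustration of the mix-up (my example):

    void example() {
        int a[8] = {};        // count is 8, valid indices are 0..7
        // int bad = a[8];    // the classic bug: count used as an index
        int last = a[8 - 1];  // 0-based: last element is at count - 1
        (void)last;
        // With 1-based indexing the last element would sit at index 8
        // (the count itself), removing this particular "- 1".
    }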


And in turn you'd introduce off-by-one errors when people confuse the new 1-based indexes with offsets (which are inherently 0-based).

So yeah, no. People smarter than you have thought about this before.


As a developer who spent a couple of months developing a microservice using AWS Lambda functions:

it SUCKS. There's no interactive debugging. Deploying takes a minute or five depending on the changes; then you trigger the lambda and wait another five minutes for all the logs to show up. Then you proceed with printf/stack-trace debugging.

For reasons that I forgot, running the lambda code locally on my dev box was not an option. Neither was deploying the cloud environment locally.

I wasn't around for the era, but I imagine it's like working on an ancient mainframe with long compile times and a very slow printer.


I've witnessed developers editing Lambda code live in the AWS console. It is extremely painful to watch.


Lol exactly


https://alphacephei.com/vosk/install

> We also provide a websocket server and grpc server which can be used in telephony and other applications. With bigger models adapted for 8khz audio it provides more accuracy.


Quiet chaos.


What's wrong with the LUKS password protection?


There is nothing "wrong" with passwords, but they have trade-offs like most approaches. The actual LUKS key is usually wrapped in one or more password-protected records, commonly stored on the media itself by default. That wrapping is usually weaker than the key itself.

Note that 10,000 GPUs can brute-force passwords rather quickly (a pre-sharded search space is fast), and key exfiltration from targeted individuals/firms still happens.

Options like a modern TPM include anti-brute-force features, but have other attack surfaces. Everyone has their own risk profile, and it is safer to assume that if people want in... they will get in sooner or later. ymmv =3

