Hacker News | lloyd-christmas's comments

I read another post earlier today that's oddly similar and has more explicit data on that author's codebase: https://codepointer.substack.com/p/cutting-llm-token-costs-w...

TL;DR: ~3-4% savings on actual API costs with rtk, caveman, and headroom combined, but nothing tangible on whether those cost reductions came at the cost of quality. By their calculations, rtk saved them $4.96 on a $926 bill.
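
For scale, here's the rtk figure as a share of the bill. This is a quick sketch that assumes the $926 is the total the $4.96 should be measured against (the summary above doesn't say whether it's the bill before or after savings); the function name is mine, not from the post.

```python
def savings_share_pct(saved_usd: float, bill_usd: float) -> float:
    """Return the savings as a percentage of the bill."""
    return saved_usd / bill_usd * 100


# Figures quoted above: $4.96 saved on a $926 bill.
rtk_share = savings_share_pct(4.96, 926.0)
print(f"rtk share of bill: {rtk_share:.2f}%")
```

So rtk alone comes out to roughly half a percent, well under the ~3-4% combined figure.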


^recommend reading this one

Not the person you asked, but I have a 9700 which has the same VRAM, and running Q6 on it with unquantized KV gives me 50k context. Putting -ctv q8_0 ups that to 70k. I normally run Q4 with unquantized KV @ 130k at 50 t/s (mtp 3), with the disclaimer that I'm running PCIe gen4x8, so I'm slightly slowed. I've found that quantizing K leads to broken JSON on tool calls, which is fairly unrecoverable, but YMMV.
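
A back-of-envelope for why quantizing only V buys that much context, assuming "unquantized" means an f16 cache (2 bytes per element) and that q8_0 stores each block of 32 values as 32 int8s plus one fp16 scale (34 bytes per 32 elements). It also assumes K and V are the same size per token and ignores compute buffers and any model-specific cache layout, so treat it as a rough sketch rather than what llama.cpp actually allocates.

```python
F16_BYTES = 2.0        # bytes per element, assumed baseline cache type
Q8_0_BYTES = 34 / 32   # 32 int8 values + one fp16 scale per block


def context_gain(k_bytes: float, v_bytes: float) -> float:
    """Ratio of context that fits in a fixed cache budget vs. f16 K and V."""
    baseline = F16_BYTES + F16_BYTES
    return baseline / (k_bytes + v_bytes)


# Only V quantized to q8_0, K left at f16.
gain_v_only = context_gain(F16_BYTES, Q8_0_BYTES)
print(f"estimated context gain: {gain_v_only:.2f}x")
print(f"50k context becomes roughly {50_000 * gain_v_only:,.0f} tokens")
```

That lands around 1.3x, which is the same ballpark as the 50k to 70k jump reported above.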

I run qwen 27B:Q4 @ 130k context at 50 t/s on a single R9700, and have a 7900XT that runs mellum 12B:Q8 as its subagent. R9700s also do really well when underclocked at low wattage. It's designed to run at 300W; mine is throttled to 210W and only sees an 8% slowdown. If I had somewhere else to put my desktop in my house, I'd bump it up to 240W and there would be zero perf degradation.
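
Put as perf per watt, using only the numbers above. This sketch assumes the card actually draws its power cap in both cases, which is a simplification; the function name is made up for illustration.

```python
def perf_per_watt_gain(stock_w: float, capped_w: float, slowdown: float) -> float:
    """Relative perf-per-watt of the capped card vs. stock (1.0 = no change)."""
    relative_perf = 1.0 - slowdown
    relative_power = capped_w / stock_w
    return relative_perf / relative_power


# 300W stock, 210W cap, 8% slowdown, as quoted above.
gain = perf_per_watt_gain(300.0, 210.0, 0.08)
print(f"perf per watt vs. stock: {gain:.2f}x")
```

That's 92% of the speed for 70% of the power.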

I thought the same thing when I started using local models, but the reality is that, for a given context depth, the token generation speed doesn't change whether you generate 128 or 8000 tokens; it just lengthens the benchmark run time.


I suspect this is it. I'm 40, and the only tech person in my social circle. Many of my friends were excited about using it for things like basic webdev and home networking. One-shotting that type of stuff is very viable even if you don't know anything about the topic. Now that they are trying to use it for something they actually know about, suddenly it's unusable. It's a modification of Gell-Mann Amnesia.


Or write your own custom one with the library that backs it: https://github.com/FluidInference/FluidAudio

I did that so that I could record my own inputs and finetune parakeet to make it accurate enough to skip post-processing.


There's a fork of FluidAudio that supports the recent Cohere model: https://github.com/altic-dev/FluidAudio/tree/B/cohere-coreml...

It's used by this dictation app: https://github.com/altic-dev/FluidVoice/


When I worked remotely, I found I worked more hours than on the days I went into the office and spent 40 minutes commuting. It's a lot harder to walk away from work when it is embedded in your home life. I actually quit my most recent job when they decided to close the office and move to full-time remote work, so YMMV.


A compromise could be a co-working space or separate office in the house. It is very beneficial to have a clear divider between home and work life, so if you felt you weren't achieving it I totally understand the departure.


For me, I still have no idea what the visual of a Venn diagram is trying to tell me within the context of SQL. Just having a visual that I can't even understand does nothing to benefit the situation either.


Employee Resource Group, I assume.


> the sheer amount of screen real-estate and convenience is really unparalleled.

Personally, that's my exact complaint. I miss smartphones I could operate with one hand without straining my thumb to reach the opposite corner. I want a quality smartphone with a screen size ~4.5". I don't really want a screen larger than that. It would be nice if the big brands made a small version instead of a "normal" and "XL". The iPhone homepage currently has the motto "Welcome to the big screens". If I wanted a tablet, I'd buy a tablet.

