You could keep the multimodal projector (which handles audio, images & PDFs) in system RAM with `--no-mmproj-offload` in llama.cpp.
Of course, it then isn't GPU-accelerated, but you save the VRAM it would otherwise take up.
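A minimal sketch of what that looks like with `llama-server`. It assumes a model loaded with `-m` plus a separate projector file passed via `--mmproj`, with the model's layers offloaded via `-ngl`. The `.gguf` file names and the `build_cmd` helper are hypothetical placeholders. The script only prints the command so it can be inspected first.

```shell
#!/bin/sh
# Sketch: print a llama-server command that offloads the model's layers
# to the GPU but keeps the multimodal projector (mmproj) in system RAM.
# File names are hypothetical placeholders; substitute your own .gguf files.
MODEL="model-Q4_K_M.gguf"
MMPROJ="mmproj-model-f16.gguf"

build_cmd() {
  # -ngl 99: offload up to 99 layers, i.e. effectively all of them
  # --no-mmproj-offload: keep the projector on the CPU / in system RAM
  echo "llama-server -m $MODEL --mmproj $MMPROJ -ngl 99 --no-mmproj-offload"
}

# Printed rather than executed; run the printed line to start the server.
build_cmd
```

If VRAM is still tight, lowering `-ngl` moves some model layers back to RAM as well. Expect that to cost more speed than moving the projector.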
- Block all of its network traffic with a firewall (so it doesn't auto-update).
- `sudo pkill -9 LogiPluginService && sudo rm -rf /Applications/Utilities/LogiPluginService.app` (so a useless service doesn't keep running in the background and eating resources).
The funniest thing is how synchronicity worked its magic:
Roo Code's experimental code indexing with a vector DB dropped 3 days ago. They're using Tree-sitter (the same parser Aider uses) to parse source files into ASTs and embed the resulting code chunks as vectors, instead of embedding plain text.
Can you add a recent arm64 build of llama.cpp to the results? I'm really interested in comparing MLX with llama.cpp, but setting up MLX seems too difficult for me to do by myself.
I ran them again several times to make sure the results were fair. During my previous runs, I had also forgotten that a different 30B model was still loaded in the background.
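When comparing repeated runs, reporting a mean and spread is more informative than a single number. Before each batch, it's worth confirming nothing else is loaded, for example in Activity Monitor. This is a small sketch, assuming each run's throughput (tokens/second) has been collected one value per line. The `summarize` helper is hypothetical, and the input values are made-up placeholders, not measured results.

```shell
#!/bin/sh
# Sketch: summarize repeated benchmark runs given as tokens/second,
# one value per line on stdin. Prints run count, mean and population stddev.
summarize() {
  awk '
    { n++; sum += $1; sumsq += $1 * $1 }
    END {
      if (n == 0) { print "no data"; exit 1 }
      mean = sum / n
      var = sumsq / n - mean * mean
      if (var < 0) var = 0   # guard against tiny negative rounding error
      printf "runs=%d mean=%.2f stddev=%.2f\n", n, mean, sqrt(var)
    }'
}

# Placeholder values, not real measurements.
printf '%s\n' 40.0 42.0 44.0 | summarize
```

A stddev that is large relative to the mean is a hint that something else, such as a forgotten background model, was competing for memory or compute during some runs.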
LM Studio is an easy way to use both MLX and llama.cpp.