Ollama is kind of ok to get started, but as I understand it they don't give you ...

solarkraft · on Nov 10, 2024

All in the name of UX. It’s modeled after Docker, so it defaults to doing things that way. Really does make for great ease of use, imo.

lurking_swe · on Nov 10, 2024

You can. When you search for a model on the Ollama website, there is a dropdown that lets you select a “tag”, sort of like a Docker image tag. This lets you pick the quantization you want.

example: https://ollama.com/library/llama3.2/tags
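
To make the tag idea concrete, here is a rough sketch of how a model name and a tag combine into a single Docker-style reference. The helper names `model_ref` and `pull_command` are made up for illustration, and the tag `3b-instruct-q4_K_M` is only an assumed example; check the tags page for what is actually published.

```python
# Hypothetical helpers (not part of Ollama) showing the name:tag convention.

def model_ref(name: str, tag: str = "latest") -> str:
    """Join a model name and a tag into a Docker-style reference."""
    return f"{name}:{tag}"


def pull_command(name: str, tag: str = "latest") -> list[str]:
    """Build the argument list for an `ollama pull` of a specific tag."""
    return ["ollama", "pull", model_ref(name, tag)]


if __name__ == "__main__":
    # The tag below is an assumed example of a quantization tag.
    print(" ".join(pull_command("llama3.2", "3b-instruct-q4_K_M")))
```

If you leave the tag off, you get whatever the default tag points to, which is why it can look like there is no choice of quantization.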

mseri · on Nov 11, 2024

You can choose the quantization by appending the right tag to the model name, but they don't support some of the more advanced features (e.g. you need a special flag to enable flash attention, and you can't use KV cache quantization for large contexts).