Click coordinates. Agentic GUI work is really annoying when the multimodal agent cannot click on x,y coordinates.
I tested Qwen3.6, Gemma4, Nemotron3-nano-omni. They fully hallucinate x,y coords.
(did not try GLM-5V yet)
GPT-5.5 can easily do it. But also Vocaela, a tiny 500M model, is quite good at it. Hope they improve the training for x,y clicking soon on the smallish multi-modals.
Recently slopped an HTTP service together just so my local models can click, instead of relying on all the wild ways agents currently hack into the browser (browser-use, browser-harness, agent-browser, dev-browser etc.): https://github.com/julius/vocaela-click-coords-http
This sounds a lot like another Hacker News post from the last few days. Image generators have the same problem with a prompt like "produce the numbers 1-50 in a spiral pattern": they can't count properly. But if you break it into a raster/vector split, where the model first produces the visual content and then an SVG overlay, it's completely capable.
Have you tried a two-step approach: review the image, then render a vector?
Maybe there is a smart trick to get them to do the right thing, but the things I tried did not work.
At one point I had a smaller model draw bounding boxes around everything that looked interactable, with labels like "e3", then asked the model to tell me "click on e3". That did not work in my tests; it was pretty much as bad as x,y.
Yeah, I've held off on doing any kind of RAG until there are models that properly handle layout detection and partitioning, because it's so easy to generate shitty data if you don't attend to the visual cues before you slice up a document.
Qwen3.5 is able to output click coordinates and bounding boxes just fine, as values normalized to 0..1000. I’d hope Qwen3.6 didn’t lose this capability.
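For anyone wiring this up, here is a minimal sketch of turning 0..1000 normalized values into pixel coordinates for a click. The function names `to_pixels` and `bbox_center` are made up for illustration, and it assumes the values are normalized against the same screenshot you click on (some pipelines resize the image first, in which case you need to account for that).

```python
def to_pixels(x_norm, y_norm, width, height, scale=1000):
    """Map a point normalized to 0..scale onto pixel coordinates
    of a width x height screenshot (hypothetical helper)."""
    # Integer math avoids float rounding; clamp so scale maps to the last pixel.
    x = min(x_norm * width // scale, width - 1)
    y = min(y_norm * height // scale, height - 1)
    return max(x, 0), max(y, 0)


def bbox_center(x1, y1, x2, y2):
    """Center of a bounding box given as (x1, y1, x2, y2), still normalized."""
    return (x1 + x2) // 2, (y1 + y2) // 2


# Example: the model returns a normalized box for a button.
cx, cy = bbox_center(400, 200, 600, 300)
print(to_pixels(cx, cy, 1920, 1080))  # (960, 270)
```

Clicking the box center rather than a corner is just a common choice, not something the model requires.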
Less information loss -> Less params? Please correct me if I got this wrong. The Intro claims:
"The dot product itself is a geometrically impoverished measure, primarily capturing alignment while conflating magnitude with direction and often
obscuring more complex structural and spatial relationships [10, 11, 4, 61, 17]. Furthermore, the way current activation functions achieve non-linearity can exacerbate this issue. For instance, ReLU (f(x) = max(0, x)) maps all negative pre-activations, which can signify a spectrum of relationships from weak dissimilarity to strong anti-alignment, to a single zero output. This thresholding, while promoting sparsity, means the network treats diverse inputs as uniformly orthogonal or linearly independent for onward signal propagation. Such a coarse-graining of geometric relationships leads to a tangible loss of information regarding the degree and nature of anti-alignment or other negative linear dependencies. This information loss, coupled with the inherent limitations of the dot product, highlights a fundamental challenge."
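To make the quoted ReLU point concrete, here is a tiny sketch with one neuron and made-up 2D vectors (the weights and inputs are purely illustrative, not from the paper). Three inputs with very different relationships to the weight vector all come out as the same zero.

```python
def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def relu(z):
    """ReLU: f(x) = max(0, x)."""
    return max(0.0, z)


w = [1.0, 0.0]  # illustrative weight vector of a single neuron

inputs = {
    "orthogonal": [0.0, 1.0],
    "weakly dissimilar": [-0.1, 1.0],
    "strongly anti-aligned": [-5.0, 0.0],
}

# Pre-activations differ, but ReLU maps all of them to 0.0.
pre = {name: dot(w, x) for name, x in inputs.items()}
post = {name: relu(z) for name, z in pre.items()}

for name in inputs:
    print(f"{name}: pre-activation={pre[name]}, after ReLU={post[name]}")
```

After the activation, the next layer cannot tell these three cases apart, which is the information loss the intro is describing.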
Yes, since you can learn to represent the same problem with fewer params.
However, most architectures are optimized for the linear product, so we gotta figure out a new architecture for it.
Lots of people who have relatively stable currencies (EUR, USD..) do not want to use bitcoin. What if bitcoin price goes down? How many extra steps is it to convert my USD to bitcoin and then back to USD? Do I only convert the 19.99 USD for my current purchase into bitcoin or do I put in more?
Do you solve these issues for customers? Or are you only targeting people who already are happy bitcoin wallet users? Are stablecoins part of your strategy?
Given how Visa, Mastercard and PayPal are seen as bad actors, do you think you can capitalize on that, possibly by partnering with Valve or something of that sort?
We as MoneyBadger create an invoice for the customer in their local currency e.g. USD. If they pay with Bitcoin Lightning, they have 3 minutes to complete the transaction at our offered exchange rate. We take on the risk of the price moving.
If they’re paying with one of the exchange wallets we support, like Luno.com, VALR.com or Binance.com, we do the same, and they can choose to pay with any currency supported by those wallets.
Refunds are for the original amount in the currency of the invoice (e.g. USD), converted at the exchange rate at the time of the refund.
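A rough worked example of that arithmetic, as a sketch only: the exchange rates, the `fiat_to_sats` helper and the rounding rule below are all invented for illustration and are not MoneyBadger's actual rates or implementation.

```python
from decimal import Decimal, ROUND_HALF_UP

SATS_PER_BTC = Decimal(100_000_000)


def fiat_to_sats(amount_fiat, fiat_per_btc):
    """Convert a fiat amount to satoshis at a given rate (hypothetical helper).
    Rounding half-up to whole sats is an assumption."""
    sats = amount_fiat / fiat_per_btc * SATS_PER_BTC
    return int(sats.quantize(Decimal(1), rounding=ROUND_HALF_UP))


invoice_usd = Decimal("19.99")

# Rate quoted when the customer pays (made-up number).
pay_sats = fiat_to_sats(invoice_usd, Decimal("100000"))

# Rate at the time of the refund (made-up number, price has dropped).
refund_sats = fiat_to_sats(invoice_usd, Decimal("80000"))

print(pay_sats)     # 19990
print(refund_sats)  # 24988
```

The fiat amount stays fixed at 19.99 USD in both directions; only the amount of bitcoin changes with the rate.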
It really all just works the same as paying with a credit card overseas would if you’re paying a EUR bill with USD funds.
At first I thought you made a website that gives me an empty Markdown file. But I am glad I downloaded it; it's actually a pretty nice template.
What are you personally doing with the yearly goals in that file? Are you copying and pasting them from last week, or typing them out every time to reiterate (and possibly even modify) them?
Yeah, currently I am just copy/pasting the Yearly Goals section over. I want to eventually add a feature to allow someone signed up for the email to edit that part. Then someone could modify that goal section and have it correctly emailed each week.
Mercury sounds interesting. Requires a certain scale though (gravity is a bitch).
Considering just the initial mining and construction, bodies with low gravity and proximity to Earth feel like an efficient starting point, right? I always thought the Moon would be a good place to bootstrap the first few thousand space habitats.
Your point about energy will probably be the biggest deal. Wondering how complicated it would be to ship a bunch of nuclear reactors to the Moon. There seem to be quite a few companies working on small, "mass produced" reactors currently.