
> The underlying LLM is still the same in all these scenarios. Where is the boundary?

No, it's not the same LLM; you'd have to change the LLM in all of those cases. How does it receive input from the GAN? The typical LLM is constructed to receive, quite literally, a sequence of encoded tokens. There are vision transformers, which do chunk images into tokens, and there are multimodal transformers, but none of these are fairly described as an LLM, and they're structurally different from something like ChatGPT. And after the structural changes, it would need to be trained on new data that associates text sequences with image sequences; after being optimized in that context, you have a _different model_.
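To make the structural point concrete, here's a minimal sketch (in PyTorch; the class names and sizes are invented for illustration, not any particular model's actual architecture) of why a text-only LLM can't simply be handed an image:

    import torch
    import torch.nn as nn

    class TextOnlyLM(nn.Module):
        # A text LLM's front end maps integer token IDs to embeddings.
        def __init__(self, vocab_size=50_000, d_model=768):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, d_model)

        def forward(self, token_ids):        # (batch, seq_len), integer dtype
            return self.embed(token_ids)     # anything else is a type error

    class VisionPatchEmbed(nn.Module):
        # A ViT-style front end: chunk the image into patches and project
        # each patch to the embedding width. A structurally different module.
        def __init__(self, patch=16, d_model=768):
            super().__init__()
            self.proj = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)

        def forward(self, images):           # (batch, 3, H, W), float dtype
            x = self.proj(images)            # (batch, d_model, H/16, W/16)
            return x.flatten(2).transpose(1, 2)  # (batch, n_patches, d_model)

    lm = TextOnlyLM()
    lm(torch.randint(0, 50_000, (1, 12)))    # fine: a sequence of token IDs
    # lm(torch.randn(1, 3, 224, 224))        # fails: an image isn't token IDs

Swapping in the patch front end, and then training so the two embedding streams actually mean something together, is exactly the structural-plus-training change described above.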

Does being able to identify images of cats mean the model knows what a cat is? No, and we could have said that a decade ago, when deep learning for image classification was making its first advances. Does being able to describe a cat from video mean you know what a cat is? Probably not, but maybe we're getting closer. Does knowing how to pet a cat mean you know what a cat is? Perhaps not, if you need to be instructed to try to pet the cat.

But suppose that, 10 years from now, I have a domestic robot with a vision system, a motor control system, and the ability to plan actions and interact with a rich environment. I would say the following would be strong evidence of knowing what a cat is:

- it can not only identify or locate the cat, but can label the cat's parts despite the cat's inconsistent shape. It can consistently pick up the cat in a way that is sensitive to and considerate of the cat's anatomy (e.g. not by the head or by one paw)

- it can entertain the cat, e.g. with a laser pointer, and can infer whether the cat is engaged, playful, stressed, angry, etc

- it avoids placing fragile objects near high edges, because it can anticipate that the cat is likely to knock them down, even if the cat is not currently nearby

- it can anticipate the cat's behavior and adjust plans around it; e.g. avoid vacuuming the sunny spot by the window in the afternoon, when the cat is likely to be napping there (a toy sketch of this kind of plan adjustment follows the list)

- it can anticipate the cat's reactions to stimuli, such as loud noises or a can of food opening, and can incorporate these considerations into its plans
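As a purely hypothetical sketch of the plan-adjustment criterion above: nothing here is a real system, and every name (Task, CatModel, schedule) is invented for illustration. The point is just that the prediction and the replanning involve no language at all:

    from dataclasses import dataclass

    @dataclass
    class Task:
        name: str
        location: str
        hour: int  # 24h clock

    class CatModel:
        # Stands in for whatever learned model predicts the cat's location.
        def predicted_location(self, hour: int) -> str:
            # e.g. learned from observation: afternoon naps in the sunny spot
            return "sunny_spot_by_window" if 13 <= hour <= 16 else "elsewhere"

    def schedule(tasks, cat):
        # Defer any task that would collide with the cat's predicted location.
        ok, deferred = [], []
        for t in tasks:
            if cat.predicted_location(t.hour) == t.location:
                deferred.append(Task(t.name, t.location, (t.hour + 4) % 24))
            else:
                ok.append(t)
        return ok + deferred

    plan = schedule([Task("vacuum", "sunny_spot_by_window", 14)], CatModel())
    # -> vacuuming is deferred to 18:00, after the predicted nap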

Note, _none_ of the above have anything to do with language. If I add a bunch of NLP systems to the robot, to hear and understand commands or to describe its actions and perceptions, it may now know that a cat is called "cat" and how to talk about a cat, but these are distinct from knowing what a cat is.

Similarly,

- a human with serious aphasia may be unable to describe a cat, but clearly can still know what a cat is

- a dog can know what a cat is, in many important ways, despite having no language abilities


