I'm honestly surprised that they trained a StyleGAN. Recently, the Imagen architecture has been shown to be simpler in structure, easier to train, and faster to produce good results. Combined with the "Elucidating" paper by NVIDIA's Tero Karras, you can train a 256px Imagen* to tolerable quality within an hour on an RTX 3090.
Or, since they are comparing to Craiyon, why not just finetune Craiyon itself? It already exists, so you can take it off the shelf instead of retraining it from scratch. The cost of training it from scratch on everything (which is indeed quite large) is not relevant to someone who just wants to generate great food photos.
Here's a PyTorch implementation by the LAION people:
https://github.com/lucidrains/imagen-pytorch
And here are 2 images I sampled after training it for a few hours, roughly 2 hours for the base model plus 4 hours for the upscaler:
https://imgur.com/a/46EZsJo
* = Only the unconditional Imagen variant, i.e. the kind of generation they show off here. The variant with a T5 text embedding takes longer to train.
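
For anyone wondering what the "Elucidating" paper changes in practice, one of its pieces is the noise-level schedule used at sampling time. Below is a minimal plain-Python sketch of that schedule, not code from the linked repo. The function name `karras_sigmas` is my own, and the defaults (sigma_min=0.002, sigma_max=80, rho=7) are the values I recall from the paper, so treat them as assumptions. The paper's sampler also appends a final noise level of 0, which is left out here.

```python
def karras_sigmas(n_steps, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    """Return n_steps noise levels, decreasing from sigma_max to sigma_min.

    Hypothetical helper for illustration. The spacing interpolates linearly
    in sigma**(1/rho), which puts more steps at low noise levels.
    """
    if n_steps < 2:
        raise ValueError("n_steps must be at least 2")
    inv_rho = 1.0 / rho
    high = sigma_max ** inv_rho
    low = sigma_min ** inv_rho
    return [
        (high + i / (n_steps - 1) * (low - high)) ** rho
        for i in range(n_steps)
    ]


sigmas = karras_sigmas(32)
```

With rho=7 most of the steps land at small noise levels, which is roughly why you can get away with few sampling steps.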