
You mixed up implicit and explicit models. For anyone interested in the difference - implicit models such as GANs don't allow you to evaluate the probability density over datapoints - you can only sample from some surrogate model of the distribution learned by minimizing some 'distance' between the surrogate and the true empirical distribution.

'Explicit' models (I think this term is nonstandard) parameterize the density directly and fit the parameters via maximum likelihood. This allows one, in theory, to both directly evaluate the density and sample from the learned distribution. VAEs (which only give a lower bound on the density), autoregressive models, and normalizing flows all fall under this category.

Note that while it is theoretically possible for 'explicit' models to go in both directions (sample and evaluate), one direction may be much more efficient than the other for certain models. For autoregressive models, e.g., density evaluation parallelizes across positions while sampling is inherently sequential - the first two pages of [1] give a good explanation of why, and there's a toy sketch below.

[1]: https://arxiv.org/abs/2102.11495
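
To make the asymmetry concrete, here's a toy sketch (my own illustration, not from [1]): scoring a sequence under a causally-masked model is one parallel forward pass, while sampling needs one forward pass per symbol.

    import torch

    def log_prob(model, x):
        # Teacher forcing: one parallel forward pass scores every conditional
        # p(x_i | x_<i). Assumes `model` is causally masked and maps
        # (batch, seq) token ids to (batch, seq, vocab) next-token logits.
        logits = model(x)[:, :-1]          # positions 0..n-2 predict x[:, 1:]
        logp = torch.log_softmax(logits, dim=-1)
        return logp.gather(-1, x[:, 1:].unsqueeze(-1)).squeeze(-1).sum(-1)

    def sample(model, seq_len, batch=1):
        # Sampling is inherently sequential: symbol i can only be drawn after
        # symbols 0..i-1 exist, so this costs seq_len forward passes.
        x = torch.zeros(batch, 1, dtype=torch.long)   # start token (assumed)
        for _ in range(seq_len):
            probs = torch.softmax(model(x)[:, -1], dim=-1)
            x = torch.cat([x, torch.multinomial(probs, 1)], dim=1)
        return x[:, 1:]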


That's right, sorry for the mix-up :)


There are a couple of solutions which work empirically. As you mentioned, one is a dithering-like differentiable relaxation, where uniform noise is added to simulate quantization; another is to simply ignore the quantization operation when taking gradients, essentially treating it as an identity operation in the backward pass (the straight-through estimator).
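
Both are a couple of lines in PyTorch - a minimal sketch (training-time stand-ins for hard rounding, not any particular repo's exact code):

    import torch

    def quantize_noise(y):
        # Dithering-style relaxation: additive uniform noise in [-0.5, 0.5)
        # simulates rounding while keeping the operation differentiable.
        return y + torch.empty_like(y).uniform_(-0.5, 0.5)

    def quantize_ste(y):
        # Straight-through estimator: hard rounding in the forward pass,
        # identity in the backward pass (the rounding residual is detached).
        return y + (torch.round(y) - y).detach()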


But how do you optimize the lossless encoding of the quantized latent space? i.e. how do you tell the encoder to produce something that can be encoded efficiently, given that the encoding is a bunch of discrete steps?


Usually the lossless encoding is offloaded to a standard entropy coder (arithmetic coding, ANS, etc.), because these already approach the theoretical minimum rate given by the source entropy pretty closely - so there wouldn't be much point in building a fancy differentiable replacement.
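
For intuition, the benchmark these coders are judged against is the Shannon entropy of the source - e.g.:

    import numpy as np

    # Expected bits/symbol of an ideal code is the entropy H(p);
    # arithmetic coding and ANS get within a small overhead of this.
    p = np.array([0.5, 0.25, 0.125, 0.125])
    H = -(p * np.log2(p)).sum()
    print(H)   # 1.75 bits/symbol: no lossless code can beat this on average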


That makes sense - I don't think I stated my question very clearly: how do you control/optimize the entropy of the latent space?

i.e. what stops the network from laundering all of the information needed to reconstruct the image through a super-high-entropy latent space that is hard to code but allows it to reconstruct perfectly?

e: I guess I should just get up to date by reading some papers


The objective function used in these lossy neural compression schemes usually takes the form of a rate-distortion Lagrangian: the rate term captures the expected length of the message needed to transmit the compressed information, and the distortion term measures the reconstruction error. So the network can't cheat as in your example - a high-entropy latent space would incur a large loss through the rate term.
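
Schematically, the loss looks something like this (a hedged sketch - the names and the exact form of the rate estimate are illustrative, and real systems often add perceptual or adversarial terms):

    import torch
    import torch.nn.functional as F

    def rate_distortion_loss(x, x_hat, latent_likelihoods, lmbda=0.01):
        # Rate: expected message length in bits, i.e. the negative log2
        # likelihood of the quantized latents under the learned prior.
        rate = -torch.log2(latent_likelihoods).sum() / x.numel()
        # Distortion: reconstruction error.
        distortion = F.mse_loss(x_hat, x)
        # Lagrangian: lmbda trades off bitrate against reconstruction quality.
        return rate + lmbda * distortion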


Unfortunately you wouldn't have any guarantees on the output of any particular image though, just some reassurances about the expected behaviour over the training set.


> any guarantees on the output of any particular image though

You don't have any guarantees with this non-convex optimization.

I think most of these methods would work OK on out-of-domain data.


In terms of the decoded image, yes - it's very unlikely you would get something substantially different from the original image. But in terms of the bitrate it's not hard to find examples where the compressed bitrate can be several standard deviations above the average bitrate on the training set - see e.g. the last example here: https://github.com/Justin-Tan/high-fidelity-generative-compr...

(Lossy) neural compression methods may also synthesize small portions of an image to avoid the compression artefacts associated with standard image codecs, so - guarantees or none - they definitely should not be used in sensitive applications where small details can make a big difference, such as security imaging.


Those are very cool examples, and of considerably higher quality and bitrate than when I last tuned into this field half a year ago.

Unrelated, but I actually recognize your name from Github - I guess deep image compression is a pretty small space.


Hey, thanks for bringing the brightness issue to my attention - turns out I wasn't normalizing the output correctly - I just pushed a fix and the output images don't have the brightness change now.

As for the random spots, that's an artifact of the entropy coding algorithm. In principle the coding is lossless, but there is some distortion here because I'm using a custom vectorized version of an rANS encoder, and it's hard to encode overflow values in a vectorized fashion - I'm working on this though. If you can live with really slow decoding times (2-3 mins), you can disable vectorization to eliminate these small imperfections entirely.
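
For the curious, the core of an rANS coder is just a couple of integer operations per symbol - a minimal scalar sketch (no renormalization, and none of the vectorization/overflow handling mentioned above):

    M = 1 << 16   # total frequency scale (all symbol counts sum to M)

    def encode_step(state, cum, freq):
        # Push one symbol with cumulative count `cum` and count `freq`.
        return (state // freq) * M + cum + (state % freq)

    def decode_step(state, sym_lookup):
        # Pop the most recently pushed symbol (rANS is LIFO).
        slot = state % M
        sym, cum, freq = sym_lookup(slot)  # symbol whose interval contains slot
        return sym, freq * (state // M) + slot - cum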

As for the comparison to the official model, that's mainly because of compute constraints versus Google (this is just my weekend project). My model uses a smaller architecture and was trained for only 4e5 steps versus the 2e6 steps reported in the paper - even then it took 4+ days on AWS! The model is also trained on the Openimages dataset, which is presumably much smaller and noisier than the massive internal dataset Google used.


Just curious, is the change on the model side? I didn't see anything relevant in the notebook's revision history [1].

[1] https://colab.research.google.com/github/Justin-Tan/high-fid...


Thank you!


I pushed a workaround and provided extra instructions in the demo, so anyone experiencing errors should try that.


GDrive sometimes doesn't download the model checkpoints correctly, in which case the '# Setup model' step fails inside the 'prepare_model' call with:

    UnpicklingError: invalid load key, '<'.

Try rerunning the download cell if you experience this - the models downloaded should be around 1.5-2GB, so if the checkpoints are 100kB in size, the download's gone wrong.


Oh, thanks so much for your replies. I really appreciate you taking this seriously and trying to mitigate it. I'm not saying that because I think my error is somehow important - just from my own experience putting work on HN, and knowing what it feels like when people let you know something has gone wrong. Well done!

That was the error I got!


What was the error? I tried to make the demo notebook as robust as possible - you should be able to execute all cells in sequence once then execute cells out of sequence etc. without trouble, but it's hard to legislate for errors in Jupyter-like notebooks sometimes.


The models aren't downloading correctly. The content of the '*.pt' files says 'Google Drive - Quota exceeded'. I guess too many people have tried downloading the files from your drive.

One solution is to download the models manually and upload them to Colab under /content/checkpoint/.


In the '# Setup model' step, I get an error in the function call 'prepare_model':

    UnpicklingError: invalid load key, '<'.


I get the same error.


During training, you can set a target bitrate by heavily penalizing examples which exceed the target rate in the rate-distortion objective - so the model should learn to produce compressed representations at or below this bitrate. However, this constraint is only enforced on aggregate throughout the entire dataset - like many ML systems, there is no guarantee of behaviour for individual examples, either within or outside the training set. Despite this, the model appears to respect the target rate well, even on out-of-sample images.
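
Concretely, the trick (as in the HiFiC paper) is to schedule the Lagrange multiplier on the rate term - a rough sketch, with illustrative multiplier values:

    def rate_penalty(rate_bpp, target_bpp, lmbda_low=0.05, lmbda_high=2.0):
        # Rates above the target are penalized much more heavily than rates
        # below it, so the model learns to stay at or below the target bpp
        # on average over the training distribution.
        lmbda = lmbda_high if rate_bpp > target_bpp else lmbda_low
        return lmbda * rate_bpp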

One shortcoming is that this current model is non-adaptive, meaning the target rate is fixed - so to achieve different compression rates you have to train multiple models in different rate regimes. In the Colab demo there is the option to select between 3 models trained with target bits-per-pixel (bpp) rates of 0.14bpp, 0.30bpp, and 0.45bpp, respectively - higher rates correspond to higher-fidelity reconstructions, at the expense of a lower compression ratio. The default is the `HiFIC-med` model (this is what all the samples in the README were generated with), but the model trained at the highest bitrate should have less obvious imperfections.

There's also a component of the distortion that can be attributed to the entropy coding process rather than the model itself - currently the system clips values outside a certain probability range, which introduces artificial distortion - a fix is in the pipeline though.


Hi everyone, I've been working on an implementation of a model for learnable image compression, together with general support for neural image compression in PyTorch. You can try it out directly and compress your own images in Google Colab [1], or check out the source on Github [2].

This project is based on the paper "High-Fidelity Generative Image Compression" by Mentzer et al. [3] - one of the most interesting papers I've read this year! The model is capable of compressing images of arbitrary size and resolution to bitrates competitive with state-of-the-art compression methods, while maintaining very high perceptual quality. At a high level, the model jointly trains an autoencoding architecture together with a GAN-like component to encourage faithful reconstructions, combined with a hierarchical probability model to perform the entropy coding.

What's interesting is that the model avoids the compression artifacts associated with standard image codecs by subsampling high-frequency detail while preserving the global features of the image very well - for example, the model learns to sacrifice faithful reconstruction of fine detail such as faces and writing, and to spend those 'bits' elsewhere to keep the overall bitrate low.

The overall model is around 700MB, so transmitting the model alongside each message wouldn't be particularly feasible - the idea is that both the sender and receiver already have access to the model, and only transmit the compressed messages between themselves.

If you have any questions or notice something weird I'd be more than happy to address them.

---

[1]: Colab Demo: https://colab.research.google.com/github/Justin-Tan/high-fid...

[2]: Github: https://github.com/Justin-Tan/high-fidelity-generative-compr...

[3]: Original paper: https://hific.github.io/

[4]: Sample reconstructions: https://github.com/Justin-Tan/high-fidelity-generative-compr...


Would this work for a lossless / near-lossless approach by having a final pass store a delta between the reconstructed image and the original pixels, or do you think they diverge too much on a purely pixel-for-pixel basis for this to be valuable?


The model uses a GAN, which does not learn the exact PDF. So it's not lossless, but as you can see from the images it gets extremely visually accurate results.

From the README

> The generator is trained to achieve realistic and not exact reconstruction. It may synthesize certain portions of a given image to remove artifacts associated with lossy compression. Therefore, in theory images which are compressed and decoded may be arbitrarily different from the input. This precludes usage for sensitive applications. An important caveat from the authors is reproduced here:

> "Therefore, we emphasize that our method is not suitable for sensitive image contents, such as, e.g., storing medical images, or important documents."


> "Therefore, we emphasize that our method is not suitable for sensitive image contents, such as, e.g., storing medical images, or important documents."

As an example of this going wrong previously, Xerox once implemented scanner compression based on deduplicating visually similar parts of documents. Documents obviously contain tons of repeated symbols (digits), and the problem was that the scanner software deduplicated different digits with each other, producing documents with wrong numbers.

http://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres_...


> The model uses a GAN which does not learn the exact PDF. So not lossless, but as you can see from the images it gets extremely visually accurate results.

Yes, I understand this is a lossy compression method - what I was proposing is to have the compressor, as a final pass, take the predicted output image and subtract it from the original pixels. This gives you a delta between the predicted image and the original. You can then compress that delta losslessly and store it alongside the output of this model - if the predicted image is close enough to the original, you've significantly reduced the entropy of the delta, making it highly compressible.

This is how some domain-specific lossless compression algorithms work, e.g. DTS-HD Master Audio.
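
Something like this, roughly (a sketch of the idea, not part of the linked repo; zlib stands in for whatever lossless coder you'd actually use, and uint8 pixels are assumed):

    import numpy as np
    import zlib

    def encode_residual(original, reconstruction):
        # Delta between the lossy reconstruction and the original pixels;
        # cheap to store losslessly if the reconstruction is close.
        residual = original.astype(np.int16) - reconstruction.astype(np.int16)
        return zlib.compress(residual.tobytes())

    def decode_lossless(reconstruction, residual_bytes, shape):
        residual = np.frombuffer(zlib.decompress(residual_bytes), dtype=np.int16)
        pixels = reconstruction.astype(np.int16) + residual.reshape(shape)
        return pixels.astype(np.uint8)   # exact original recovered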


Yes, the model is not lossless as this would require learning the PDF in the original input space.

However, the model does learn a conditional probability distribution over a lower-dimensional representation of the original image - this is unavoidable, as entropy coding requires a distribution over discrete symbols. The GAN is almost auxiliary and not a central component of the model - in fact, you can get very good results without it, though it does seem to yield visually superior reconstructions.


I suspect if lossless reconstruction was your goal, you would want a different architecture. You would want the model to give you a conditional probability distribution for each pixel, conditioned on all previous pixels, so you could use a regular entropy coder to encode exact data.


As u/londons_explore mentioned, in theory you can train a model for lossless reconstruction - there are several papers about this, e.g. [1] is a good recent example. Lossless compressors need to learn a probability distribution over each input pixel, which amounts to maximum likelihood estimation in the original image space.

The model in the demo is a lossy compression method because it first projects the input to a lower-dimensional space and quantizes this representation to integer values, so that the result can ultimately be entropy coded. It uses the mean-scale hyperprior model introduced in [2] to estimate the probability distributions over the lower-dimensional space needed for entropy coding.

[1]: https://arxiv.org/abs/1811.12817

[2]: https://arxiv.org/abs/1802.01436
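
As a toy illustration of that pipeline (layer shapes and names are made up for the sketch, not the repo's actual API):

    import torch
    import torch.nn as nn

    encoder = nn.Conv2d(3, 8, kernel_size=4, stride=4)        # project down
    decoder = nn.ConvTranspose2d(8, 3, kernel_size=4, stride=4)

    x = torch.rand(1, 3, 64, 64)      # input image
    y = encoder(x)                    # lower-dimensional latent
    y_hat = torch.round(y)            # quantize to integers
    # A hyperprior network would predict a mean/scale for each element of
    # y_hat; those parameters define the discrete pmf handed to the entropy
    # coder (e.g. rANS) to produce the actual bitstream.
    x_hat = decoder(y_hat)            # lossy reconstruction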


> or notice something weird

> [4]: Sample reconstructions

The text in the reconstructed image in the third row looks different - the word 'phonomat' is quite garbled, and 'information' looks a bit funny.


Yeah, high-frequency detail such as text or the facial features of faraway figures tends to get washed out after compression. This is probably due to a couple of reasons: 1) the training dataset contains relatively few pictures featuring text, and 2) high-frequency detail is too expensive to encode, so the model learns to forgo it in favor of more 'important' features such as shapes, colors, etc.


Sorry, looks like both GDrive and Zenodo have exceeded the temporary download quotas, so the model checkpoints aren't available currently... If anyone has any solutions on how to publicly host model weights (~2 GB) please let me know!


I would recommend linking to a site where some example images can be easily compared (ideally with a viewer that offers toggling between them in place, to make the differences easy to see), instead of directly linking to a Colab that does heavy computations.

I assume most people just want to see the images; forcing them to recompute them is a waste of resources. Even just storing a version of the Colab with the results present would help a lot.


Torrent? If you create a torrent, I can seed it for a while.


I eventually shifted the models to S3, but thanks for the offer.


How do you handle the traffic? If every reader who clicks the link and runs the colab costs you 15 cents for traffic, that's got to get expensive unless you have some sort of "free traffic" deal or someone else is paying for it?


I think S3 permits up to 20k requests before they start billing IIRC.


I hope I'm wrong, but I believe that's the cost of request processing, and doesn't cover the traffic.

The free tier for traffic is "15GB of Data Transfer Out", after that it's 9 cents per GB. https://aws.amazon.com/s3/pricing/?nc1=h_ls (under "Data transfer"). Check your AWS bill!


Huh. So a DVD at 4.7GB would go from containing ~940 5MB photos to a 700MB model + 80,000 photos.


Does this have issues with out-of-domain images?


The model was trained on a fairly large (~1e6 images) dataset of diverse high-resolution natural images (the Openimages dataset) - so there was no particular training domain, and it generalizes well to images of arbitrary size/resolution/content. There is a larger set of samples, generated using the medium-bitrate model, which can be viewed in this Google Drive: https://drive.google.com/drive/folders/1lH1pTmekC1jL-gPi1fhE...

One interesting failure mode is that images dominated by high-frequency detail require a relatively large bitrate to store - see e.g. the last example in the Github README with the weird brickwork. Even though the model was trained to produce compressed representations with a soft constraint on the maximum bitrate, the filesize of the representation for this particular image is something like 60% above the nominal maximum.


CuPy shares a lot of the NumPy API - I've found it pretty interchangeable in most applications.
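
e.g. you can write functions that run on either backend - a small sketch using `cupy.get_array_module` to dispatch on the array type:

    import numpy as np
    import cupy as cp

    def normalize(x):
        # Returns the numpy or cupy module depending on where `x` lives.
        xp = cp.get_array_module(x)
        return (x - xp.mean(x)) / xp.std(x)

    normalize(np.random.rand(1000))   # runs on the CPU via NumPy
    normalize(cp.random.rand(1000))   # runs on the GPU via CuPy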


That's great, but I'm hoping that NumPy will incorporate something like this because that will better ensure that the APIs remain compatible in the future, and that they will get continued support.

(I can't convince my boss to use any library unless it has a reasonable guarantee of long-term support.)

