
I misremembered the PyPI mirror system (pre-Fastly) as being more similar to Debian's[0]; I didn't realize it had so many problems.

Debian managed to solve all of the concerns you listed. What makes PyPI unique?

[0]: https://www.debian.org/mirror/list



So there are a few things here:

Firstly, Debian's mirror network URLs allow a mirror operator to attack the base debian.org site if they rely on cookies on debian.org (they may not, I'm not sure). Specifically, the `ftp.<country>.debian.org` aliases cause this. On PyPI we did use cookies at the base URL, so this was a non-starter for us to keep.
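
To make the cookie concern concrete, here is a minimal sketch of the domain-matching rule browsers apply (simplified from RFC 6265; it ignores IP addresses and public-suffix checks). The helper name `domain_match` and the example hosts are illustrative assumptions, not anything Debian or PyPI ships.

```python
def domain_match(request_host: str, cookie_domain: str) -> bool:
    """Simplified RFC 6265 domain-match: exact host or a dot-separated subdomain."""
    host = request_host.lower().rstrip(".")
    domain = cookie_domain.lower().lstrip(".")
    return host == domain or host.endswith("." + domain)


# A cookie scoped to the base domain is also sent to any alias under it,
# including one that resolves to a server run by a third-party mirror operator.
for host in ("www.debian.org", "ftp.us.debian.org", "mirror.example.net"):
    print(host, domain_match(host, "debian.org"))
```

A mirror that lives on its own domain (the last host above) never receives the base site's cookies, which is why per-operator domains avoid the problem.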

The second thing is that, at a technical level, Debian and PyPI are generally similar in how mirrors are configured and hosted. Other than the aliases above, mirrors are expected to have their own domain, and users are expected to configure apt or pip to point to a specific domain. Debian does have a command that will attempt to do that configuration for you, to make it easier.

The third thing is that Debian's mirrors are as secure against a compromised mirror operator as the main repository is. This isn't the case on PyPI, where you're forced to trust the mirror operator to serve you the correct packages. There is vestigial support for a scheme to address this in the mirroring PEP, but nothing ever really implemented it except the very old version of PyPI (none of the clients, etc.). That scheme is also very insecure, so it doesn't really provide the level of security it was intended to.
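
As a rough sketch of the general idea behind not having to trust the mirror: the client compares what the mirror served against a digest taken from metadata it already trusts. This assumes the trusted digest arrives through some authenticated channel (as I understand it, Debian does this with signed index files; the signature check is not shown). The function name `verify_download` and the sample bytes are hypothetical.

```python
import hashlib
import hmac


def verify_download(payload: bytes, trusted_sha256_hex: str) -> bool:
    """Return True only if the payload hashes to the digest from trusted metadata."""
    actual = hashlib.sha256(payload).hexdigest()
    # Constant-time comparison; the digest itself must come from a trusted source.
    return hmac.compare_digest(actual, trusted_sha256_hex.lower())


# Hypothetical example: the digest is computed here only to stand in for
# a value that would normally be read from signed repository metadata.
original = b"example package contents"
trusted_digest = hashlib.sha256(original).hexdigest()

print(verify_download(original, trusted_digest))
print(verify_download(original + b" tampered by mirror", trusted_digest))
```

With a check like this, a compromised mirror can refuse to serve a file but cannot silently substitute a different one.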

The fourth thing is that a Debian mirror is easier to operate.

Packages on Debian don't live forever: as new versions are released, old versions get removed, and as OS releases reach end of life, entire chunks of packages get rotated out. On PyPI, however, we don't have the concept of an OS release, or any sort of phasing out of old packages. All packages are valid for as long as the author makes them available. This means that the storage space needed to run a PyPI mirror (currently ~30TB) is a lot more than the storage space for a Debian mirror (~4TB).

On top of that, the ways apt and pip function are inherently different. Apt has users occasionally download the metadata for the entire package set so that apt has a local copy, while pip asks the server for each package's metadata individually (it does some light caching, but not a lot). This means that to discover what packages are available, apt might make one request a day while pip might make 100 requests for every invocation of pip. Packages on apt also release a lot slower and less often than on pip, so many times people may not need to download more than a handful of packages, whereas people generally need to download a lot of packages from PyPI at a time.
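
A small sketch of why pip's request count grows with the number of packages: each project has its own index page, so discovering versions means one request per project. The name normalization follows PEP 503; the helper names and the sample project list are illustrative, and real pip adds caching and dependency resolution on top of this.

```python
import re


def normalize(name: str) -> str:
    """PEP 503 project-name normalization."""
    return re.sub(r"[-_.]+", "-", name).lower()


def index_urls(projects, base="https://pypi.org/simple"):
    """One index page per project: the request count scales with the project count."""
    return [f"{base}/{normalize(p)}/" for p in projects]


# Hypothetical dependency set; every entry costs at least one metadata request.
urls = index_urls(["Django", "zope.interface", "typing_extensions"])
for url in urls:
    print(url)
```

Apt, by contrast, fetches a shared index covering the whole archive, so the number of metadata requests does not depend on how many packages are being installed.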

I believe the Debian mirroring protocol is rsync-based, which is generally pretty reliable, while the PyPI mirroring protocol is a custom one that works but tends to get "stuck" every few months, requiring operators to notice and fix it themselves.

I suspect the difference in strength between the two mirror networks comes from some combination of the above, but I suspect the third and fourth things are the biggest differences, particularly since PyPI's CDN solved, in most users' minds, the problem that would cause them to want to host or use a mirror.



