Dear pip, we have to talk (part 3 of 3)

Majic


The final post in the series, this one covers security issues around Python packages, primarily in the context of using pip for package management and installation.

In order to fully understand the problems, we'll look into how typical GNU/Linux distributions handle package integrity and verification.

Most distributions you might be using come with some form of integrity protection for packages. This usually revolves around signing the metadata and/or the packages themselves with a private key, and using a distribution-provided set of public keys to verify the signatures.

This lets us confirm, every time we download a package, that it does indeed come from the same entity that shipped the distribution to us, even when communicating over unreliable channels (for example, Debian normally does not use HTTPS for repository URLs).

Of course, nothing is bullet-proof, and someone could compromise the distribution's signing keys. However, provided we have picked a distribution with proper security policies and procedures, the risk should be minimal. At the end of the day, we need to prevent casual spoofing of packages, which covers the majority of easy attacks.

Once we leave the embrace of the distribution for Python package management and start using pip, we open ourselves to a plethora of possible security issues caused by the decentralised nature of package contributions to PyPI.

Suddenly, instead of having a very specific, limited set of entities to trust, we need to put trust in a multitude of people and organisations, all with their own understanding of best security practices.

From a security perspective, when using pip to install packages from PyPI, we are faced with the following problems:

Anyone can open an account on PyPI. There is no form of identity verification.
There is no centralised QA process to ensure packages uploaded to PyPI behave correctly and non-maliciously, or that they are even fit for purpose. We are essentially at the mercy of the uploader (this is not belittling the contributions of such individuals, merely stating a fact).
Although PyPI serves packages over HTTPS, this only authenticates the server and protects the communication with it. Even the checksums associated with packages (which form part of the URL) merely ensure that the download is not corrupt, i.e. they only prevent installing incomplete archives.
If somebody compromises an uploader's account, they gain the ability to replace existing versions of a package and upload new ones that may contain malicious code. Since pip has no signature mechanism, there is no way to associate a package with its uploader for automated verification of package integrity, i.e. we can't guarantee that the file we are downloading is the same one the maintainer initially uploaded.

To make things even more interesting, the lack of digital signatures can be combined with a typosquatting attack, which was demonstrated and tested last year. This attack relies on users commonly mistyping package names (for example, running pip install djagno instead of pip install django).
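To illustrate how close a mistyped name can be to the real one, here is a minimal sketch of a pre-install check. It is not something pip does for us: the KNOWN_PACKAGES list, the check_package_name helper, and the 0.8 similarity cutoff are all assumptions made up for this example.

```python
import difflib

# Hypothetical allow-list of package names we actually intend to install.
KNOWN_PACKAGES = ["django", "pytz", "requests", "flask"]


def check_package_name(name, known=KNOWN_PACKAGES, cutoff=0.8):
    """Return (status, suggestion) for a requested package name."""
    normalised = name.strip().lower()
    if normalised in known:
        return "ok", None
    # get_close_matches ranks candidates by difflib's similarity ratio
    # and only returns those at or above the cutoff.
    matches = difflib.get_close_matches(normalised, known, n=1, cutoff=cutoff)
    if matches:
        # Close to a known name, but not equal: likely a typo.
        return "suspicious", matches[0]
    return "unknown", None


if __name__ == "__main__":
    for requested in ("django", "djagno", "numpy"):
        print(requested, check_package_name(requested))
```

A check like this only helps for names we already know about. It does nothing for a package we have never installed before.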

Another security issue (or maybe it could be called a curiosity) that deserves more explanation is how pip determines dependencies for the packages we are installing. To determine which packages need to be installed as prerequisites, pip needs to:

Download the package.
Unpack it.
Run the setup.py script to extract dependency information.
Repeat the above steps recursively for every matched dependency.

Suffice it to say, the weak point here is the execution of setup.py. This script is provided by the package maintainer, and there are no restrictions on what it can or should do. This means that a mere check to see what would get installed can result in malicious code being executed. This is a consequence of the design decision to embed metadata in the setup script, which makes it both powerful and dangerous.
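The problem can be shown with a deliberately simplified sketch. This is not how pip itself invokes setup.py (the real mechanism is more involved); the FAKE_SETUP_PY text, the stub setup() function, and the side_effects list are all hypothetical stand-ins. The point is only that reaching the dependency list requires running everything else in the script too.

```python
# A stand-in for the contents of a hostile setup.py (hypothetical example).
FAKE_SETUP_PY = '''
side_effects.append("arbitrary code ran before setup() was even called")

setup(
    name="innocent-looking-package",
    version="1.0",
    install_requires=["pytz"],
)
'''


def extract_requirements_by_running(source):
    """Run setup-script source with a stub setup() and report what happened."""
    captured = {}
    side_effects = []

    def fake_setup(**kwargs):
        # Record the metadata passed to setup().
        captured.update(kwargs)

    # Executing the script is the only way to reach the setup() call,
    # so every top-level statement in it runs with our privileges.
    namespace = {"setup": fake_setup, "side_effects": side_effects}
    exec(compile(source, "setup.py", "exec"), namespace)
    return captured.get("install_requires", []), side_effects


if __name__ == "__main__":
    requirements, effects = extract_requirements_by_running(FAKE_SETUP_PY)
    print("requirements:", requirements)
    print("side effects:", effects)
```

Here the side effect is a harmless list append. In a real setup.py that line could be anything the uploader wants.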

Is there anything we can do to alleviate these issues? Well, only to some extent.

As discussed before, no digital signatures are involved in uploading packages to PyPI. Therefore, we lack an (automated) mechanism to verify a package the first time we download it.

At best, we can download a specific package version once, assume that it is not compromised, and ensure that whenever we install this specific version, we use the same package archive.

This can be easily achieved using checksums. That is, we:

Download the package version for the first time.
Optionally perform some other form of verification that the package maintainer might provide by means outside of PyPI.
Calculate the checksum of the package archive.
On every subsequent installation of this specific package version, compare the checksum of the downloaded archive against the stored value.
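The last two steps can be sketched in a few lines of Python using the standard hashlib module. The function names sha256_of_file and verify_archive are made up for this example, and SHA-256 is an assumption here (any strong hash would do).

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as archive:
        # Reading in chunks keeps memory use flat for large archives.
        for chunk in iter(lambda: archive.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected_sha256):
    """Return True if the archive matches the checksum stored earlier."""
    return sha256_of_file(path) == expected_sha256.strip().lower()
```

We would call sha256_of_file once, right after the first download, store the result somewhere we trust, and call verify_archive before every later installation.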

Doing all this manually can prove quite tedious. Even with scripting, we would need to give up on a lot of functionality revolving around use of pip and PyPI.

Luckily, since version 8.0.0, pip has built-in support for specifying checksums when installing packages via requirements files. Preparing the checksums is still a manual procedure, however.

As an example, let's try to introduce checksum verification for Django and its dependencies.

We'll start off by creating a dedicated virtual environment (mkvirtualenv is provided by virtualenvwrapper) and directory:

mkvirtualenv piphash
mkdir ~/piphash/

Next, we'll both install and download the packages, which makes it easier to create the initial requirements file. As mentioned before, verifying the integrity of the first download remains an unsolved problem. We assume that at this point in time the PyPI service and maintainer accounts have not been compromised (which should be considered a very strong, or even naive, assumption).

cd ~/piphash/
pip install django
pip download django

We create the requirements file using pip freeze:

pip freeze > requirements.txt

We calculate checksums of all downloaded files (ignoring the hash for requirements.txt itself):

pip hash *

Now we update the requirements file to include the hashes, making sure each one is matched to the correct package. The resulting file should look similar to this (your versions and packages will probably differ):

Django==1.11.1 --hash=sha256:bb3109a31cfa016e5f234223665f80fc06107f2169afb9f6dc8828295db73547
pytz==2017.2 --hash=sha256:d1d6729c85acea5423671382868627129432fba9a89ecbb248d8d1c7a9f01c67
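Matching hashes to packages by hand is error-prone, so here is a rough sketch of how the lines above could be generated instead. The pinned_requirements helper is hypothetical, and we have to tell it ourselves which archive belongs to which name and version; it does not try to guess that from file names.

```python
import hashlib


def pinned_requirements(packages):
    """Build requirements-file text from (name, version, archive_path) tuples."""
    lines = []
    for name, version, archive_path in packages:
        with open(archive_path, "rb") as archive:
            digest = hashlib.sha256(archive.read()).hexdigest()
        # Same shape as the hand-written lines: name==version --hash=sha256:...
        lines.append("{}=={} --hash=sha256:{}".format(name, version, digest))
    return "\n".join(lines) + "\n"
```

For the example above we would pass something like [("Django", "1.11.1", "Django-1.11.1-py2.py3-none-any.whl")] and write the returned text to requirements.txt.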

At this point we can try recreating the virtual environment, and reinstalling packages via the requirements file:

deactivate
rmvirtualenv piphash
mkvirtualenv piphash
pip install -r ~/piphash/requirements.txt

To confirm that the checksums are actually verified, we can invalidate one or both of them in requirements.txt by replacing the last character of the hash with something else. Following our example from above (replacing just the last character with 1):

Django==1.11.1 --hash=sha256:bb3109a31cfa016e5f234223665f80fc06107f2169afb9f6dc8828295db73541
pytz==2017.2 --hash=sha256:d1d6729c85acea5423671382868627129432fba9a89ecbb248d8d1c7a9f01c61

We force reinstallation using the same requirements file:

pip install --ignore-installed -r ~/piphash/requirements.txt

If all went "well", we should see an error message at this point similar to:

Collecting Django==1.11.1 (from -r /home/user/piphash/requirements.txt (line 1))
  Using cached Django-1.11.1-py2.py3-none-any.whl
Collecting pytz==2017.2 (from -r /home/user/piphash/requirements.txt (line 2))
  Using cached pytz-2017.2-py2.py3-none-any.whl
THESE PACKAGES DO NOT MATCH THE HASHES FROM THE REQUIREMENTS FILE. If you have updated the package versions, please update the hashes. Otherwise, examine the package contents carefully; someone may have tampered with them.
    Django==1.11.1 from https://pypi.python.org/packages/2b/2c/019d6d5f7ed2889082ed96f849bf462c57265087a3a568a19b0d4c53bc55/Django-1.11.1-py2.py3-none-any.whl#md5=a300c34f63f1b5b1a57447b89fecbb85 (from -r /home/user/piphash/requirements.txt (line 1)):
        Expected sha256 bb3109a31cfa016e5f234223665f80fc06107f2169afb9f6dc8828295db73541
             Got        bb3109a31cfa016e5f234223665f80fc06107f2169afb9f6dc8828295db73547

    pytz==2017.2 from https://pypi.python.org/packages/55/62/e7cd0e15b76062d298413f14bb4ec3cd8568a22d274427f9c3c7286969f4/pytz-2017.2-py2.py3-none-any.whl#md5=84bcac3df0fdd58a9570a8404f3d933c (from -r /home/user/piphash/requirements.txt (line 2)):
        Expected sha256 d1d6729c85acea5423671382868627129432fba9a89ecbb248d8d1c7a9f01c61
             Got        d1d6729c85acea5423671382868627129432fba9a89ecbb248d8d1c7a9f01c67

As mentioned in the previous installment, pip-tools excels at virtual environment package management. As it turns out, pip-compile can also be used to generate requirements files with checksums included. All we need to do is invoke it with the --generate-hashes option. For example, if we have a file called requirements.in containing a single line, django, we can simply run:

pip-compile --generate-hashes requirements.in

At the time of this writing, there is a bug in pip-tools (version 1.9.0) that prevents using --generate-hashes with Python 2.7.x.

This would give us requirements.txt that looks similar to:

#
# This file is autogenerated by pip-compile
# To update, run:
#
#    pip-compile --generate-hashes --output-file requirements.txt requirements.in
#
django==1.11.1 \
    --hash=sha256:bb3109a31cfa016e5f234223665f80fc06107f2169afb9f6dc8828295db73547 \
    --hash=sha256:bbcefdf822eeef2cd04718ebcc24dd2ecf47407258cfcde2b4f95df57ce33a8c
pytz==2017.2 \
    --hash=sha256:d1d6729c85acea5423671382868627129432fba9a89ecbb248d8d1c7a9f01c67 \
    --hash=sha256:f5c056e8f62d45ba8215e5cb8f50dfccb198b4b9fbea8500674f3443e4689589 \
    # via django
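Note that each requirement now carries more than one hash, one per archive published for that version; pip accepts an archive if it matches any of them. To make the structure of the file explicit, here is a small sketch of a reader for it. It is a simplified stand-in rather than pip's own parser, and the parse_hashed_requirements name is made up for this example.

```python
def parse_hashed_requirements(text):
    """Map each pinned requirement to the list of hashes that follow it."""
    # A trailing backslash continues the logical line onto the next one.
    joined = text.replace("\\\n", " ")
    result = {}
    for line in joined.splitlines():
        # Drop comments such as "# via django" and skip empty lines.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        parts = line.split()
        requirement = parts[0]
        hashes = [
            part[len("--hash="):]
            for part in parts[1:]
            if part.startswith("--hash=")
        ]
        result[requirement] = hashes
    return result
```

Feeding it the file above should give a dictionary with two keys, django==1.11.1 and pytz==2017.2, each mapped to its two sha256 values.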

Although an improvement, the hashing mechanism described above does not address all the issues related to package integrity verification. If you decide to go down that route, keep the following in mind:

When initially downloading a new or updated package, there is still no way to verify its integrity, i.e. the checksums only allow you to verify subsequent installations of packages.
To resolve dependencies, you still effectively need to run setup.py, i.e. even upgrade checks could execute malicious code.

With all of this being said, we can only lament the fact that security of package installation via pip is virtually non-existent. Introducing more security into the whole package distribution ecosystem would require big changes in how packages are handled, and it would require cooperation from all the contributors who upload their packages to PyPI.

For now, we'll all probably continue merrily pip-installing packages on our servers and development machines, ignoring all the dangers out there. At least until the next big crack...

Happy hacking, and try not to let all of this disturb your sleep :)

April 2017

October 2017