Analyzing PyPI Download Statistics: from Zero to Half a Million Downloads in 9 Months

Remember the OpenCV packaging project for Python that I wrote about 9 months ago? It turns out it got a bit more popular than I expected. The popularity is reflected directly in the issue and contributor counts, which in turn has provided a great opportunity to learn how to govern and maintain an open source project. However, to make decisions about the project, you need data.

At the time of this writing, the packages have been downloaded around 450 000 times in total. On average, there have been about 1500 downloads per day. Considering that OpenCV is a scientific library used by academia and developers who need computer vision, the demand for pre-built binary packages is not surprising. In fact, it's a bit odd that the developers of a library like OpenCV have not provided pre-built packages for one of the most popular scientific computing platforms out there. The reason for this is most likely a lack of resources, which is understandable, especially in the open source world.

There are still some important features missing from the binaries on GNU/Linux and macOS: video I/O (FFmpeg) and window creation (cv2.imshow()) support. The Windows packages support both. Keeping this in mind, the download rates could be a lot higher. This can be analyzed a bit further: how do the users behave in reality? Some of you might already know that the download stats in PyPI are broken. However, there's another way to get comprehensive statistical data for your package: Google BigQuery.

The support for BigQuery was announced last year and the existence of the data is not yet very well known. There are a few other blog posts about the dataset, for example "Data Driven Decisions Using PyPI Download Statistics" and "Analyzing Plotly’s Python package downloads". The data is public and can be queried with a language similar to SQL.

Let's start with a query that extracts the distribution of package downloads over different operating systems. I will be using two packages here, since I recently added an alternative entry called opencv-contrib-python to PyPI. It also includes the OpenCV contrib modules.

SELECT
  details.system.name AS distro,
  COUNT(*) AS downloads
FROM
  TABLE_DATE_RANGE(
    [the-psf:pypi.downloads],
    DATE_ADD(CURRENT_TIMESTAMP(), -300, "day"),
    DATE_ADD(CURRENT_TIMESTAMP(), -1, "day")
  )
WHERE
  file.project = 'opencv-python' OR file.project = 'opencv-contrib-python'
GROUP BY
  distro
ORDER BY
  downloads DESC
LIMIT 100

The query counts the downloads across different operating systems over the past 300 days. The output looks like this:

OS         Downloads
Linux         185021
Unknown       149425
Darwin         60119
Windows        55214

And a plotted version:

[Figure: opencv-python downloads by operating system]

It was a bit surprising that most of the downloads are for Linux, even though Windows has the most feature-complete packages. I guess I'll have to focus more on the Linux ones. There's also the "Unknown" OS, which could be any of the known operating systems, so in reality the downloads might be distributed a bit differently.
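To put the counts in proportion, here is a minimal Python sketch that turns the figures from the results table into percentage shares. The `shares` helper is just an illustration, and dropping the "Unknown" bucket for the second calculation is an assumption made here, not something the data justifies.

```python
# Download counts copied from the results table above.
downloads_by_os = {
    "Linux": 185021,
    "Unknown": 149425,
    "Darwin": 60119,
    "Windows": 55214,
}


def shares(counts):
    """Return each key's share of the total as a percentage."""
    total = sum(counts.values())
    return {name: 100 * count / total for name, count in counts.items()}


# Shares over all downloads, including the "Unknown" bucket.
all_shares = shares(downloads_by_os)

# Shares over downloads with an identified OS only. Assumption for
# illustration: "Unknown" is simply left out, not redistributed.
known = {name: count for name, count in downloads_by_os.items() if name != "Unknown"}
known_shares = shares(known)

for name, share in all_shares.items():
    print(f"{name:<8} {share:5.1f}%")
```

With these figures Linux accounts for roughly 41% of all downloads, or roughly 62% of the downloads with an identified OS.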

The next query looks like this (source):

SELECT
  REGEXP_EXTRACT(details.python, r"^([^\.]+\.[^\.]+)") AS python_version,
  COUNT(*) AS downloads
FROM
  TABLE_DATE_RANGE(
    [the-psf:pypi.downloads],
    DATE_ADD(CURRENT_TIMESTAMP(), -300, "day"),
    DATE_ADD(CURRENT_TIMESTAMP(), -1, "day")
  )
WHERE
  file.project = 'opencv-python' OR file.project = 'opencv-contrib-python'
GROUP BY
  python_version
ORDER BY
  downloads DESC
LIMIT 100

The query works in the same way as the first one but lists the popularity of different Python versions. This query is interesting because Python 2.7 will "soon" (in 2020) be legacy, yet many Python programmers are still using it. The data from this query is plotted below.
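The regular expression in the query keeps only the "major.minor" part of the reported Python version. A small Python sketch of the same pattern, assuming the version arrives as a string such as "2.7.13" (the `major_minor` helper and the sample values are made up for illustration):

```python
import re

# Same pattern as in the REGEXP_EXTRACT call of the query above.
MAJOR_MINOR = re.compile(r"^([^\.]+\.[^\.]+)")


def major_minor(version):
    """Return the 'major.minor' prefix of a version string, or None."""
    if version is None:
        return None
    match = MAJOR_MINOR.match(version)
    return match.group(1) if match else None


print(major_minor("2.7.13"))  # 2.7
print(major_minor("3.6.1"))   # 3.6
print(major_minor(None))      # None
```

Returning None for a missing or non-matching value mirrors how REGEXP_EXTRACT yields NULL when nothing matches, which is presumably where the unknown group in the results comes from.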

[Figure: downloads by Python version]

Python 2.7 has the biggest share, nothing new there. The unknown factor is present in this query too. Interestingly, Python 3.x seems to be rather popular, and programmers have clearly already moved away from version 3.4. I used this query in January to check whether I could drop Python 3.3 support. The decision was easy: the download count for 3.3 was below 100.

The last query gives the actual download counts grouped by date. It was run separately for the two packages.

SELECT
  STRFTIME_UTC_USEC(timestamp, "%Y-%m-%d") AS yyyymmdd,
  COUNT(*) AS total_downloads
FROM
  TABLE_DATE_RANGE(
    [the-psf:pypi.downloads],
    DATE_ADD(CURRENT_TIMESTAMP(), -300, "day"),
    DATE_ADD(CURRENT_TIMESTAMP(), -1, "day")
  )
WHERE
  file.project = 'opencv-python'
GROUP BY
  yyyymmdd
ORDER BY
  yyyymmdd DESC

The data from this query was used to create the two plots below (open the images in new tabs to see bigger versions). All the new releases can be clearly seen as spikes in the first plot. Related spikes have been annotated. I have no idea what caused the last two spikes, since there have been no new releases and I haven't advertised the packages anywhere. Otherwise the download count has been steadily increasing every day, which can also be seen from the shape of the cumulative line. On the other hand, the opencv-contrib-python package is not yet very popular.

[Figure: opencv-python daily downloads]

[Figure: cumulative downloads]
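The cumulative line is simply a running total of the daily counts. A minimal sketch of that step in Python, assuming the query results have been exported as (date, count) pairs; the rows below are made-up placeholder values, not actual download numbers:

```python
from itertools import accumulate

# Hypothetical rows in the order the query returns them (newest first).
# These numbers are invented for illustration only.
rows = [
    ("2017-06-03", 1500),
    ("2017-06-02", 1550),
    ("2017-06-01", 1400),
]

# Sort oldest first; ISO-formatted date strings sort chronologically.
daily = sorted(rows)

dates = [day for day, _ in daily]
counts = [count for _, count in daily]

# Running total of downloads up to and including each day.
cumulative = list(accumulate(counts))

for day, count, total in zip(dates, counts, cumulative):
    print(day, count, total)
```

Because the query orders the dates in descending order, the rows need to be reversed or sorted before accumulating; otherwise the running total would be built backwards in time.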

Statistics can provide unique insight into how to develop your open source software project further. For me it's now clear that I have to implement better support for GNU/Linux because the majority of the users are there. As for PyPI and Google BigQuery, it would be great to see generic Python/PyPI statistics somewhere with daily updates. It would be particularly interesting to see Python 3 adoption rate graphs :)