State of version control of Python modules: for 32.95% no link to a VCS was found

According to the PyDigger stats I've been collecting for a while, 32.95% of the modules on PyPI don't have a reference to their GitHub repository. That's 22,575 modules (out of the 68,514 I've indexed so far).

The about page now explains that the version control information is extracted from the JSON file PyPI provides for each module, and that PyDigger currently recognizes only GitHub.
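To make the extraction concrete, here is a minimal sketch of reading the "home_page" field from the "info" section of a PyPI JSON document and accepting it only if it looks like a GitHub repository link. The function name, the sample data and the regex are assumptions for illustration; this is not PyDigger's actual code.

```python
import re

# Assumed pattern: a direct github.com link with both a user and a repository part.
GITHUB_RE = re.compile(r"^https?://github\.com/[^/\s]+/[^/\s]+")


def github_from_home_page(package_json):
    """Return home_page if it looks like a GitHub repository link, else None."""
    info = package_json.get("info") or {}
    home_page = info.get("home_page") or ""  # the field may be missing or null
    return home_page if GITHUB_RE.match(home_page) else None


# Hypothetical, heavily abbreviated PyPI JSON document.
sample = {
    "info": {
        "name": "example",
        "home_page": "https://github.com/example-user/example-repo",
    }
}
print(github_from_home_page(sample))
```

With a check like this, a module whose "home_page" is empty, or points anywhere else, ends up in the "no link to VCS" bucket.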

The stats page links to a list of Python projects without a GitHub link. They are ordered by the date they were released to PyPI, most recent first.

I went over a few of them to see how PyDigger could be improved, or how the data provided by the module authors could be improved.

I am picking a few of the most recent modules, not to blame their authors, but to think aloud about what could be improved.

The 20 most recently released modules without a GitHub link

There is a link to GitHub, but it leads to the GitHub user and not to the project itself: netboy has "home_page" linking to https://github.com/pingf. The same goes for catenae and xspkg_songzviewer.

pyqlearning: "home_page" links to Reinforcement-Learning, a subdirectory in the GitHub repository. Is that because the repository contains more than one PyPI module?
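These cases differ only in how deep the link goes into GitHub, so they could be told apart by counting path segments. A sketch under that assumption; the function name and labels are made up, and the repository URLs in the comment are hypothetical:

```python
from urllib.parse import urlparse


def classify_github_url(url):
    """Classify a URL by how many path segments follow the GitHub host."""
    parsed = urlparse(url)
    if parsed.hostname not in ("github.com", "www.github.com"):
        return "not-github"
    parts = [p for p in parsed.path.split("/") if p]
    if len(parts) == 0:
        return "site"
    if len(parts) == 1:
        return "user"  # e.g. https://github.com/pingf
    if len(parts) == 2:
        return "repository"  # e.g. https://github.com/some-user/some-repo
    return "inside-repository"  # e.g. .../some-repo/tree/master/some-subdirectory


print(classify_github_url("https://github.com/pingf"))
```

A link classified as "inside-repository" could still be trimmed to its first two segments to get the repository, while a "user" link cannot be resolved to a single project.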

justpith has a link in "home_page" to http://packages.python.org/justpith, which leads to https://pythonhosted.org/justpith, which in turn returns 404.

No link to any VCS: nesterfeng, jumper, Symbolic, FLUIDAsserts, aliyun-python-sdk-cloudphoto, stylelens-product, pylinda, and sqreen (proprietary license).

jsnapy links to www.github.com, which redirects to the same page on github.com. The regex in PyDigger only accepts the link if it points directly to github.com.
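A pattern that tolerates an optional "www." prefix would catch this case. The two patterns below are assumptions for illustration (PyDigger's actual regex is not shown here), and the URL is hypothetical:

```python
import re

# Assumed strict pattern: the host must be exactly github.com.
STRICT = re.compile(r"^https?://github\.com/")

# Relaxed pattern: optional "www." prefix, case-insensitive host.
RELAXED = re.compile(r"^https?://(?:www\.)?github\.com/", re.IGNORECASE)

url = "http://www.github.com/example-user/example-repo"
print(bool(STRICT.match(url)))
print(bool(RELAXED.match(url)))
```

The strict pattern rejects the URL while the relaxed one accepts it, without also accepting unrelated hosts that merely contain "github.com" somewhere in the path.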

miceshare: "home_page" leads to GitLab (https://gitlab.com/SmartChinaStock/stock_decision_center), which first required me to log in and then, once I was logged in, returned 404.

Clearly PyDigger should be able to recognize GitLab links as well, but that would still not help in the case of this module.

coloredlogs: the link to the GitHub repository is mentioned in the description. I think it should be in some designated field. The same goes for uplink.

mne has the link to its GitHub repo in the "download_url" field, which is currently not checked by PyDigger.

fastFM: "home_page" leads to the GitHub Pages site of the project and not to the repository.

modelhub: "home_page" leads to a Git repo on some other site.
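Some of the cases above (the link sitting in "download_url" or only in the description) could be handled by checking more than one field. A rough sketch, assuming the fields are tried in a fixed order and the first GitHub repository link wins; the function name, the order and the regex are all assumptions, not how PyDigger works today:

```python
import re

# Assumed pattern: user and repository segments on github.com, optional "www.".
GITHUB_REPO_RE = re.compile(
    r"https?://(?:www\.)?github\.com/[\w.-]+/[\w.-]+", re.IGNORECASE
)


def find_repo_link(info):
    """Return (field, url) for the first field containing a GitHub repo link."""
    for field in ("home_page", "download_url", "description"):
        match = GITHUB_REPO_RE.search(info.get(field) or "")
        if match:
            # Drop a trailing period picked up from the end of a sentence.
            return field, match.group(0).rstrip(".")
    return None


# Hypothetical "info" section of a PyPI JSON document.
info = {
    "home_page": "https://example.readthedocs.io",
    "download_url": None,
    "description": "Source: https://github.com/example-user/example-repo.",
}
print(find_repo_link(info))
```

Scanning the description is the riskiest part, because a description can mention other people's repositories too, which is one more argument for a designated field.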

Digging in the database

I searched the home_page field of the data we have for two other popular version control hosting services:


  db.packages.find({ "home_page" : /gitlab/i }).count();
  734

  db.packages.find({ "home_page" : /bitbucket/ }, { home_page: 1 }).count()
  1611

So once those are indexed properly, the share of modules with a link to their VCS should improve by about 3.4 percentage points (at most 2,345 of the 68,514 modules).
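The estimate can be checked with the counts quoted in this post. It is an upper bound, because the queries match any home_page that merely contains the word, not only valid repository links:

```python
total = 68514           # modules indexed so far
without_github = 22575  # modules with no GitHub link found
gitlab = 734            # home_page matches /gitlab/i
bitbucket = 1611        # home_page matches /bitbucket/

missing_now = without_github / total * 100
gain = (gitlab + bitbucket) / total * 100

print(f"without GitHub link now: {missing_now:.2f}%")
print(f"possible improvement:    {gain:.2f} percentage points")
```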

Conclusion

There is no designated field in the JSON file provided by PyPI for "Version Control System". We cannot assume that "home_page" will contain a link to the VCS. We can start recognizing other well-known VCS hosting services as well (e.g. GitLab, Bitbucket), but some projects use their own Git repository hosting. Some of these might be public; others might be private.
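Recognizing the well-known hosts could be as simple as a lookup by host name. A minimal sketch; the table and the function name are assumptions, and the Bitbucket and self-hosted URLs in the usage are hypothetical:

```python
from urllib.parse import urlparse

# Assumed table of well-known public hosting services.
KNOWN_HOSTS = {
    "github.com": "GitHub",
    "gitlab.com": "GitLab",
    "bitbucket.org": "Bitbucket",
}


def vcs_host(url):
    """Return the name of the hosting service, or None if it is not a known one."""
    hostname = (urlparse(url).hostname or "").lower()
    if hostname.startswith("www."):
        hostname = hostname[4:]
    return KNOWN_HOSTS.get(hostname)


print(vcs_host("https://gitlab.com/SmartChinaStock/stock_decision_center"))
print(vcs_host("https://git.example.org/some-project.git"))
```

Self-hosted repositories fall through to None here, and no such table can tell whether a repository is public or private.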

I think the best solution would be a designated field for the VCS.