Remove badges and other content from GitHub READMEs before publishing a Python package to PyPI

Do you want your PyPI readme to be the project readme, but without badges?

I recently listened to an episode about Tools for README.md Creation and Maintenance on the Talk Python to Me podcast, where Michael Kennedy and Ned Batchelder talked about hatch-fancy-pypi-readme.

This plugin for the Hatch Python project manager converts GitHub project README.md files into "fancy" PyPI project landing pages, removing dynamic and irrelevant content before including it in the Python package. This allows you to include the GitHub README.md as the readme in your pyproject.toml (or long_description if you're using setup.cfg). In their own words:

"Do you want your PyPI readme to be the project readme, but without badges, followed by the license file, and the changelog section for only the last release?"

Sure, I want that, in particular the first part. But I didn't want to switch to Hatch just for this feature.

Since I'm using GitHub Actions for my project, I added a README preprocessing step to my release workflow instead. Google didn't surface anything when I initially searched for a solution like this, and I had to spend some time fiddling with sed and awk to arrive at a clean one, so I'm documenting it here.

Marking content to exclude for PyPI

Since GitHub-flavored Markdown allows HTML comments, which are not rendered, we can wrap the parts of the content we want to exclude in a pair of comment markers:

# Title of the README

This content will go into the built package.

<!-- EXCLUDE -->
This content will not be added to the package.
<!-- /EXCLUDE -->
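
The removal step described below works line by line on an awk range, and a range whose end pattern never matches runs to the end of the input. An opening marker without a closing one would therefore silently drop the rest of the README. As a minimal sketch of a sanity check (check_markers is a hypothetical helper name, not part of any tool), one could compare the number of opening and closing markers. It only counts markers and does not verify their order:

```shell
#!/bin/sh
# check_markers is a hypothetical helper, not part of any tool.
# It succeeds only if the file has as many opening as closing markers.
check_markers() {
  # grep -c prints the number of matching lines; it exits non-zero when
  # that number is 0, hence the "|| true" for shells running with "set -e".
  opens=$(grep -cF '<!-- EXCLUDE -->' "$1" || true)
  closes=$(grep -cF '<!-- /EXCLUDE -->' "$1" || true)
  [ "$opens" -eq "$closes" ]
}
```

In a workflow step, this could be called as `check_markers README.md || exit 1` before the README is rewritten.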

Removing excluded content

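To process the README.md before building the package, we'll use awk. This powerful Unix tool is available on virtually every Linux VM out there. An awk program consists of rules of the form condition { action } that are applied to every line of the input; a rule without a condition runs for every line. A common idiom is condition { action if true } { action if false }: if the first action ends with next, which skips straight to the next input line, the second action only runs for lines that did not match. The condition can be a regular expression.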

We can use a range expression to remove the content wrapped with EXCLUDE "tags":

awk '/<!-- EXCLUDE -->/,/<!-- \/EXCLUDE -->/ { next } { print }'

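Read this as '/start pattern/,/end pattern/ { match } { no match }'. The range includes the lines matching the start and end patterns, so the two marker lines are removed as well.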
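
To see the range expression in action without touching a real README, here is a small sketch that feeds a made-up sample through it. The function name strip_excluded and the sample lines are assumptions for illustration only:

```shell
#!/bin/sh
# strip_excluded is a hypothetical wrapper around the one-liner above.
# It reads the files given as arguments, or stdin if there are none.
strip_excluded() {
  awk '/<!-- EXCLUDE -->/,/<!-- \/EXCLUDE -->/ { next } { print }' "$@"
}

# Made-up sample README: a title, an excluded badge, and a paragraph.
printf '%s\n' \
  '# Title of the README' \
  '' \
  '<!-- EXCLUDE -->' \
  '![build status](badge.svg)' \
  '<!-- /EXCLUDE -->' \
  '' \
  'This content will go into the built package.' |
  strip_excluded
```

The badge and both marker lines disappear, while the blank lines that surrounded the excluded block remain and now sit next to each other.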

As this might lead to multiple consecutive empty lines, which is usually undesirable, we can clean this up with awk as well:

awk '{ /^\s*$/ ?b++:b=0; if (b<=1) print }'

Here, there is no condition, so the action is performed for every line of the input. It checks whether the line is empty or contains only whitespace. If that's true, a variable b is incremented by one. Otherwise, b is reset to 0. A line is only printed if b <= 1, i.e., it is not a blank line following a blank line. Note that \s is a GNU awk extension; with other awk implementations, [[:space:]] or [ \t] can be used instead.
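
The same logic can be spelled out with an explicit if/else, which may be easier to read than the ternary. This is only a sketch: the function name squeeze_blank_lines and the variable name blank are made up, and it uses [ \t] instead of \s on the assumption that spaces and tabs are the only whitespace that matters here:

```shell
#!/bin/sh
# squeeze_blank_lines is a hypothetical name for the blank-line cleanup.
squeeze_blank_lines() {
  awk '{
    # Count consecutive blank lines; any other line resets the counter.
    if ($0 ~ /^[ \t]*$/) { blank++ } else { blank = 0 }
    # Print everything except the second and later blank lines of a run.
    if (blank <= 1) { print }
  }' "$@"
}

# Made-up input with a run of three blank lines between two words.
printf 'first\n\n\n\nsecond\n' | squeeze_blank_lines
```

Only the first blank line of each run survives, so the three blank lines between "first" and "second" collapse into one.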

Adding everything to the GitHub Actions release workflow

The README preprocessing step goes between the step that checks out the source code and the step that calls python -m build:

- name: Remove irrelevant parts from README
  run: |
    # Delete everything between EXCLUDE-"tags"
    awk '/<!-- EXCLUDE -->/,/<!-- \/EXCLUDE -->/ { next } { print }' README.md > tmp_readme
    # Allow at most single blank lines
    awk '{ /^\s*$/ ?b++:b=0; if (b<=1) print }' tmp_readme > README.md
    # Clean up the temporary file before building
    rm tmp_readme
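
To preview the result locally before pushing, the two awk calls can also be chained with a pipe so that no temporary file is needed and README.md stays untouched. This is a sketch under a few assumptions: preprocess_readme is a hypothetical name, and the blank-line check uses [ \t] instead of \s so that it does not depend on GNU awk:

```shell
#!/bin/sh
# preprocess_readme is a hypothetical name. It prints the cleaned-up
# version of the given file to stdout and leaves the file untouched.
preprocess_readme() {
  awk '/<!-- EXCLUDE -->/,/<!-- \/EXCLUDE -->/ { next } { print }' "$1" |
    awk '{
      if ($0 ~ /^[ \t]*$/) { blank++ } else { blank = 0 }
      if (blank <= 1) { print }
    }'
}

# Preview the result only if a README.md exists in the current directory.
if [ -f README.md ]; then
  preprocess_readme README.md
fi
```

Piping the output into a pager or a Markdown previewer gives a rough idea of what the PyPI landing page will contain.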

And that's it.