Reducing the size of the botocore package

botocore is a fairly large Python package. Up to the 1.31.x releases, it came in at close to 90 MB. With version 1.32.1, released in November 2023, the AWS developers started to compress files, and the package shrank to around 25 MB.

Specifically, they compress the service definitions in the package's data folder, which are JSON files. botocore carries around the full API specifications of all AWS services, whether you need them or not.

Because these API definitions still make up the bulk of the botocore package, an obvious way to reduce the disk space consumed is to remove unused files from the data folder.
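To get a feel for how much space each service takes up before removing anything, a small helper along the following lines can be used. This is only a sketch: the function names service_sizes and print_largest are made up for illustration, and it assumes the data folder contains one subfolder per service, as described above.

```python
import pathlib


def service_sizes(data_path: pathlib.Path) -> dict[str, int]:
    """Return the total size in bytes of each service folder under data_path."""
    sizes = {}
    for folder in data_path.iterdir():
        # Top-level files in the data folder are not service folders; skip them.
        if folder.is_dir():
            sizes[folder.name] = sum(
                f.stat().st_size for f in folder.rglob("*") if f.is_file()
            )
    return sizes


def print_largest(data_path: pathlib.Path, top: int = 10) -> None:
    """Print the `top` largest service folders, biggest first."""
    ranked = sorted(
        service_sizes(data_path).items(),
        key=lambda item: item[1],
        reverse=True,
    )
    for name, size in ranked[:top]:
        print(f"{name:30} {size / 1024:10.1f} KiB")
```

Calling print_largest(pathlib.Path("package/botocore/data")) would then list the ten largest service folders of a botocore copy located at package/botocore.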

This is particularly helpful when looking for ways to shrink installation bundles, such as Lambda packages. (Note that AWS recommends bundling botocore and boto3 with your Lambda function code, even though they provide an up-to-date version in the Lambda runtime environment.)

However, we have to be careful not to remove files that our program does depend on. Here's a small utility script that detects the AWS services a Python program uses and prunes a botocore package accordingly:

import argparse
import pathlib
import re
import shutil


# Matches boto3.client("name") or boto3.resource('name', ...) and captures the name.
PATTERN = re.compile(
    r"boto3\.(?:client|resource)\([\"'](\w+)[\"'](?:\)|\s*,)"
)


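def prune_botocore(
    source_path: pathlib.Path,
    botocore_path: pathlib.Path,
    keep: list[str],
):
    # Start from the explicitly kept services without mutating the caller's list.
    services = set(keep)

    # Scan every Python file for boto3 client/resource instantiations.
    for file in source_path.rglob("*.py"):
        print(f"Checking {file}")
        with open(file, "r", encoding="utf-8") as f:
            for line in f:
                if (match := PATTERN.search(line)) is not None:
                    services.add(match.group(1))

    print("Keeping the following services:", sorted(services))
    # Collect the matches first so that deleting folders cannot
    # interfere with the directory walk.
    for folder in list(botocore_path.rglob("data/*")):
        if folder.is_dir() and folder.name not in services:
            shutil.rmtree(folder)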


if __name__ == "__main__":
    arg_parser = argparse.ArgumentParser()
    arg_parser.add_argument(
        "--source",
        type=pathlib.Path,
        required=True,
        help="Location of the main source code.",
    )
    arg_parser.add_argument(
        "--botocore",
        type=pathlib.Path,
        required=True,
        help="Location of the botocore package.",
    )
    arg_parser.add_argument(
        "--keep",
        nargs="+",
        help="Services to keep even if not referenced in source code.",
        default=[],
    )
    args = arg_parser.parse_args()

    prune_botocore(
        source_path=args.source,
        botocore_path=args.botocore,
        keep=args.keep,
    )

The script uses the following regular expression to find every place where a boto3 client or resource is instantiated:

boto3\.(?:client|resource)\([\"'](\w+)[\"'](?:\)|\s*,)

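Then, it removes the folders of all other services from the data folder. Note that the pattern only recognizes calls of the form boto3.client(...) and boto3.resource(...) with the service name on the same line, and that \w+ does not match hyphens. Services with hyphenated names such as cognito-idp, or clients created from a session object, are therefore not detected automatically.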
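To see what the pattern does and does not catch, it can be tried on a few sample lines. The snippet below is only an illustration: the sample lines and the detect helper are made up, and the pattern is the one from the script, written as a raw string.

```python
import re

# Same pattern as in the script, written as a raw string.
PATTERN = re.compile(r"boto3\.(?:client|resource)\([\"'](\w+)[\"'](?:\)|\s*,)")

# Hypothetical source lines, for illustration only.
samples = [
    's3 = boto3.client("s3")',
    "table = boto3.resource('dynamodb', region_name='eu-west-1')",
    'idp = boto3.client("cognito-idp")',  # hyphen is not matched by \w+
    'sqs = session.client("sqs")',  # not called on the boto3 module
]


def detect(line: str) -> "str | None":
    """Return the service name found in the line, or None if there is no match."""
    match = PATTERN.search(line)
    return match.group(1) if match else None


results = [detect(line) for line in samples]
print(results)  # → ['s3', 'dynamodb', None, None]
```

The last two samples show the cases where a service would go undetected and its folder would be removed unless it is kept explicitly.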

It's also possible to specify which services should be kept beyond those found in the code. For example, in the case of a Lambda function that uses the X-Ray wrapper of Powertools for AWS Lambda, we have to keep the xray client even though we don't reference it directly in our code:

python prune_botocore.py --source package/my-function-with-tracing \
                         --botocore package/botocore \
                         --keep xray
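After pruning, it can be worth checking that every service we expect to use still has a folder in the package, for example to catch a misspelled name passed to --keep. A minimal sketch, assuming the data folder sits directly inside the botocore package folder (the function name missing_services is made up for illustration):

```python
import pathlib


def missing_services(botocore_path: pathlib.Path, expected: set[str]) -> set[str]:
    """Return the expected service names that have no folder under data/."""
    data_path = botocore_path / "data"
    present = {p.name for p in data_path.iterdir() if p.is_dir()}
    return expected - present
```

For the example above, missing_services(pathlib.Path("package/botocore"), {"xray"}) should return an empty set if the xray definitions survived the pruning.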