Convert WARC to ZIM

warc2zim

warc2zim provides a way to convert WARC files to ZIM, storing the WARC payload and WARC+HTTP headers separately.

Additionally, ReplayWeb.page is added to the ZIM, creating a self-contained ZIM that can render its content in a modern browser.

Usage

Example:

warc2zim ./path/to/myarchive.warc --output /output --name myarchive.zim -u https://example.com/

The above will create a ZIM file /output/myarchive.zim with https://example.com/ set as the main page.

Installation

python3 -m venv ./env  # creates a virtual python environment in ./env folder
./env/bin/pip install -U pip  # upgrade pip (package manager). recommended
./env/bin/pip install -U warc2zim  # install/upgrade warc2zim inside virtualenv

# direct access to in-virtualenv warc2zim binary, without shell-attachment
./env/bin/warc2zim --help

# alternatively, attach virtualenv to shell
source env/bin/activate
warc2zim --help
deactivate  # unloads virtualenv from shell

URL Filtering

By default, only URLs from the domain of the main page and its subdomains are included, e.g. only *.example.com URLs in the above example.

This allows for filtering out URLs that may be out of scope (e.g. ads, social media trackers).

To specify a different top-level domain, use the --include-domains / -i flag once for each domain. For example, if the main page is on a subdomain, https://subdomain.example.com/, but all URLs from *.example.com should be included, use:

warc2zim myarchive.warc --name myarchive -i example.com -u https://subdomain.example.com/starting/page.html

To simply include all URLs, use the --include-all / -a flag:

warc2zim myarchive.warc --name myarchive -a -u https://someother.example.com/page.html
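The filtering rules above can be sketched in a few lines of Python. This is only an illustration of the described behaviour, not warc2zim's actual code: the function name in_scope and its parameters are hypothetical, and matching a host against a domain by exact match or "." + domain suffix is an assumption.

```python
from urllib.parse import urlsplit


def in_scope(url, include_domains, include_all=False):
    """Return True if `url` would be kept under the rules described above.

    Hypothetical helper: `include_domains` plays the role of -i values,
    `include_all` the role of -a.
    """
    if include_all:
        return True
    host = urlsplit(url).hostname or ""
    # A host matches when it is the domain itself or one of its subdomains.
    return any(host == d or host.endswith("." + d) for d in include_domains)


# Default case: the scope is the domain of the main page.
main_page = "https://example.com/"
default_domains = [urlsplit(main_page).hostname]

print(in_scope("https://cdn.example.com/app.js", default_domains))
print(in_scope("https://ads.tracker.net/pixel.gif", default_domains))
print(in_scope("https://ads.tracker.net/pixel.gif", default_domains, include_all=True))
```

With a main page on https://subdomain.example.com/, the default scope in this sketch would be subdomain.example.com only, which is why the -i example.com flag is needed to widen it.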

Custom CSS

--custom-css accepts a URL or a path to a CSS file. The file is added to the ZIM and included in every HTML article at the very end of the head, just before </head> (if it exists).
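As a rough illustration of what "at the very end of </head>" means, the following sketch inserts a stylesheet link just before the closing tag and leaves documents without one untouched. The function inject_css, the link markup and the custom.css name are assumptions made for the example; the source does not say how warc2zim performs the rewrite.

```python
import re

# Matches the closing head tag regardless of case or inner whitespace.
HEAD_CLOSE = re.compile(r"</head\s*>", re.IGNORECASE)


def inject_css(html, css_href):
    """Insert a stylesheet link right before the first closing head tag."""
    match = HEAD_CLOSE.search(html)
    if match is None:
        # No </head> in the document: return it unchanged.
        return html
    link = '<link rel="stylesheet" href="{}">'.format(css_href)
    return html[:match.start()] + link + html[match.start():]


page = "<html><head><title>t</title></head><body>hi</body></html>"
print(inject_css(page, "custom.css"))
```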

See warc2zim -h for other options.

ZIM Entry Layout

The WARC to ZIM conversion is performed by splitting the WARC (and HTTP) headers from the payload.

For response records, the WARC + HTTP headers are stored under H/<url> while the payload is stored under A/<url>.

For resource records, the WARC headers are stored under H/<url> while the payload is stored under A/<url>. (There are no HTTP headers for resource records.)

For revisit records, the WARC + optional HTTP headers are stored under H/<url>, while no payload record is created.

If the payload for A/<url> is zero-length, the record is omitted, in line with the ZIM specification's rule of not storing empty records.
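The layout rules above can be summarised as a small mapping function. This is a sketch under stated assumptions rather than the tool's implementation: the name zim_entries is hypothetical, headers are modelled as plain bytes that are simply concatenated, the <url> placeholder is filled with the URL string as given, and a zero-length payload is assumed to drop only the A/ entry.

```python
def zim_entries(record_type, url, warc_headers, http_headers, payload):
    """Return a {zim_path: bytes} dict for one WARC record.

    Hypothetical helper illustrating the H/ and A/ split described above.
    """
    if record_type not in ("response", "resource", "revisit"):
        return {}  # other record types produce no entries

    headers = warc_headers
    # Resource records carry no HTTP headers; for revisits they are optional.
    if record_type != "resource" and http_headers:
        headers = headers + http_headers
    entries = {"H/" + url: headers}

    # Revisit records never get a payload entry; empty payloads are omitted.
    if record_type != "revisit" and payload:
        entries["A/" + url] = payload
    return entries


print(zim_entries("response", "example.com/", b"WARC...", b"HTTP...", b"<html>"))
print(zim_entries("revisit", "example.com/copy", b"WARC...", b"", b""))
```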

Duplicate URIs

WARCs allow multiple records for the same URL, while ZIM does not. As a result, only the first encountered response or resource record is stored in the ZIM, and subsequent records are ignored.

Revisit records are only added if they point to a different URL, and they are processed after response/resource records. A revisit record to the same URL will always be ignored.

All other WARC records are skipped.
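One way to picture these rules is a two-pass selection over the records. The sketch below is an assumption-laden model, not the project's code: records are represented as hypothetical (record_type, url, target_url) tuples, "pointing to a different URL" is taken to mean that the revisit's target differs from its own URL, and a revisit whose URL is already stored is assumed to be dropped as a duplicate.

```python
def select_records(records):
    """Pick the records that would end up in the ZIM.

    `records` is an iterable of (record_type, url, target_url) tuples in
    WARC order; target_url is only meaningful for revisit records.
    """
    kept = []
    seen = set()
    revisits = []

    # First pass: keep the first response/resource record per URL.
    for rec in records:
        rtype, url, _target = rec
        if rtype in ("response", "resource"):
            if url not in seen:
                seen.add(url)
                kept.append(rec)
        elif rtype == "revisit":
            revisits.append(rec)  # handled in the second pass
        # Any other record type is skipped.

    # Second pass: revisits, only when they point to a different URL.
    for rec in revisits:
        _rtype, url, target = rec
        if url != target and url not in seen:
            seen.add(url)
            kept.append(rec)
    return kept


warc = [
    ("response", "a", None),
    ("revisit", "b", "a"),
    ("response", "a", None),  # duplicate URL, ignored
    ("revisit", "a", "a"),    # revisit to the same URL, ignored
    ("request", "a", None),   # other record type, skipped
]
print(select_records(warc))
```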

i18n

warc2zim has very little non-content text, but it still uses gettext (through Babel) for internationalization.

To add a new locale (fr in this example; use only ISO 639-1 codes):

  1. init for your locale: python setup.py init_catalog -l fr
  2. make sure the POT is up to date: python setup.py extract_messages
  3. update your locale's catalog: python setup.py update_catalog
  4. translate the PO file (poedit is your friend)
  5. compile the updated translation: python setup.py compile_catalog

License

GPLv3 or later, see LICENSE for more details.
