Convert WARC to ZIM
warc2zim
warc2zim provides a way to convert WARC files to ZIM, storing the WARC payload and WARC+HTTP headers separately.
ReplayWeb.page is also added to the ZIM, creating a self-contained ZIM that can render its content in a modern browser.
Usage
Example:
warc2zim ./path/to/myarchive.warc --output /output --name myarchive.zim -u https://example.com/
The above will create a ZIM file /output/myarchive.zim with https://example.com/ set as the main page.
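For scripted conversions, the same invocation can be assembled in Python. This is a minimal sketch that only uses the flags shown in the example above; the helper name `build_warc2zim_command` is hypothetical and not part of warc2zim.

```python
import shlex


def build_warc2zim_command(warc_path, output_dir, zim_name, main_url):
    """Return the argv list for the warc2zim call shown in the example.

    Hypothetical helper: only --output, --name and -u are used, because
    those are the flags documented in this section.
    """
    return [
        "warc2zim",
        warc_path,
        "--output", output_dir,
        "--name", zim_name,
        "-u", main_url,
    ]


cmd = build_warc2zim_command(
    "./path/to/myarchive.warc", "/output", "myarchive.zim", "https://example.com/"
)
# Print the command as a shell-safe string for inspection.
print(shlex.join(cmd))

# To actually run it (requires warc2zim to be installed and on PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems when paths or URLs contain special characters.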
URL Filtering
By default, only URLs from the domain of the main page and its subdomains are included, e.g. only *.example.com URLs in the above example.
This allows for filtering out URLs that may be out of scope (e.g. ads, social media trackers).
To specify a different domain, use the --include-domains / -i flag once for each domain. For example, if the main page is on a subdomain, https://subdomain.example.com/, but all URLs from *.example.com should be included, use:
warc2zim myarchive.warc --name myarchive -i example.com -u https://subdomain.example.com/starting/page.html
To simply include all URLs, use the --include-all / -a flag:
warc2zim myarchive.warc --name myarchive -a -u https://someother.example.com/page.html
See warc2zim -h for other options.
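The filtering rule described in this section can be illustrated with a small predicate. This is a sketch of the described behavior, not warc2zim's actual implementation; the function names `in_scope` and `default_domains` are hypothetical, and matching on the hostname suffix is an assumption about how "domain and subdomains" is interpreted.

```python
from urllib.parse import urlsplit


def default_domains(main_url):
    """Default scope: the host of the main page (subdomains match via in_scope)."""
    host = urlsplit(main_url).hostname
    return [host] if host else []


def in_scope(url, include_domains, include_all=False):
    """Return True if url belongs to one of include_domains or a subdomain.

    include_all mirrors the --include-all / -a flag: every URL is accepted.
    """
    if include_all:
        return True
    host = (urlsplit(url).hostname or "").lower()
    for domain in include_domains:
        domain = domain.lower().lstrip(".")
        # Exact match, or a subdomain ending in ".<domain>".
        if host == domain or host.endswith("." + domain):
            return True
    return False


# Equivalent of: -i example.com -u https://subdomain.example.com/starting/page.html
domains = ["example.com"]
print(in_scope("https://www.example.com/about", domains))
print(in_scope("https://ads.tracker.net/pixel.gif", domains))
```

Comparing against `"." + domain` rather than the bare domain keeps a host such as notexample.com from matching example.com.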
ZIM Entry Layout
The WARC to ZIM conversion is performed by splitting the WARC (and HTTP) headers from the payload.
For response records, the WARC + HTTP headers are stored under H/<url> while the payload is stored under A/<url>.
For resource records, the WARC headers are stored under H/<url> while the payload is stored under A/<url>. (There are no HTTP headers for resource records.)
For revisit records, the WARC + optional HTTP headers are stored under H/<url>, while no payload record is created.
If the payload A/<url> is zero-length, the payload entry is omitted, because the ZIM specification does not allow empty records.
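The layout rules above can be summarized as a mapping from a record to the ZIM entry paths it produces. This is a minimal sketch under stated assumptions: the function name `zim_entry_paths` is hypothetical, the URL is used verbatim as the `<url>` part (any normalization warc2zim applies is not described here), and record types other than the three listed produce no entries.

```python
def zim_entry_paths(record_type, url, payload):
    """Return the ZIM entry paths for one WARC record, per the rules above.

    record_type: "response", "resource" or "revisit" (WARC-Type values).
    payload: the record payload as bytes.
    """
    if record_type not in ("response", "resource", "revisit"):
        # Assumption for this sketch: other record types produce nothing.
        return []

    # Headers (WARC, plus HTTP where present) always go under H/.
    paths = ["H/" + url]

    # Revisit records never get a payload entry; empty payloads are omitted.
    if record_type != "revisit" and len(payload) > 0:
        paths.append("A/" + url)
    return paths


print(zim_entry_paths("response", "example.com/index.html", b"<html></html>"))
print(zim_entry_paths("revisit", "example.com/copy.html", b""))
```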
Duplicate URIs
WARCs allow multiple records for the same URL, while ZIM does not. As a result, only the first encountered response or resource record is stored in the ZIM, and subsequent records are ignored.
Revisit records are only added if they point to a different URL, and they are processed after response/resource records. A revisit record to the same URL will always be ignored.
All other WARC records are skipped.
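The selection rules in this section can be sketched as a two-pass filter. This is an illustration rather than warc2zim's own code: the function name `select_records` and the `(record_type, url, target_url)` tuple shape are hypothetical, and skipping a revisit whose URL already has an entry is an assumption drawn from the rule that ZIM holds one entry per URL.

```python
def select_records(records):
    """Pick the records that would be stored, following the rules above.

    records: iterable of (record_type, url, target_url) tuples, where
    target_url is the URL a revisit record points to (None otherwise).
    """
    seen = set()
    kept = []
    revisits = []

    # Pass 1: keep only the first response/resource record per URL.
    for record_type, url, target_url in records:
        if record_type in ("response", "resource"):
            if url not in seen:
                seen.add(url)
                kept.append((record_type, url, target_url))
        elif record_type == "revisit":
            revisits.append((record_type, url, target_url))
        # All other record types are skipped.

    # Pass 2: revisits are handled after response/resource records.
    for record_type, url, target_url in revisits:
        if target_url == url:
            continue  # a revisit to the same URL is always ignored
        if url in seen:
            continue  # assumption: one entry per URL in the ZIM
        seen.add(url)
        kept.append((record_type, url, target_url))
    return kept


sample = [
    ("response", "https://example.com/", None),
    ("response", "https://example.com/", None),
    ("revisit", "https://example.com/", "https://example.com/"),
    ("revisit", "https://example.com/copy", "https://example.com/"),
    ("request", "https://example.com/", None),
    ("resource", "https://example.com/img.png", None),
]
for record in select_records(sample):
    print(record)
```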
i18n
warc2zim has very minimal non-content text but still uses gettext through babel to internationalize.
To add a new locale (fr in this example, use only ISO-639-1):
- init for your locale:
python setup.py init_catalog -l fr
- make sure the POT is up to date:
python setup.py extract_messages
- update your locale's catalog:
python setup.py update_catalog
- translate the PO file (poedit is your friend)
- compile updated translation:
python setup.py compile_catalog
License