Scrapy WARC I/O
Scrapy Warcio
A Web Archive WARC I/O module for Scrapy
Install
$ pip install scrapy_warcio
Usage
- Copy the settings.yml distributed with scrapy_warcio and edit it with your configuration settings:
---
warc_spec: https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.0/
max_warc_size: 10000000000 # 10GB
collection: ~ # collection name
description: ~ # collection description
operator: ~ # operator email address
robots: ~ # robots policy
user_agent: ~ # your user-agent
warc_prefix: ~ # WARC filename prefix
warc_dest: ~ # WARC files destination
...
- Export SCRAPY_WARCIO_SETTINGS='/path/to/settings.yml'
- Enable DownloaderMiddlewares in <spider>/<spider>/settings.py
- Use scrapy_warcio methods in <spider>/<spider>/middlewares.py:
import scrapy_warcio

class <spider>DownloaderMiddlewares:

    def __init__(self):
        self.warcio = scrapy_warcio.ScrapyWarcIo()

    def process_request(self, request, spider):
        # set WARC-Date for both request and response
        request.meta['WARC-Date'] = scrapy_warcio.warc_date()
        spider.logger.info('warcio request: %s', request.url)
        return None

    def process_response(self, request, response, spider):
        # write response and request
        self.warcio.write_response(response)
        spider.logger.info('warcio response: %s', response.url)
        spider.logger.info('warc_count: %s', self.warcio.warc_count)
        spider.logger.info('warc_fname: %s', self.warcio.warc_fname)
        spider.logger.info('warc_size: %s', self.warcio.warc_size)
        return response
- Upload your Scrapy WARCs to your favorite archive!
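The Export and Enable steps above are not illustrated with code, so here is a minimal sketch of what they might look like in a project's settings.py. The project name myspider, the class name MyspiderDownloaderMiddlewares, the priority 543, and the helper find_warcio_settings are all assumptions made for illustration; the helper is not part of scrapy_warcio and only checks that the exported path points at a real file before a crawl starts.

```python
import os

# Hypothetical project and class names: replace them with your own.
# DOWNLOADER_MIDDLEWARES is the standard Scrapy setting; 543 is an
# arbitrary priority chosen for this sketch.
DOWNLOADER_MIDDLEWARES = {
    "myspider.middlewares.MyspiderDownloaderMiddlewares": 543,
}


def find_warcio_settings(environ=None):
    """Return the path held in SCRAPY_WARCIO_SETTINGS, or raise early.

    Illustrative helper only, not part of scrapy_warcio.
    """
    environ = os.environ if environ is None else environ
    path = environ.get("SCRAPY_WARCIO_SETTINGS")
    if not path:
        raise RuntimeError("SCRAPY_WARCIO_SETTINGS is not set")
    if not os.path.isfile(path):
        raise RuntimeError(f"settings file not found: {path}")
    return path
```

Calling find_warcio_settings() before starting a crawl would surface a missing or mistyped path immediately rather than partway through the run.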
Help
$ pydoc scrapy_warcio
or
>>> import scrapy_warcio
>>> help(scrapy_warcio)
TODO
Making this a Scrapy extension may make it more useful: https://docs.scrapy.org/en/latest/topics/extensions.html
@internetarchive