Skip to main content

YAML based lightweight crawlers

Project description

# Skyscraper

YAML based lightweight crawlers


## Usage


Each web crawler is defined in a yml file


# the name of the crawler
name: Python 3.x docs
# the number of parallel thread workers
threads: 3

# start urls
params:
start_url: https://docs.python.org/3/index.html

# how/where the results are saved
results:
type: Json
file: "python.json"

# on each url labeled "result", results will be extracted using
# this scheme
result_extractor:
fields:
- name: title
rules:
select: h1
text: yes
single: true


# the first page is labeled "start" and for each extracted url, we label it
# accordingly. In this example, we extract the results directly from
# the first page
steps:
- name: start
label: start
extract:
- type: ahrefs
label: result
rules:
select: a.biglink


To run the crawler, execute

skyscraper run examples/python_docs.yaml


Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

skyscraper-0.0.3.tar.gz (4.2 kB view hashes)

Uploaded Source

Built Distribution

skyscraper-0.0.3-py3-none-any.whl (6.1 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page