
A library to create a bot / spider / crawler.

Project description

Exoskeleton


For my dissertation I downloaded hundreds of thousands of documents and fed them into a machine learning pipeline. A high-speed connection is helpful, but it carries the risk of running an involuntary denial-of-service attack on the servers that provide those documents. This creates a need for a crawler / scraper that avoids putting too high a load on the connection and instead runs permanently and fault-tolerantly until all files are downloaded.
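
The idea of a slow, fault-tolerant worker can be sketched in a few lines. This is a minimal illustration and not exoskeleton's actual implementation: the function name `work_through`, the `fetch` callback, and the `wait_seconds` parameter are all hypothetical.

```python
import time
from collections import deque
from typing import Callable, Deque, Dict


def work_through(
    urls: Deque[str],
    fetch: Callable[[str], bytes],
    wait_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Process URLs one at a time and pause between requests."""
    outcome: Dict[str, str] = {}
    while urls:
        url = urls.popleft()
        try:
            fetch(url)
            outcome[url] = "done"
        except Exception as error:
            # A single failing download is recorded, but must not stop the bot.
            outcome[url] = f"failed: {error}"
        if urls:
            # Wait before the next request to spare the remote server.
            sleep(wait_seconds)
    return outcome
```

Passing `fetch` and `sleep` in as parameters keeps the loop independent of any particular HTTP client and makes it easy to try out without network access.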

Exoskeleton is a Python framework that aims to help you build a similar bot. Its main functionalities are:

  • Managing a download queue within a MariaDB database.
  • Avoiding processing the same URL more than once.
  • Working through that queue by either
    • downloading files to disk,
    • storing the page source code into a database table,
    • storing the page text,
    • or making PDF copies of webpages.
  • Managing already downloaded files:
    • Storing multiple versions of a specific file.
    • Assigning labels to downloads, so they can be found and grouped easily.
  • Sending progress reports to the admin.
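
The queue and label handling listed above can be sketched as follows. This is a simplified illustration under stated assumptions, not exoskeleton's schema or API: it swaps MariaDB for an in-memory SQLite database so that it runs without a server, and the class name `SketchQueue`, its methods, and the table layout are hypothetical.

```python
import hashlib
import sqlite3
from typing import Iterable, Set


class SketchQueue:
    """Hypothetical stand-in for a download queue kept in a database."""

    def __init__(self) -> None:
        # Assumption: SQLite in memory instead of a MariaDB server.
        self.db = sqlite3.connect(":memory:")
        self.db.executescript(
            """
            CREATE TABLE queue (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                url TEXT NOT NULL,
                url_hash TEXT NOT NULL UNIQUE
            );
            CREATE TABLE labels (
                url_hash TEXT NOT NULL,
                label TEXT NOT NULL,
                UNIQUE (url_hash, label)
            );
            """
        )

    @staticmethod
    def _key(url: str) -> str:
        return hashlib.sha256(url.encode("utf-8")).hexdigest()

    def add(self, url: str, labels: Iterable[str] = ()) -> bool:
        """Queue a URL only once; return True if it was new."""
        key = self._key(url)
        known = self.db.execute(
            "SELECT 1 FROM queue WHERE url_hash = ?", (key,)
        ).fetchone() is not None
        if not known:
            self.db.execute(
                "INSERT INTO queue (url, url_hash) VALUES (?, ?)", (url, key)
            )
        # Labels are attached even when the URL was already queued.
        self.db.executemany(
            "INSERT OR IGNORE INTO labels (url_hash, label) VALUES (?, ?)",
            [(key, label) for label in labels],
        )
        self.db.commit()
        return not known

    def labels_for(self, url: str) -> Set[str]:
        rows = self.db.execute(
            "SELECT label FROM labels WHERE url_hash = ?", (self._key(url),)
        )
        return {row[0] for row in rows}

    def size(self) -> int:
        return self.db.execute("SELECT COUNT(*) FROM queue").fetchone()[0]
```

Adding the same URL twice leaves a single queue entry, while any labels passed with the second call are still stored for that URL.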

Exoskeleton has extensive documentation.

Two other Python libraries were created as part of this project:

  • userprovided : checks user input for validity and plausibility / converts input into better formats
  • bote : sends messages (currently via a local or remote SMTP server)
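
To illustrate the kind of work these two helpers cover, here is a rough sketch that uses only the Python standard library. It does not use or reproduce the userprovided or bote APIs: the function names `looks_like_http_url` and `build_report` and the example addresses are hypothetical.

```python
from email.message import EmailMessage
from urllib.parse import urlsplit


def looks_like_http_url(candidate: str) -> bool:
    """Plausibility check: an http(s) scheme and a host must be present."""
    parts = urlsplit(candidate.strip())
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def build_report(sender: str, recipient: str,
                 done: int, remaining: int) -> EmailMessage:
    """Compose a progress report; sending it is left to the caller."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = f"Bot progress: {done} done, {remaining} queued"
    msg.set_content(
        f"Downloads finished: {done}\nTasks still in the queue: {remaining}\n"
    )
    return msg
```

A message built this way could then be handed to `smtplib.SMTP(...).send_message(msg)` for a local or remote SMTP server.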

Example

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import logging

import exoskeleton

logging.basicConfig(level=logging.DEBUG)

# Create a bot
# exoskeleton makes reasonable assumptions about
# parameters left out, like:
# - host = localhost
# - port = 3306 (MariaDB standard)
# - ...
exo = exoskeleton.Exoskeleton(
    project_name='Bot',
    database_settings={'database': 'exoskeleton',
                       'username': 'exoskeleton',
                       'passphrase': ''},
    # True to stop once the queue is empty; otherwise it will
    # continuously look for new tasks in the queue:
    bot_behavior={'stop_if_queue_empty': True},
    filename_prefix='bot_',
    chrome_name='chromium-browser',
    target_directory='/home/myusername/myBot/'
)

exo.add_file_download('https://www.ruediger-voigt.eu/examplefile.txt')
# => Will be saved in the target directory. The filename will be the
#    chosen prefix followed by the database id and .txt.

exo.add_file_download(
    'https://www.ruediger-voigt.eu/examplefile.txt',
    {'example-label', 'foo'})
# => Duplicate will be recognized and not added to the queue,
#    but the labels will be associated with the file in the
#    database.


exo.add_file_download(
    'https://www.ruediger-voigt.eu/file_does_not_exist.pdf')
# => Nonexistent file: will be marked, but will not stop the bot.

# Save a page's code into the database:
exo.add_save_page_code('https://www.ruediger-voigt.eu/')

# Use Chromium or Google Chrome to generate a PDF of the website:
exo.add_page_to_pdf('https://github.com/RuedigerVoigt/exoskeleton')

# Work through the queue:
exo.process_queue()
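
The comments in the example state that a downloaded file is named with the chosen prefix, followed by the database id and the extension. A minimal sketch of such a naming scheme is shown here. The helper `target_filename` is hypothetical, the id `1` is made up for illustration, and taking the extension from the URL path is an assumption, not necessarily how exoskeleton determines it.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def target_filename(prefix: str, database_id: int, url: str) -> str:
    """Combine prefix, database id and the extension found in the URL path."""
    extension = PurePosixPath(urlsplit(url).path).suffix
    return f"{prefix}{database_id}{extension}"


print(target_filename(
    "bot_", 1, "https://www.ruediger-voigt.eu/examplefile.txt"))
# bot_1.txt
```

Naming files by id instead of by their original name avoids collisions when different servers offer files with the same name.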

Download files

  • Source distribution: exoskeleton-1.2.1.tar.gz (28.1 kB)
  • Built distribution: exoskeleton-1.2.1-py3-none-any.whl (37.4 kB, Python 3)