A library to create a bot / spider / crawler.

Project description

Exoskeleton

Machine learning and other applications often require downloading thousands, sometimes hundreds of thousands, of files.

Using a high-speed connection carries the risk of unintentionally running a denial-of-service attack against the servers that provide those files and webpages.

Exoskeleton is a Python framework that helps you build a crawler / scraper that avoids putting excessive load on the connection. Instead, it runs continuously and fault-tolerantly until all files have been downloaded.
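
The idea of trading speed for politeness and fault tolerance can be illustrated with a minimal sketch. This is not exoskeleton's implementation: the function `process_politely`, its parameters `fetch`, `wait_seconds` and `sleep`, and the default wait of five seconds are all hypothetical names and values chosen for this illustration.

```python
import time
from typing import Callable, Iterable, List


def process_politely(urls: Iterable[str],
                     fetch: Callable[[str], None],
                     wait_seconds: float = 5.0,
                     sleep: Callable[[float], None] = time.sleep
                     ) -> List[str]:
    """Call fetch() for every URL with a pause between requests.

    A failing request is recorded and skipped, so a single bad URL
    does not stop the whole run. Returns the list of failed URLs.
    """
    failed = []
    for position, url in enumerate(urls):
        if position > 0:
            # Pause before every request except the first one,
            # so the remote server is not flooded.
            sleep(wait_seconds)
        try:
            fetch(url)
        except Exception:
            # Hypothetical error handling: remember the URL and go on.
            failed.append(url)
    return failed
```

The `sleep` parameter only exists so the pause can be replaced in tests; in normal use the default `time.sleep` applies.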

Its main features are:

  • Managing the download queue and document data within a MariaDB database.
  • Avoiding processing the same URL more than once.
  • Working through the queue by either
    • downloading files to disk,
    • storing the page source code into a database table,
    • storing the page text,
    • or making PDF copies of webpages.
  • Managing already downloaded files:
    • Storing multiple versions of a specific file.
    • Assigning labels to downloads, so they can be found and grouped easily.
  • Sending progress reports to the admin.
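
How a database-backed queue can skip duplicate URLs while still attaching labels is sketched below. This is an illustration under stated assumptions, not exoskeleton's actual schema or code: it uses the standard library's `sqlite3` in place of MariaDB so it runs without a server, and the table names `queue` and `labels` as well as the functions `create_queue`, `add_to_queue` and `labels_for` are hypothetical.

```python
import sqlite3
from typing import Iterable, Set


def create_queue() -> sqlite3.Connection:
    """Create an in-memory stand-in for the queue database."""
    conn = sqlite3.connect(':memory:')
    conn.execute('CREATE TABLE queue ('
                 'id INTEGER PRIMARY KEY, url TEXT UNIQUE NOT NULL)')
    conn.execute('CREATE TABLE labels ('
                 'url_id INTEGER NOT NULL, label TEXT NOT NULL, '
                 'UNIQUE (url_id, label))')
    return conn


def add_to_queue(conn: sqlite3.Connection,
                 url: str,
                 labels: Iterable[str] = ()) -> bool:
    """Queue the URL once; attach labels even if it is a duplicate.

    Returns True if the URL was new, False if it was already queued.
    """
    row = conn.execute('SELECT id FROM queue WHERE url = ?',
                       (url,)).fetchone()
    is_new = row is None
    if is_new:
        cursor = conn.execute('INSERT INTO queue (url) VALUES (?)', (url,))
        url_id = cursor.lastrowid
    else:
        url_id = row[0]
    # The UNIQUE constraint plus OR IGNORE keeps each label once per URL.
    conn.executemany(
        'INSERT OR IGNORE INTO labels (url_id, label) VALUES (?, ?)',
        [(url_id, label) for label in labels])
    conn.commit()
    return is_new


def labels_for(conn: sqlite3.Connection, url: str) -> Set[str]:
    """Return all labels attached to the given URL."""
    rows = conn.execute(
        'SELECT label FROM labels JOIN queue ON queue.id = labels.url_id '
        'WHERE queue.url = ?', (url,)).fetchall()
    return {row[0] for row in rows}
```

Adding the same URL a second time with labels leaves the queue at one entry but still records the labels, which mirrors the behaviour listed above.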

Documentation

How To Use Exoskeleton

Example Uses

  • Downloading an Archive: A fairly complex use case that requires some custom SQL. This is the project that triggered the development of exoskeleton.

Technical Documentation

Example

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import logging

import exoskeleton

logging.basicConfig(level=logging.DEBUG)

# Create a bot
# exoskeleton makes reasonable assumptions about
# parameters left out, like:
# - host = localhost
# - port = 3306 (MariaDB standard)
# - ...
exo = exoskeleton.Exoskeleton(
    project_name='Bot',
    database_settings={'database': 'exoskeleton',
                       'username': 'exoskeleton',
                       'passphrase': ''},
    # True to stop once the queue is empty. Otherwise the bot
    # keeps checking the queue for new tasks:
    bot_behavior={'stop_if_queue_empty': True},
    filename_prefix='bot_',
    chrome_name='chromium-browser',
    target_directory='/home/myusername/myBot/'
)

exo.add_file_download('https://www.ruediger-voigt.eu/examplefile.txt')
# => Will be saved in the target directory. The filename will be the
#    chosen prefix followed by the database id and .txt.

exo.add_file_download(
    'https://www.ruediger-voigt.eu/examplefile.txt',
    {'example-label', 'foo'})
# => Duplicate will be recognized and not added to the queue,
#    but the labels will be associated with the file in the
#    database.


exo.add_file_download(
    'https://www.ruediger-voigt.eu/file_does_not_exist.pdf')
# => Nonexistent file: will be marked, but will not stop the bot.

# Save a page's code into the database:
exo.add_save_page_code('https://www.ruediger-voigt.eu/')

# Use Chromium or Google Chrome to generate a PDF of the website:
exo.add_page_to_pdf('https://github.com/RuedigerVoigt/exoskeleton')

# Work through the queue:
exo.process_queue()
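
The naming rule mentioned in the example (chosen prefix, then the database id, then the file extension) can be sketched as follows. This is an assumption-laden illustration rather than exoskeleton's code: `build_filename` is a hypothetical helper, the id `1` is made up, and taking the extension from the URL path is an assumption, since the example does not say how the extension is determined.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def build_filename(prefix: str, database_id: int, url: str) -> str:
    """Return prefix + database id + the extension from the URL path."""
    # urlparse() separates the path from any query string or fragment,
    # so '?x=1' does not end up in the extension.
    extension = PurePosixPath(urlparse(url).path).suffix
    return f'{prefix}{database_id}{extension}'


print(build_filename('bot_', 1,
                     'https://www.ruediger-voigt.eu/examplefile.txt'))
# prints: bot_1.txt
```

A URL without an extension in its path simply yields the prefix and the id.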

Download files

Source Distribution

exoskeleton-2.0.0.tar.gz (28.6 kB)

Uploaded Source

Built Distribution

exoskeleton-2.0.0-py3-none-any.whl (41.0 kB)

Uploaded Python 3

File details

Details for the file exoskeleton-2.0.0.tar.gz.

File metadata

  • Download URL: exoskeleton-2.0.0.tar.gz
  • Upload date:
  • Size: 28.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.25.1 setuptools/45.2.0 requests-toolbelt/0.9.1 tqdm/4.54.1 CPython/3.8.10

File hashes

Hashes for exoskeleton-2.0.0.tar.gz
Algorithm Hash digest
SHA256 a9d0af625ac9d2ba63b559a016f1a5f1bb3449d1818c1be5b719e492fd57bd66
MD5 2d1f0aa33d905a6cac1533461cfc8767
BLAKE2b-256 d727729bbda62cc0a87b844a973c1d6f1d9bd566fb21a37e04fb349de91fbf0a
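
A downloaded archive can be checked against the SHA256 digest listed above with Python's standard `hashlib` module. The sketch below is one possible way to do this; the helper name `sha256_of_file` and the chunk size are hypothetical choices, and the file path in the usage note assumes the archive was saved in the current directory.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, 'rb') as handle:
        # Reading in chunks keeps memory use low for large files.
        for chunk in iter(lambda: handle.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling `sha256_of_file('exoskeleton-2.0.0.tar.gz')` and comparing the result with the SHA256 value in the table shows whether the download is intact.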

File details

Details for the file exoskeleton-2.0.0-py3-none-any.whl.

File metadata

  • Download URL: exoskeleton-2.0.0-py3-none-any.whl
  • Upload date:
  • Size: 41.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.25.1 setuptools/45.2.0 requests-toolbelt/0.9.1 tqdm/4.54.1 CPython/3.8.10

File hashes

Hashes for exoskeleton-2.0.0-py3-none-any.whl
Algorithm Hash digest
SHA256 5e8ef203e97534939696a0b13b6f68e52b6cbde8d93b91b4e10274fff21e0a73
MD5 f9251bc71b066de6b7d590527eb7c7e8
BLAKE2b-256 431a592f3b9002f5d4738ae05c0e8baec5c9bccbf33872e1b676f4e4adf764bc
