bdownload


A multi-threaded and multi-source aria2-like batch file downloading library for Python 2.7 and 3.6+

:bulb:
See also https://bdownload.readthedocs.io for API reference.

Installation

  • via PyPI

    pip install bdownload

  • from the source directory locally

    pip install .

    Note that you should first git clone the repository, or download the source tarball from it and unpack it

:bulb:
For Python 2.7: as of version 2022.5.18, certifi has dropped support for Python 2.x. To upgrade to the latest CA certificate bundle, simply run:

$ bdownload-upd-cacert-py2

Usage: as a Python package

Importing

    from bdownload import BDownloader, BDownloaderException

            or

    import bdownload

Signatures

class bdownload.BDownloader(max_workers=None, min_split_size=1024*1024, chunk_size=1024*100, proxy=None, cookies=None, user_agent=None, logger=None, progress='mill', num_pools=20, pool_maxsize=20, request_timeout=None, request_retries=None, status_forcelist=None, resumption_retries=None, continuation=True, referrer=None, check_certificate=True, ca_certificate=None, certificate=None)

    Create and initialize a BDownloader object for executing download jobs.

  • The max_workers parameter specifies the number of parallel downloading threads; if set to None, it defaults to the number of processors multiplied by 5.

  • min_split_size denotes the size in bytes of the pieces into which a file is split for parallel downloading; it defaults to 1024*1024 bytes (i.e. 1MB).

  • The chunk_size parameter specifies the chunk size in bytes of each HTTP range request; it defaults to 1024*100 (i.e. 100KB) if not provided.

  • proxy supports both HTTP and SOCKS proxies in the form of http://[user:pass@]host:port and socks5://[user:pass@]host:port, respectively.

  • If cookies needs to be set, it must take one of three forms: a string of cookie_key=cookie_value pairs, with multiple pairs separated by whitespace and/or semicolons if applicable, e.g. 'key1=val1 key2=val2;key3=val3'; a dict of such pairs; or an instance of CookieJar, i.e. cookielib.CookieJar for Python 2.7, http.cookiejar.CookieJar for Python 3.x, or RequestsCookieJar from requests. See the sketch after this list.

    Note that a ValueError exception will be raised if cookies is of the str type but not in a valid format.

  • When user_agent is not given, it will default to 'bdownload/VERSION', with VERSION being replaced by the package's version number.

  • The logger parameter specifies an event logger. If logger is not None, it must be an object of class logging.Logger or of its customized subclass. Otherwise, it will use a default module-level logger returned by logging.getLogger(__name__).

  • progress determines the style of the progress bar displayed while downloading files. Possible values are 'mill', 'bar' and 'none', with 'mill' being the default. To disable the progress bar, e.g. when scripting or running multiple instances, set it to 'none'.

  • The num_pools parameter has the same meaning as num_pools in urllib3.PoolManager and will eventually be passed to it. Specifically, num_pools specifies the number of connection pools to cache.

  • pool_maxsize will be passed to the underlying requests.adapters.HTTPAdapter. It specifies the maximum number of connections to save that can be reused in the urllib3 connection pool.

  • The request_timeout parameter specifies the timeout(s) for the internal requests session, either as a single float applied to both the connect and read timeouts, or as a (connect, read) tuple setting each separately. If set to None, it defaults to (3.05, 6).

  • request_retries specifies the maximum number of retry attempts allowed on exceptions and on the status codes of interest (i.e. status_forcelist) for the builtin Retry logic of urllib3. It defaults to download.URLLIB3_BUILTIN_RETRIES_ON_EXCEPTION if not given.

    NB: there are two retry mechanisms that jointly determine the total retries of a request: the above-mentioned Retry logic built into urllib3, and an extended high-level retry factor that complements it. The total retries are bounded by request_retries * (requests_extended_retries_factor + 1); for example, with request_retries=3 and a factor of 2, a request is retried at most 3 * (2 + 1) = 9 times. requests_extended_retries_factor can be changed through the module-level function bdownload.set_requests_retries_factor() and is initialized to download.REQUESTS_EXTENDED_RETRIES_FACTOR by default; usually you won't need to change it.

  • status_forcelist specifies the set of HTTP status codes on which a retry should be enforced. It defaults to download.URLLIB3_RETRY_STATUS_CODES if not given.

  • The resumption_retries parameter specifies the maximum allowable number of retries on error when resuming an interrupted download while streaming the request content. It defaults to download.REQUESTS_RETRIES_ON_STREAM_EXCEPTION when not provided.

  • The continuation parameter specifies whether, if possible, to resume files that were only partially downloaded before, e.g. when a previous download was terminated by the user pressing Ctrl-C. When not present, it defaults to True.

  • referrer specifies an HTTP request header Referer that applies to all downloads. If set to '*', the request URL shall be used as the referrer per download.

  • The check_certificate parameter specifies whether to verify the server's TLS certificate or not. It defaults to True.

  • ca_certificate specifies a path to the preferred CA bundle file (.pem) or to a directory with certificates of trusted CAs in PEM format. If set to a directory, the directory must have been processed using the c_rehash utility supplied with OpenSSL, according to requests. NB: each certificate file in the directory may contain only one CA certificate.

  • certificate specifies a client certificate. It has the same meaning as that of cert in requests.request().
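
As a quick illustration, here is a minimal sketch that puts a few of these options together; the proxy address, cookie string and logger name below are placeholder values for illustration, not defaults:

    import logging

    from bdownload import BDownloader

    logger = logging.getLogger('myapp.downloads')  # hypothetical logger name

    downloader = BDownloader(
        max_workers=10,
        proxy='socks5://127.0.0.1:1080',            # placeholder proxy
        cookies='key1=val1 key2=val2; key3=val3',   # or a dict, or a CookieJar
        request_timeout=(3.05, 10),
        progress='none',        # disable the progress bar, e.g. when scripting
        logger=logger,
    )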

BDownloader.downloads(path_urls)

    Submit multiple downloading jobs at a time.

  • path_urls accepts a list of tuples of the form (path, url), where path should be a pathname, optionally prefixed with an absolute or relative path, and url should be a URL string, which may consist of multiple TAB-separated URLs pointing to the same file. A valid path_urls, for example, could be [('/opt/files/bar.tar.bz2', 'https://foo.cc/bar.tar.bz2'), ('./sanguoshuowen.pdf', 'https://bar.cc/sanguoshuowen.pdf\thttps://foo.cc/sanguoshuowen.pdf'), ('/to/be/created/', 'https://flash.jiefang.rmy/lc-cl/gaozhuang/chelsia/rockspeaker.tar.gz'), ('/path/to/existing-dir', 'https://ghosthat.bar/foo/puretonecone81.xz\thttps://tpot.horn/foo/puretonecone81.xz\thttps://hawkhill.bar/foo/puretonecone81.xz')].

    Note that BDownloaderException will be raised if the downloads were interrupted, e.g. by calling BDownloader.cancel() in a SIGINT signal handler, in the process of submitting the download requests.

:warning:
The method is not thread-safe: it must not be called concurrently from multiple threads on the same instance.

When multi-instanced (e.g. one instance per thread), the file paths specified in one instance should not overlap those in another, so as to avoid potential race conditions. File loss may occur, for example, if a failed download task in one instance tries to delete a directory that is being accessed by download tasks in other instances. However, this limitation doesn't apply to file paths specified within the same instance.
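
The sketch below submits one single-source and one multi-source download, with the mirrors of the second file joined into a TAB-separated string; all paths and URLs are placeholders taken from the example above:

    from bdownload import BDownloader, BDownloaderException

    path_urls = [
        ('/opt/files/bar.tar.bz2', 'https://foo.cc/bar.tar.bz2'),
        # two mirrors of the same file, joined with a TAB
        ('./sanguoshuowen.pdf', '\t'.join(['https://bar.cc/sanguoshuowen.pdf',
                                           'https://foo.cc/sanguoshuowen.pdf'])),
    ]

    with BDownloader(max_workers=20) as downloader:
        try:
            downloader.downloads(path_urls)
        except BDownloaderException:
            pass  # submission was interrupted, e.g. by cancel() from a signal handler
        downloader.wait_for_all()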

BDownloader.download(path, url)

    Submit a single downloading job.

  • Similar to BDownloader.downloads(); in fact, it is just a special case of it, taking [(path, url)] composed from the given parameters as the input.

    Note that BDownloaderException will be raised if the download was interrupted, e.g. by calling BDownloader.cancel() in a SIGINT signal handler, in the process of submitting the download request.

:warning:
The limitations on this method and on its path parameter are the same as in BDownloader.downloads().

BDownloader.wait_for_all()

    Wait for all the downloading jobs to complete. Returns a 2-tuple of lists (succeeded, failed). The first list, succeeded, contains the originally passed (path, url) tuples that completed successfully, while the second list, failed, contains those that raised an exception or were cancelled.
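
For instance, the outcome can be inspected as follows (the path and URL are placeholders):

    from bdownload import BDownloader

    path_urls = [('./afile.tar.gz', 'https://www.afilelink.com/afile.tar.gz')]

    with BDownloader(progress='none') as downloader:
        downloader.downloads(path_urls)
        succeeded, failed = downloader.wait_for_all()

    for path, url in failed:
        print('failed to download {} from {}'.format(path, url))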

BDownloader.results()

    Get both the succeeded and failed downloads when all are done or when interrupted by the user. Returns a 2-tuple of lists of the same form as that returned by BDownloader.wait_for_all().

BDownloader.result()

    Return the final download status: 0 for success, -1 for failure.
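
For example, the overall status can be propagated as the process exit code; a sketch, assuming result() remains valid after the context manager has closed the instance:

    import sys

    from bdownload import BDownloader

    with BDownloader(progress='none') as downloader:
        downloader.download('./afile.tar.gz', 'https://www.afilelink.com/afile.tar.gz')
        downloader.wait_for_all()

    sys.exit(downloader.result())  # 0 on success, -1 on failure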

BDownloader.close()

    Shut down the downloader and perform cleanup.

BDownloader.cancel(keyboard_interrupt=True)

    Cancel all the download jobs.

  • keyboard_interrupt specifies whether the user hit the interrupt key (e.g. Ctrl-C).
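
As mentioned under BDownloader.downloads(), cancel() is typically called from a SIGINT handler; a minimal sketch (the URL is a placeholder):

    import signal

    from bdownload import BDownloader, BDownloaderException

    downloader = BDownloader(progress='none')

    def sigint_handler(signum, frame):
        downloader.cancel(keyboard_interrupt=True)

    signal.signal(signal.SIGINT, sigint_handler)

    try:
        downloader.download('./afile.tar.gz', 'https://www.afilelink.com/afile.tar.gz')
        downloader.wait_for_all()
    except BDownloaderException:
        pass  # the submission was aborted by cancel()
    finally:
        downloader.close()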

bdownload.set_requests_retries_factor(retries)

    Set the retries factor that complements and extends the builtin retry mechanism of urllib3.

  • The retries parameter specifies the maximum number of retries when a decorated method of requests raises an exception or returns a bad status code. It should take a value of at least 1; otherwise nothing changes.
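
For example:

    import bdownload

    # Allow up to 3 extended retries per builtin retry; together with
    # request_retries=2 this bounds the total at 2 * (3 + 1) = 8 retries.
    bdownload.set_requests_retries_factor(3)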

Examples

  • Single file downloading

import unittest
import tempfile
import os
import hashlib

from bdownload import BDownloader


class TestBDownloader(unittest.TestCase):
    def setUp(self):
        self.tmp_dir = tempfile.TemporaryDirectory()

    def tearDown(self):
        self.tmp_dir.cleanup()

    def test_bdownloader_download(self):
        file_path = os.path.join(self.tmp_dir.name, "aria2-x86_64-win.zip")
        file_url = "https://github.com/Jesseatgao/aria2-patched-static-build/releases/download/1.35.0-win-linux/aria2-x86_64-win.zip"
        file_sha1_exp = "16835c5329450de7a172412b09464d36c549b493"

        with BDownloader(max_workers=20, progress='mill') as downloader:
            downloader.download(file_path, file_url)
            downloader.wait_for_all()

        hashf = hashlib.sha1()
        with open(file_path, mode='rb') as f:
            hashf.update(f.read())
        file_sha1 = hashf.hexdigest()

        self.assertEqual(file_sha1_exp, file_sha1)


if __name__ == '__main__':
    unittest.main()

  • Batch file downloading

import unittest
import tempfile
import os
import hashlib

from bdownload import BDownloader


class TestBDownloader(unittest.TestCase):
    def setUp(self):
        self.tmp_dir = tempfile.TemporaryDirectory()

    def tearDown(self):
        self.tmp_dir.cleanup()

    def test_bdownloader_downloads(self):
        files = [
            {
                "file": os.path.join(self.tmp_dir.name, "aria2-x86_64-linux.tar.xz"),
                "url": "https://github.com/Jesseatgao/aria2-patched-static-build/releases/download/1.35.0-win-linux/aria2-x86_64-linux.tar.xz",
                "sha1": "d02dfdab7517e78a257f4403e502f1acc2a795e4"
            },
            {
                "file": os.path.join(self.tmp_dir.name, "mkvtoolnix-x86_64-linux.tar.xz"),
                "url": "https://github.com/Jesseatgao/MKVToolNix-static-builds/releases/download/v47.0.0-mingw-w64-win32v1.0/mkvtoolnix-x86_64-linux.tar.xz",
                "sha1": "19b0c7fc20839693cc0929f092f74820783a9750"
            }
        ]

        file_urls = [(f["file"], f["url"]) for f in files]

        with BDownloader(max_workers=20, progress='mill') as downloader:
            downloader.downloads(file_urls)
            downloader.wait_for_all()

        for f in files:
            hashf = hashlib.sha1()
            with open(f["file"], mode='rb') as fd:
                hashf.update(fd.read())
            file_sha1 = hashf.hexdigest()

            self.assertEqual(f["sha1"], file_sha1)


if __name__ == '__main__':
    unittest.main()

Usage: as a command-line script

Synopsis

bdownload      url | -L URLS [URLS ...]
               [-O OUTPUT | -o OUTPUT [OUTPUT ...]] [-D DIR]
               [-p PROXY] [-n MAX_WORKERS] [-k MIN_SPLIT_SIZE]
               [-s CHUNK_SIZE] [-e COOKIE] [--user-agent USER_AGENT]
               [--referrer REFERRER]
               [--check-certificate {True,true,TRUE,False,false,FALSE}]
               [--ca-certificate CA_CERTIFICATE]
               [--certificate CERTIFICATE] [--private-key PRIVATE_KEY]
               [-P {mill,bar,none}] [--num-pools NUM_POOLS]
               [--pool-size POOL_SIZE] [-l {debug,info,warning,error,critical}]
               [-c | --no-continue]
               [-h]

Description

url

    URL for the file to be downloaded, which can be either a single URL or a TAB-separated composite URL pointing to the same file, e.g. "https://www.afilelink.com/afile.tar.gz", "https://chinshou.libccp.mil/luoxuan1981/panjuan-hangyi/tiqianbaozha-key-yasui/qianjunyifa/bengqiyijiao/i-manual/dashboy-basket/zhongzhenkong/xinghuo-xianghui/chunqiao-electronhive-midianfeng/zhenhudan-yasally/afile.tar.gz", and "https://www.afilelink.com/afile.tar.gz\thttps://nianpei.bpfatran.com/afile.tar.gz"

-L URLS [URLS ...], --url URLS [URLS ...]

    URL(s) for the files to be downloaded, each of which might contain TAB-separated URLs pointing to the same file, e.g. -L https://yoursite.net/yourfile.7z, -L "https://yoursite01.net/thefile.7z\thttps://yoursite02.com/thefile.7z", or --url "http://foo.cc/file1.zip" "http://bar.cc/file2.tgz\thttp://bar2.cc/file2.tgz"

-O OUTPUT, --OUTPUT OUTPUT

    a save-as file name (optionally with absolute or relative (to -D DIR) path), e.g. -O afile.tar.gz https://www.afilelink.com/afile.tar.gz

-o OUTPUT [OUTPUT ...], --output OUTPUT [OUTPUT ...]

    one or more file names (optionally prefixed with relative (to -D DIR) or absolute paths), e.g. -o file1.zip ~/file2.tgz, paired with URLs specified by --url or -L

-D DIR, --dir DIR

    directory in which to save the downloaded files [default: directory in which this App is running]

-p PROXY, --proxy PROXY

    proxy either in the form of "http://[user:pass@]host:port" or "socks5://[user:pass@]host:port"

-n MAX_WORKERS, --max-workers MAX_WORKERS

    number of worker threads [default: 20]

-k MIN_SPLIT_SIZE, --min-split-size MIN_SPLIT_SIZE

    file split size in bytes, "1048576, 1024K or 2M" for example [default: 1M]

-s CHUNK_SIZE, --chunk-size CHUNK_SIZE

    every request range size in bytes, "10240, 10K or 1M" for example [default: 100K]

-e COOKIE, --cookie COOKIE

    cookies either in the form of a string (whitespace- and/or semicolon-separated) like "cookie_key=cookie_value cookie_key2=cookie_value2; cookie_key3=cookie_value3", or in the form of a file, e.g. named "cookies.txt", in the Netscape cookie file format. NB the option -D DIR does not apply to the cookie file

--user-agent USER_AGENT

    custom user agent

--referrer REFERRER

    HTTP request header "Referer" that applies to all downloads. In particular, use * to tell the downloader to take the request URL as the referrer per download [default: *]

--check-certificate {True,true,TRUE,False,false,FALSE}

    whether to verify the server's TLS certificate or not [default: True]

--ca-certificate CA_CERTIFICATE

    path to the preferred CA bundle file (.pem) or to a directory with certificates of trusted CAs in PEM format. NB the directory must have been processed using the c_rehash utility from OpenSSL, and each cert file in the directory may contain only one CA certificate

--certificate CERTIFICATE

    path to a single file in PEM format containing the client certificate and optionally a chain of additional certificates. If --private-key is not provided, then the file must contain the unencrypted private key as well

--private-key PRIVATE_KEY

    path to a file containing the unencrypted private key to the client certificate

-P {mill,bar,none}, --progress {mill,bar,none}

    progress indicator. To disable this feature, use none. [default: mill]

--num-pools NUM_POOLS

    number of connection pools [default: 20]

--pool-size POOL_SIZE

    max number of connections in the pool [default: 20]

-l {debug,info,warning,error,critical}, --log-level {debug,info,warning,error,critical}

    logger level [default: warning]

-c, --continue

    resume from the partially downloaded files. This is the default behavior

--no-continue

    do not resume from last interruption, i.e. start the download from beginning

-h, --help

    show help message and exit

Examples

bdownload https://www.afilelink.com/afile.tar.gz
bdownload -O /abspath/to/afile.tar.gz https://www.afilelink.com/afile.tar.gz
bdownload -O /abspath/to/a/dir/ https://www.afilelink.com/afile.tar.gz
bdownload -O /abspath/to/afile.tar.gz "https://www.afilelink.com/afile.tar.gz\thttps://nianpei.bpfatran.com/afile.tar.gz"
bdownload -D path/to/working_dir/ -O relpath/to/working_dir/alias_afile.tar.gz https://www.afilelink.com/afile.tar.gz
bdownload -D path/to/working/dir https://www.afilelink.com/afile.tar.gz
bdownload -o /abspath/to/file1.zip ~/file2.tgz -L "http://foo.cc/file1.zip" "http://bar.cc/file2.tgz\thttp://bar2.cc/file2.tgz"
bdownload -D path/to/working/dir -L "http://foo.cc/file1.zip" "http://bar.cc/file2.tgz\thttp://bar2.cc/file2.tgz"
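
A few more invocations combining the options documented above; the proxy address, cookie file and user agent are placeholder values:

bdownload -p "socks5://127.0.0.1:1080" https://www.afilelink.com/afile.tar.gz
bdownload -e cookies.txt --user-agent "Mozilla/5.0" https://www.afilelink.com/afile.tar.gz
bdownload --no-continue -P none -l info https://www.afilelink.com/afile.tar.gz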
