# pixelripper

Package and CLI for downloading media from a webpage.
Install with:

```
pip install pixelripper
```
pixelripper contains a class called PixelRipper and a subclass called PixelRipperSelenium.
PixelRipper uses the requests library to fetch webpages, while PixelRipperSelenium uses a Selenium-based engine to do the same.
The selenium engine is slower and requires more resources, but is useful for webpages
that don't render their media content without a JavaScript engine.
It can use either Firefox or Chrome browsers.
Note: You must have the appropriate webdriver for your machine and browser
version installed in order to use PixelRipperSelenium.
pixelripper can be used programmatically or from the command line.
Programmatic usage:
```python
from pathlib import Path

from pixelripper import PixelRipper

ripper = PixelRipper()
# Scrape the page for image, video, and audio urls.
ripper.rip("https://somewebsite.com")
# Any content urls found will now be accessible as members of ripper.
print(ripper.image_urls)
print(ripper.video_urls)
print(ripper.audio_urls)
# All the urls found on a page can be accessed through the ripper.scraper member.
all_urls = ripper.scraper.get_links("all")
# The urls can also be filtered according to a list of extensions
# with the filter_by_extensions function.
# The following will return only .jpg and .mp3 file urls.
urls = ripper.filter_by_extensions([".jpg", ".mp3"])
# The content can then be downloaded.
ripper.download_files(urls, Path.cwd() / "somewebsite")
# Alternatively, everything in ripper.image_urls, ripper.video_urls, and ripper.audio_urls
# can be downloaded with just a call to ripper.download_all()
ripper.download_all(Path.cwd() / "somewebsite")
# Separate subfolders named "images", "videos", and "audio"
# will be created inside the "somewebsite" folder when using this function.
```
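To illustrate what filtering urls by extension involves, here is a standalone sketch (not pixelripper's actual implementation) that keeps only urls whose path ends with one of the given extensions, ignoring any query string:

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def filter_by_extensions(urls: list[str], extensions: list[str]) -> list[str]:
    """Return only the urls whose path has one of the given file extensions."""
    wanted = {ext.lower() for ext in extensions}
    return [
        url
        for url in urls
        # urlparse().path drops the query string, so "b.png?size=large"
        # is still recognized as a .png file.
        if PurePosixPath(urlparse(url).path).suffix.lower() in wanted
    ]


urls = [
    "https://somewebsite.com/a.jpg",
    "https://somewebsite.com/b.png?size=large",
    "https://somewebsite.com/c.mp3",
]
print(filter_by_extensions(urls, [".jpg", ".mp3"]))
# ['https://somewebsite.com/a.jpg', 'https://somewebsite.com/c.mp3']
```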
Command line usage:
```
>pixelripper -h
usage: pixelripper [-h] [-s] [-nh] [-b BROWSER] [-o OUTPUT_PATH]
                   [-eh [EXTRA_HEADERS ...]]
                   url

positional arguments:
  url                   The url to scrape for media.

options:
  -h, --help            show this help message and exit
  -s, --selenium        Use selenium to get page content instead of requests.
  -nh, --no_headless    Don't use headless mode when using -s/--selenium.
  -b BROWSER, --browser BROWSER
                        The browser to use when using -s/--selenium. Can be
                        'firefox' or 'chrome'. You must have the appropriate
                        webdriver installed for your machine and browser
                        version in order to use the selenium engine.
  -o OUTPUT_PATH, --output_path OUTPUT_PATH
                        Output directory to save results to. If not specified,
                        a folder with the name of the webpage will be created
                        in the current working directory.
  -eh [EXTRA_HEADERS ...], --extra_headers [EXTRA_HEADERS ...]
                        Extra headers to use when requesting files as key,
                        value pairs. Keys and values should be colon separated
                        and pairs should be space separated.
                        e.g. -eh Referer:website.com/page Host:website.com
```
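The `-eh/--extra_headers` format above (colon-separated key/value, space-separated pairs) can be turned into a header dict as in this hypothetical sketch (not pixelripper's actual code); `parse_extra_headers` is an illustrative name:

```python
def parse_extra_headers(pairs: list[str]) -> dict[str, str]:
    """Parse "Key:Value" arguments into a headers dict."""
    headers = {}
    for pair in pairs:
        # Split on the first colon only, so values that themselves
        # contain colons (e.g. urls with ports) survive intact.
        key, _, value = pair.partition(":")
        headers[key] = value
    return headers


print(parse_extra_headers(["Referer:website.com/page", "Host:website.com"]))
# {'Referer': 'website.com/page', 'Host': 'website.com'}
```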