Easily convert Common Crawl to an image caption set using PySpark
cc2imgcap
Common Crawl has 7.5M WARC files, which provide the links of the web. This simple tool lets you process one WARC file in about 40 s and extract each image link along with its alt text.
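To make the "image link along with the alt text" step concrete, here is a minimal sketch of pulling such pairs out of one Common Crawl WAT metadata record. It is not the tool's actual code: it assumes the standard WAT JSON layout (Envelope, Payload-Metadata, HTTP-Response-Metadata, HTML-Metadata, Links), and the function name extract_image_captions and the sample record are hypothetical.

```python
import json


def extract_image_captions(wat_record_json):
    """Return (image_url, alt_text) pairs found in one WAT record.

    Assumes the usual Common Crawl WAT layout; records that lack any of
    the nested keys simply yield an empty list.
    """
    record = json.loads(wat_record_json)
    links = (
        record.get("Envelope", {})
        .get("Payload-Metadata", {})
        .get("HTTP-Response-Metadata", {})
        .get("HTML-Metadata", {})
        .get("Links", [])
    )
    pairs = []
    for link in links:
        # In WAT records, <img src=...> links carry the path "IMG@/src".
        if link.get("path") == "IMG@/src" and link.get("url") and link.get("alt"):
            pairs.append((link["url"], link["alt"]))
    return pairs


# Hypothetical record, trimmed to the fields used above.
sample = json.dumps(
    {
        "Envelope": {
            "Payload-Metadata": {
                "HTTP-Response-Metadata": {
                    "HTML-Metadata": {
                        "Links": [
                            {"path": "A@/href", "url": "https://example.com/about"},
                            {
                                "path": "IMG@/src",
                                "url": "https://example.com/cat.jpg",
                                "alt": "A cat on a sofa",
                            },
                            {"path": "IMG@/src", "url": "https://example.com/spacer.gif"},
                        ]
                    }
                }
            }
        }
    }
)

pairs = extract_image_captions(sample)
print(pairs)
```

Images without alt text are dropped here because they cannot serve as captions; whether the tool applies the same filter is an assumption.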
This makes it possible to do the first step of building a dataset like LAION-5B in about 100k CPU core hours, which costs roughly $4k on AWS EC2.
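The estimate can be checked with a few lines of arithmetic. This sketch uses only the figures quoted above and assumes one file is processed per core; the per-core-hour price is not stated anywhere and is merely what the $4k and 100k core hour figures imply.

```python
# Figures quoted in the description.
warc_files = 7_500_000
seconds_per_file = 40
budget_core_hours = 100_000
budget_usd = 4_000

# Assumption: each file occupies one CPU core for the full 40 s.
core_hours = warc_files * seconds_per_file / 3600

# Price implied by the quoted budget, not an official EC2 price.
usd_per_core_hour = budget_usd / budget_core_hours

estimated_usd = core_hours * usd_per_core_hour

print(f"core hours needed: {core_hours:,.0f}")
print(f"implied price per core hour: ${usd_per_core_hour:.2f}")
print(f"estimated cost: ${estimated_usd:,.0f}")
```

The raw figure comes out somewhat under 100k core hours, so the quoted numbers leave headroom for scheduling overhead and slower files.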
Install
pip install cc2imgcap
Python examples
Check out these examples:
- run_on_spark.py shows how to bring your own Spark session
API
This module exposes a single function cc2imgcap which takes the same arguments as the command line tool:
- output_path: the output path; should probably start with s3://. (required)
- wat_index_count: the number of WAT index files to read; can be None for all. (default 1)
- wat_count: the number of WAT files to read; can be None for all; will randomly subsample if present. (default 100)
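A call with the arguments listed above might look like the following sketch. It assumes the function accepts them as keyword arguments under these exact names; the helpers build_cc2imgcap_kwargs and run, and the bucket path, are hypothetical and only there to keep the example checkable without the package installed.

```python
def build_cc2imgcap_kwargs(output_path, wat_index_count=1, wat_count=100):
    """Collect the three documented arguments with their documented defaults."""
    if not output_path:
        raise ValueError("output_path is required")
    return {
        "output_path": output_path,
        "wat_index_count": wat_index_count,  # None means all index files
        "wat_count": wat_count,  # None means all WAT files
    }


def run(kwargs):
    """Start the job; needs cc2imgcap and a working Spark setup."""
    # Imported lazily so the helper above works without the package.
    from cc2imgcap import cc2imgcap

    # Assumption: the documented names are accepted as keyword arguments.
    cc2imgcap(**kwargs)


# Small trial run: one index file, 100 randomly subsampled WAT files.
trial = build_cc2imgcap_kwargs("s3://my-bucket/cc2imgcap-trial")

# Full run: None for both counts reads everything.
full = build_cc2imgcap_kwargs(
    "s3://my-bucket/cc2imgcap-full", wat_index_count=None, wat_count=None
)

print(trial)
print(full)
```

Passing either dictionary to run(...) would launch the actual processing, so it is worth starting with the small trial configuration.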
For development
Work either locally or in Gitpod (run export PIP_USER=false there).
Set up a virtualenv:
python3 -m venv .env
source .env/bin/activate
pip install -e .
To run tests:
pip install -r requirements-test.txt
Then:
make lint
make test
You can use make black to reformat the code.
Use python -m pytest -x -s -v tests -k "dummy" to run a specific test.