Easily convert common crawl to image caption set using pyspark

Development Status
- 4 - Beta
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Programming Language
- Python :: 3.6
Topic
- Scientific/Engineering :: Artificial Intelligence

Project description

cc2imgcap

Easily convert common crawl to image caption set using pyspark.

Common Crawl has 7.5M WARC files, which provide the links found across the web. This simple tool processes one WARC file in about 40s and extracts each image link along with its alt text.

This makes it possible to do the first step of building a dataset like LAION-5B in about 100k CPU core hours. That's about $4k using AWS EC2.
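The estimate above can be checked with a few lines of arithmetic. This is only a sketch: it assumes the 40s figure is the time for one file on one CPU core, and the per-core-hour price is simply what the quoted $4k and 100k core hours imply, not a published AWS rate.

```python
# Figures quoted in the text above.
WARC_FILES = 7_500_000        # number of files in Common Crawl
SECONDS_PER_FILE = 40         # assumed to be single-core processing time
BUDGET_CORE_HOURS = 100_000   # rounded figure quoted in the text
QUOTED_COST_USD = 4_000

# Total compute if every file takes 40 s on one core.
core_seconds = WARC_FILES * SECONDS_PER_FILE
core_hours = core_seconds / 3600

# Price per core hour implied by the quoted budget and cost.
implied_usd_per_core_hour = QUOTED_COST_USD / BUDGET_CORE_HOURS

print(f"core hours needed: {core_hours:,.0f}")
print(f"implied price: ${implied_usd_per_core_hour:.2f} per core hour")
```

Under these assumptions the raw total is roughly 83k core hours, which the text rounds up to 100k.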

Install

pip install cc2imgcap

Python examples

Check out these examples:

run_on_spark.py shows how to bring your own Spark session

API

This module exposes a single function cc2imgcap which takes the same arguments as the command line tool:

output_path: the output path, should probably start with s3://. (required)
wat_index_count: the number of wat index files to read, can be None for all. (default 1)
wat_count: the number of wat files to read, can be None for all, will randomly subsample if present. (default 100)
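A minimal sketch of calling the function with the arguments listed above. It assumes the function can be imported as `from cc2imgcap import cc2imgcap` and that it accepts these names as keyword arguments; neither is confirmed here. `make_job_args` and `run_job` are hypothetical helpers written for this example, not part of the package.

```python
def make_job_args(output_path, wat_index_count=1, wat_count=100):
    """Collect the three documented arguments with their documented defaults.

    Hypothetical helper, not part of cc2imgcap.
    """
    if not output_path:
        raise ValueError("output_path is required")
    return {
        "output_path": output_path,
        "wat_index_count": wat_index_count,  # None means all index files
        "wat_count": wat_count,              # None means all wat files
    }


def run_job(output_path, **overrides):
    """Run cc2imgcap with the collected arguments (hypothetical wrapper)."""
    # Imported lazily so make_job_args can be used without Spark installed.
    # The import path is an assumption based on the module description.
    from cc2imgcap import cc2imgcap

    cc2imgcap(**make_job_args(output_path, **overrides))
```

For a small trial run one might call `run_job("s3://my-bucket/cc2imgcap-output", wat_count=10)`, where the bucket name is a placeholder.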

For development

Work either locally or in Gitpod (run export PIP_USER=false there).

Set up a virtualenv:

python3 -m venv .env
source .env/bin/activate
pip install -e .

To run the tests, first install the test requirements:

pip install -r requirements-test.txt

Then run:

make lint
make test

You can use make black to reformat the code.

To run a specific test, use python -m pytest -x -s -v tests -k "dummy".

Release history

1.3.0 (Dec 5, 2022)
1.2.0 (Dec 1, 2022)
1.1.0 (Dec 1, 2022)
1.0.0 (Nov 30, 2022), this version

Download files

Source Distribution

cc2imgcap-1.0.0.tar.gz (5.2 kB)

Uploaded Nov 30, 2022 Source

Built Distribution

cc2imgcap-1.0.0-py3-none-any.whl (6.4 kB)

Uploaded Nov 30, 2022 Python 3

File details

Details for the file cc2imgcap-1.0.0.tar.gz.

File metadata

Download URL: cc2imgcap-1.0.0.tar.gz
Upload date: Nov 30, 2022
Size: 5.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.1 CPython/3.8.14

File hashes

Hashes for cc2imgcap-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`0bc6d95765795f1d83252de79b274398439522925cee3a6b7548eac064e9460f`
MD5	`90338f6c08d2f4ddb8f5c99ededc8559`
BLAKE2b-256	`cb9b60aa815cc681a561055fc6e6aa643f6ccd8deca25aac4879dcea57a2f91a`

File details

Details for the file cc2imgcap-1.0.0-py3-none-any.whl.

File metadata

Download URL: cc2imgcap-1.0.0-py3-none-any.whl
Upload date: Nov 30, 2022
Size: 6.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.1 CPython/3.8.14

File hashes

Hashes for cc2imgcap-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`1a4e3de88d2b6ecc179737b6ce098c9f7f2c77b3851657229face6fd712f8ef8`
MD5	`a8fdd5dbf40ba7c89c00ef3db7b33fb5`
BLAKE2b-256	`d8eca713ccefa756d3845f2f61fd43d467feafa8b0cd1c3644bf45da20a2fffe`
