Easily convert common crawl to image caption set using pyspark

Development Status
- 4 - Beta
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Programming Language
- Python :: 3.6
Topic
- Scientific/Engineering :: Artificial Intelligence

Project description

cc2imgcap

Easily convert common crawl to image caption set using pyspark.

Common Crawl has 5M WAT files, which provide the links of the web. This simple tool lets you process one WAT file in about 50s and get each image link along with its alt text.

It also runs deduplication against url+text in order to save on output space and speed up the process.
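As a rough illustration of what deduplicating on url+text means, here is a minimal pure-Python sketch. It is not the tool's own implementation (which runs on Spark), and the sample rows are made up. In Spark, a comparable operation would be DataFrame.dropDuplicates on the url and text columns, assuming the columns carry those names.

```python
from typing import Iterable, Iterator, Tuple


def dedup_url_text(rows: Iterable[Tuple[str, str]]) -> Iterator[Tuple[str, str]]:
    """Yield each (url, text) pair once, keeping the first occurrence."""
    seen = set()
    for url, text in rows:
        key = (url, text)
        if key not in seen:
            seen.add(key)
            yield key


# Hypothetical sample rows, for illustration only.
rows = [
    ("http://example.com/cat.jpg", "a cat"),
    ("http://example.com/cat.jpg", "a cat"),       # exact duplicate, dropped
    ("http://example.com/cat.jpg", "my cat Tom"),  # same url, different text, kept
    ("http://example.com/dog.jpg", "a dog"),
]
unique_rows = list(dedup_url_text(rows))
print(len(rows), "rows in,", len(unique_rows), "rows out")
```

Note that the key is the pair, so the same image url with a different alt text counts as a distinct row.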

This makes it possible to do the first step of building a dataset like laion5B in about 70k CPU core hours (5*10^6*50/3600). That's about $2.8k on AWS EC2 (at $0.04 per core hour).
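The estimate can be reproduced with a few lines of Python. The constants are the figures quoted in the text; the variable names are only for illustration.

```python
# Figures quoted in the text above.
WAT_FILE_COUNT = 5 * 10**6   # WAT files in Common Crawl
SECONDS_PER_WAT = 50         # processing time for one file on one core
USD_PER_CORE_HOUR = 0.04     # quoted AWS EC2 price

core_hours = WAT_FILE_COUNT * SECONDS_PER_WAT / 3600
cost_usd = core_hours * USD_PER_CORE_HOUR

print(f"{core_hours:,.0f} core hours, about ${cost_usd:,.0f}")
```

This gives a little under 70k core hours and a little under $2.8k, which the text rounds up.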

What hardware to pick?

cpu128-dy-c6i-32xlarge instances are advised. Spark stores the non duplicated first stage on local disk, so the disks should be NVMe drives for speed during deduplication.

At this first stage, one WAT file takes about 20MB, so the total space (over all workers) must be more than 20MB times the WAT count. For the whole CC, that means 100TB, which can fit for example in 150 instances with a 1TB NVMe drive each. 150 instances of 128 cores is 19200 cores, so the whole processing takes 2h. Fewer instances with bigger drives can work too. If temporary disk space is an issue, it is also possible to do the processing in multiple pieces by specifying --multipart.
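The sizing above can be sketched as a short calculation. The constants come from the text, the names are illustrative, and decimal units (1TB = 10^6 MB) are assumed. The last lines compute an ideal-scaling wall-clock time from the 50s per file figure; that comes out near 3.6h, the same order of magnitude as the 2h quoted above, so both are best read as rough estimates.

```python
import math

# Figures quoted in the text.
WAT_FILE_COUNT = 5 * 10**6
MB_PER_WAT = 20              # first-stage output per WAT file
NVME_TB_PER_INSTANCE = 1
CORES_PER_INSTANCE = 128
INSTANCE_COUNT = 150
SECONDS_PER_WAT = 50

# Temporary disk needed across all workers (assuming 1 TB = 10^6 MB).
total_tb = WAT_FILE_COUNT * MB_PER_WAT / 10**6

# Smallest instance count whose combined NVMe space holds the first stage.
min_instances = math.ceil(total_tb / NVME_TB_PER_INSTANCE)

# Wall-clock estimate, assuming perfect scaling across all cores.
total_cores = INSTANCE_COUNT * CORES_PER_INSTANCE
core_hours = WAT_FILE_COUNT * SECONDS_PER_WAT / 3600
ideal_hours = core_hours / total_cores

print(f"disk: {total_tb:.0f}TB, at least {min_instances} instances")
print(f"{total_cores} cores, ideal wall-clock: {ideal_hours:.1f}h")
```

With 150 instances there is 150TB of NVMe space in total, which leaves headroom over the 100TB minimum.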

Document type

This tool supports extracting several document types from CC:

image/text: about 300B after dedup
audio/text: about 3B after dedup

They can be selected with, for example, --document_type audio. You may experiment with more document kinds by running the single_warc_example.py example with python and exploring the resulting output.parquet.

Install

pip install cc2imgcap

Python examples

Check out these examples:

run_on_spark.py shows how to bring your own Spark session

If you have a slurm cluster, refer to https://gist.github.com/rom1504/67ada3dedbecc113ae2dbdfd9c642d83 to start a spark cluster there.
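For readers who want a feel for what bringing your own Spark session involves, here is a minimal sketch of a builder function. It is not a copy of run_on_spark.py: the app name, master URL and memory setting are illustrative assumptions, and it assumes the builder is called with no arguments and returns a SparkSession.

```python
def build_spark_session():
    """Create a small local Spark session (illustrative settings only)."""
    # Imported lazily so that this module can be loaded without pyspark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName("cc2imgcap")                 # assumed name, any string works
        .master("local[4]")                   # 4 local cores; use your cluster URL instead
        .config("spark.driver.memory", "8G")  # assumed value, tune for your machine
        .getOrCreate()
    )
```

A function like this could then be passed as the spark_builder argument described in the API section.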

API

This module exposes a single function cc2imgcap which takes the same arguments as the command line tool:

output_path: the output path, should probably start with s3://. The output will be written to this path suffixed by the date (required)
wat_index_count: the number of WAT index files to read, can be None for all (default 1)
wat_count: the number of WAT files to read, can be None for all, will randomly subsample if present (default 100)
master: the Spark master url (default local)
num_cores: the number of cores of each Spark executor (default 128)
mem_gb: the memory of each Spark executor (default 256)
multipart: runs the processing in the specified number of parts and merges them at the end (default None)
shuffle: randomly shuffle the output right before saving (default True)
resume: the specific path of the output to resume (default None)
spark_builder: a function that creates a Spark session, None will default to the built-in methods (default None)
document_type: the kind of document to extract (default image)
source_cc_protocol: get Common Crawl from http or s3 (default s3)
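A small local run might look like the sketch below. The argument names come from the list above; the import path (from cc2imgcap import cc2imgcap), the output path and the reduced num_cores and mem_gb values are assumptions for illustration.

```python
def small_local_run_kwargs(output_path: str) -> dict:
    """Arguments for a small test run, using names from the list above."""
    return {
        "output_path": output_path,
        "wat_index_count": 1,          # documented default
        "wat_count": 100,              # documented default
        "master": "local",             # documented default
        "num_cores": 4,                # assumed: lowered for a laptop
        "mem_gb": 8,                   # assumed: lowered for a laptop
        "document_type": "image",      # documented default
        "source_cc_protocol": "http",  # documented alternative to s3
    }


def run(output_path: str) -> None:
    """Launch the job; requires the package and pyspark to be installed."""
    from cc2imgcap import cc2imgcap  # assumed import path

    cc2imgcap(**small_local_run_kwargs(output_path))


# Hypothetical local output path, for illustration only.
kwargs = small_local_run_kwargs("/tmp/cc2imgcap_output")
print(kwargs)
```

Calling run("/tmp/cc2imgcap_output") would start the actual job; switching document_type to "audio" selects the audio/text documents instead.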

For development

Either locally, or in gitpod (do export PIP_USER=false there)

Set up a virtualenv:

python3 -m venv .env
source .env/bin/activate
pip install -e .

To run tests:

pip install -r requirements-test.txt

then

make lint
make test

You can use make black to reformat the code.

Use python -m pytest -x -s -v tests -k "dummy" to run a specific test.

Release history

This version

1.3.0

Dec 5, 2022

1.2.0

Dec 1, 2022

1.1.0

Dec 1, 2022

1.0.0

Nov 30, 2022

Download files

Source Distribution

cc2imgcap-1.3.0.tar.gz (8.8 kB)

Uploaded Dec 5, 2022 Source

Built Distribution

cc2imgcap-1.3.0-py3-none-any.whl (10.7 kB)

Uploaded Dec 5, 2022 Python 3

Hashes for cc2imgcap-1.3.0.tar.gz
Algorithm	Hash digest
SHA256	`4fb9ca2904f59eca3afe60800661b00b1682d926bd7364063efd32a5bcd4ccca`
MD5	`02b341f66eb624b76db7b37473ac5296`
BLAKE2b-256	`74b45a93c64d027823e3be9a4ca733839e6abea6c4147851dfc81695fd498199`

Hashes for cc2imgcap-1.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`8120ccf2d2682aef385f73eade22c36404598e77085ba7f12774bfc879a55dda`
MD5	`1f2551389c4cabf71a10efefa5af4197`
BLAKE2b-256	`4b91b4d25f0d6cd91c86608c8fdb14869c16fd020c185fedb8a1dd1abe5235d1`