cc2imgcap

Easily convert Common Crawl into an image-caption dataset using PySpark.

Common Crawl has 7.5M WARC files, which record links from across the web. This simple tool processes one WARC file in about 40 s and extracts each image link along with its alt text.
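To make the output concrete, here is a minimal sketch of pairing image links with their alt text. It uses only the Python standard library on a raw HTML string, so it illustrates the idea rather than the tool's actual PySpark pipeline; the names ImgAltParser and extract_image_captions are hypothetical and not part of cc2imgcap.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImgAltParser(HTMLParser):
    """Collect (image URL, alt text) pairs from <img> tags.

    Hypothetical helper for illustration; not part of cc2imgcap.
    """

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr = dict(attrs)
        src, alt = attr.get("src"), attr.get("alt")
        # Keep only images that have both a source and a non-empty alt text.
        if src and alt and alt.strip():
            # Resolve relative links against the page URL.
            self.pairs.append((urljoin(self.base_url, src), alt.strip()))


def extract_image_captions(html, base_url):
    """Return a list of (absolute image URL, alt text) tuples."""
    parser = ImgAltParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.pairs
```

Images without alt text are skipped here, on the assumption that a caption dataset only wants pairs where both fields are present.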

This makes it possible to do the first step of building a dataset like LAION-5B in about 100k CPU core hours. That is roughly $4k on AWS EC2.
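The estimate can be checked with a few lines of arithmetic. This sketch uses only the figures quoted above; the per-core-hour price is not stated anywhere in this README, so it is derived from the $4k and 100k core-hour figures and should be read as an implied value, not a quoted EC2 rate.

```python
# Figures quoted in this README.
WARC_FILES = 7_500_000        # number of files to process
SECONDS_PER_FILE = 40         # approximate processing time per file
BUDGET_CORE_HOURS = 100_000   # rounded compute estimate
BUDGET_USD = 4_000            # rounded cost estimate

# Total compute if every file takes about 40 s on one core.
core_hours = WARC_FILES * SECONDS_PER_FILE / 3600

# Price per core hour implied by the two rounded figures (an assumption,
# not a published AWS price).
implied_usd_per_core_hour = BUDGET_USD / BUDGET_CORE_HOURS

estimated_cost_usd = core_hours * implied_usd_per_core_hour

print(f"core hours: {core_hours:,.0f}")
print(f"implied price: ${implied_usd_per_core_hour:.2f} per core hour")
print(f"estimated cost: ${estimated_cost_usd:,.0f}")
```

The raw product comes out somewhat below 100k core hours, so the quoted figures appear to be rounded up and leave some headroom.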

Install

pip install cc2imgcap

Python examples

Check out these examples:

API

This module exposes a single function, cc2imgcap, which takes the same arguments as the command-line tool:

  • output_path: the output path; it should probably start with s3://. (required)
  • wat_index_count: the number of WAT index files to read; None reads all of them. (default 1)
  • wat_count: the number of WAT files to read; None reads all of them, and if set, that many files are randomly subsampled. (default 100)
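A call using the three arguments listed above might look like the following sketch. The import path from cc2imgcap import cc2imgcap is an assumption based on the package and function names; build_cc2imgcap_kwargs and run are hypothetical helpers added here only to apply the documented defaults and catch obvious mistakes before a job is launched.

```python
def build_cc2imgcap_kwargs(output_path, wat_index_count=1, wat_count=100):
    """Collect the documented arguments, applying the documented defaults.

    Hypothetical helper for illustration; not part of cc2imgcap.
    """
    if not output_path:
        raise ValueError("output_path is required")
    for name, value in (
        ("wat_index_count", wat_index_count),
        ("wat_count", wat_count),
    ):
        # None means "read all"; otherwise expect a positive integer.
        if value is not None and (not isinstance(value, int) or value < 1):
            raise ValueError(f"{name} must be a positive integer or None")
    return {
        "output_path": output_path,
        "wat_index_count": wat_index_count,
        "wat_count": wat_count,
    }


def run(**kwargs):
    """Launch the job; requires cc2imgcap and a working Spark setup."""
    # Imported lazily so the helper above can be used without the package.
    from cc2imgcap import cc2imgcap  # assumed import path

    cc2imgcap(**kwargs)
```

For example, run(**build_cc2imgcap_kwargs("s3://my-bucket/cc2imgcap-output", wat_count=10)) would request a small test run; the bucket name is a placeholder.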

For development

Develop either locally or in Gitpod (in Gitpod, run export PIP_USER=false first).

Set up a virtualenv:

python3 -m venv .env
source .env/bin/activate
pip install -e .

To run the tests, first install the test requirements:

pip install -r requirements-test.txt

Then run:

make lint
make test

Use make black to reformat the code.

Use python -m pytest -x -s -v tests -k "dummy" to run a specific test.
