Clark University package for crawling YouTube and cleaning the collected data

These details have not been verified by PyPI

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

clarku-youtube-crawler

Clark University YouTube crawler and decoder for the JSON it collects from YouTube. Please read the documentation in DOCS.

PyPI page: "https://pypi.org/project/clarku-youtube-crawler/"

Installing

To install, run:

pip install clarku-youtube-crawler

The crawler needs several other packages to function. All dependencies are already declared in the package, so this should not happen, but if any requirements are missing, download requirements.txt, navigate to the folder that contains it, and run:

pip install -r requirements.txt
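
To confirm that the install succeeded, the installed version can be read back from Python. This is a minimal sketch using only the standard library; the helper name installed_version is made up for illustration and is not part of the package.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name="clarku-youtube-crawler"):
    """Return the installed version string, or None if the package is absent.

    installed_version is a hypothetical helper, not part of the crawler.
    """
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Prints the version string, or None when the package is not installed.
print(installed_version())
```

If this prints None, the package is not installed in the interpreter you are running.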

Upgrading

To upgrade, run:

pip install clarku-youtube-crawler --upgrade

After upgrading, go to the project folder and delete config.ini if it already exists.
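
The config.ini cleanup can also be scripted. The sketch below assumes config.ini sits directly in the project folder, as the instruction above implies; remove_stale_config is a hypothetical helper name, not something the package provides.

```python
from pathlib import Path


def remove_stale_config(project_dir="."):
    """Delete config.ini from project_dir if present.

    Returns True when a file was deleted, False when there was nothing to do.
    Hypothetical helper for illustration only.
    """
    config_path = Path(project_dir) / "config.ini"
    if config_path.is_file():
        config_path.unlink()
        return True
    return False


# Example (run from the project folder after upgrading):
# remove_stale_config(".")
```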

YouTube API Key

Go to https://cloud.google.com/, click Console, and create a project. Under Credentials, copy the API key.
In your project folder, create a file named exactly "DEVELOPER_KEY.txt" and paste your API key into it.
To use multiple API keys, put each key on its own line in DEVELOPER_KEY.txt.
The crawler uses up the quota of one key, then moves on to the next, until the quota of every key is exhausted.
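
The key-rotation behavior described above can be pictured roughly as follows. This is an illustrative sketch only, not the package's actual implementation: load_api_keys, call_with_key_rotation, and the QuotaExhausted exception are hypothetical names, and the real crawler may detect quota errors differently.

```python
from pathlib import Path


class QuotaExhausted(Exception):
    """Hypothetical signal that a key has no quota left."""


def load_api_keys(path="DEVELOPER_KEY.txt"):
    """Read one API key per line, skipping blank lines and stray whitespace."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def call_with_key_rotation(keys, request):
    """Try request(key) with each key in order until one succeeds.

    request is any callable that takes an API key and raises QuotaExhausted
    when that key is spent. Raises QuotaExhausted once every key is spent.
    """
    for key in keys:
        try:
            return request(key)
        except QuotaExhausted:
            continue  # this key is used up; move on to the next one
    raise QuotaExhausted("all API keys have used up their quota")
```

Keeping one key per line means a key can be added or retired by editing a single line of DEVELOPER_KEY.txt.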

Example usage

Case 1: Crawl videos by keywords.

import clarku_youtube_crawler as cu

# Crawl all JSONs
crawler = cu.RawCrawler()
crawler.build("low visibility")
crawler.crawl("low visibility", start_date=14, start_month=12, start_year=2020, day_count=5)
crawler.crawl("blind", start_date=14, start_month=12, start_year=2020, day_count=5)
crawler.merge_to_workfile()
crawler.crawl_videos_in_list(comment_page_count=1)
crawler.merge_all(save_to='low visibility/all_videos.json')

# Convert JSON to CSV
decoder = cu.JSONDecoder()
decoder.json_to_csv(data_file='low visibility/all_videos.json')

# Crawl subtitles from CSV
# If you don't need subtitles, delete the following lines
subtitleCrawler = cu.SubtitleCrawler()
subtitleCrawler.build("low visibility")
subtitleCrawler.crawl_csv(
    videos_to_collect="low visibility/videos_to_collect.csv",
    video_id="videoId",
    sub_title_dir="low visibility/subtitles/"
)
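
The crawl() calls above take the start day, month, and year as separate integers. If you already have a datetime.date, a small helper can build those keyword arguments. This is a sketch: crawl_kwargs is a hypothetical helper, and only the parameter names start_date, start_month, start_year, and day_count come from the example above.

```python
from datetime import date


def crawl_kwargs(start, day_count):
    """Turn a date and a day count into the keyword arguments used by crawl().

    Hypothetical convenience helper; not part of the package.
    """
    return {
        "start_date": start.day,
        "start_month": start.month,
        "start_year": start.year,
        "day_count": day_count,
    }


kwargs = crawl_kwargs(date(2020, 12, 14), 5)
# Equivalent to the first crawl() call in Case 1:
# crawler.crawl("low visibility", **kwargs)
```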

Case 2: Crawl videos from a list of IDs given in the videoId column of an input CSV.

import clarku_youtube_crawler as cu

crawler = cu.RawCrawler()
work_dir = "blind"
crawler.build(work_dir)

# Replace videos_to_collect.csv with your own CSV file, and name its video ID column via video_id.
# Each video ID must be ":" followed by the YouTube video ID, e.g. ":wl4m1Rqmq-Y".

crawler.crawl_videos_in_list(video_list_workfile="videos_to_collect.csv",
                             comment_page_count=1,
                             search_key="blind",
                             video_id="videoId"
                             )
crawler.merge_all(save_to='all_raw_data.json')
decoder = cu.JSONDecoder()
decoder.json_to_csv(data_file='all_raw_data.json')

# Crawl subtitles from CSV
# If you don't need subtitles, delete the following lines
subtitleCrawler = cu.SubtitleCrawler()
subtitleCrawler.build(work_dir)
subtitleCrawler.crawl_csv(
    videos_to_collect="videos_to_collect.csv",
    video_id="videoId",
    sub_title_dir="YouTube_CSV/subtitles/"
)
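
Case 2 expects an input CSV whose video IDs carry a leading ":". One way to produce such a file from plain YouTube IDs is sketched below using the standard csv module. write_video_list is a hypothetical helper; the videoId column name and the ":" prefix rule come from the comments in the example above.

```python
import csv


def write_video_list(video_ids, csv_path="videos_to_collect.csv", column="videoId"):
    """Write video IDs to a one-column CSV, adding the ":" prefix if missing.

    Hypothetical helper for preparing the Case 2 input file.
    """
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow([column])  # header row names the ID column
        for vid in video_ids:
            prefixed = vid if vid.startswith(":") else ":" + vid
            writer.writerow([prefixed])


# Example:
# write_video_list(["wl4m1Rqmq-Y"])  # writes ":wl4m1Rqmq-Y" under "videoId"
```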

Case 3: Search a list of channels by search keys, then crawl all videos belonging to those channels.

import clarku_youtube_crawler as cu

chCrawler = cu.ChannelCrawler()
work_dir = "low visibility"
chCrawler.build(work_dir)
# You can search different channels. All results will be merged
chCrawler.search_channel("low visibility")
chCrawler.search_channel("blind")
chCrawler.merge_to_workfile()
chCrawler.crawl()

# Crawl videos posted by the selected channels. channels_to_collect.csv records which search key found each channel.
crawler = cu.RawCrawler()
crawler.build(work_dir)
crawler.merge_to_workfile(file_dir=work_dir + "/video_search_list/")
crawler.crawl_videos_in_list(comment_page_count=1)
crawler.merge_all()

# Convert JSON to CSV
decoder = cu.JSONDecoder()
decoder.json_to_csv(data_file=work_dir + '/all_videos_visibility.json')

# Crawl subtitles from CSV
# If you don't need subtitles, delete the following lines
subtitleCrawler = cu.SubtitleCrawler()
subtitleCrawler.build(work_dir)
subtitleCrawler.crawl_csv(
    videos_to_collect=work_dir+"/videos_to_collect.csv",
    video_id="videoId",
    sub_title_dir=work_dir+"/subtitles/"
)

Case 4: You already have a list of channels and want to crawl all videos posted by those channels.

import clarku_youtube_crawler as cu

work_dir = 'disability'
chCrawler = cu.ChannelCrawler()
chCrawler.build(work_dir)

chCrawler.crawl(filename='channels_to_collect.csv', channel_header="channelId")

# Crawl videos posted by the selected channels
crawler = cu.RawCrawler()
crawler.build(work_dir)
crawler.merge_to_workfile(file_dir=work_dir + "/video_search_list/")
crawler.crawl_videos_in_list(comment_page_count=10)  # 100 comments per page, so 10 pages crawl up to 1000 comments
crawler.merge_all()

# Convert JSON to CSV
decoder = cu.JSONDecoder()
decoder.json_to_csv(data_file=work_dir + '/all_videos.json')

# Crawl subtitles from CSV
subtitleCrawler = cu.SubtitleCrawler()
subtitleCrawler.build(work_dir)
subtitleCrawler.crawl_csv(
    videos_to_collect=work_dir + "/videos_to_collect.csv",
    video_id="videoId",
    sub_title_dir=work_dir + "/subtitles/"
)
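
The comment on crawl_videos_in_list in Case 4 states that each comment page holds 100 comments. Working backwards from a target number of comments to a comment_page_count is simple ceiling division, sketched here; pages_for_comments is a hypothetical helper and the default of 100 per page is taken from that comment.

```python
import math


def pages_for_comments(target_comments, per_page=100):
    """Smallest number of comment pages that covers target_comments.

    per_page defaults to the 100 comments per page noted in Case 4.
    """
    if target_comments <= 0:
        return 0
    return math.ceil(target_comments / per_page)


# Example:
# crawler.crawl_videos_in_list(comment_page_count=pages_for_comments(1000))
```

A video with fewer comments than requested simply yields fewer, so the result is an upper bound on pages rather than a guarantee of that many comments.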

Release history

2.1.3

Jul 22, 2022

2.1.2

Mar 22, 2022

2.1.1

Mar 22, 2022

2.1.0

Mar 9, 2022

2.0.10

Feb 12, 2022

2.0.9

Feb 12, 2022

2.0.8

Feb 12, 2022

2.0.7

Feb 12, 2022

This version

2.0.6

Jan 25, 2022

2.0.3

Jan 24, 2022

2.0.2

Jan 24, 2022

2.0.1

Jan 24, 2022

1.3.11

Dec 10, 2021

1.3.10

Nov 24, 2021

1.3.9

Nov 24, 2021

1.3.8

Nov 24, 2021

1.3.7

Nov 24, 2021

1.3.6

Nov 24, 2021

1.3.5

Nov 24, 2021

1.3.4

Nov 24, 2021

1.3.3

Nov 24, 2021

1.3.2

Nov 24, 2021

1.3.1

Nov 23, 2021

1.3

Nov 23, 2021

1.2.1

Oct 20, 2021

1.2.0

Oct 20, 2021

1.1.15

Oct 19, 2021

1.1.14

Oct 5, 2021

1.1.13

Jun 8, 2021

1.1.12

Jun 8, 2021

1.1.11

Jun 7, 2021

1.1.10

Jan 29, 2021

1.1.9

Jan 29, 2021

1.1.8

Jan 29, 2021

1.1.7

Jan 28, 2021

1.1.6

Jan 28, 2021

1.1.5

Jan 28, 2021

1.1.4

Jan 28, 2021

1.1.3

Jan 20, 2021

1.1.2

Jan 14, 2021

1.1.2.dev0 pre-release

Jan 14, 2021

1.1.1

Jan 14, 2021

1.1.1.dev0 pre-release

Jan 14, 2021

1.1.0

Jan 12, 2021

1.0.9

Jan 12, 2021

1.0.8

Jan 11, 2021

1.0.7

Jan 8, 2021

1.0.6

Jan 8, 2021

1.0.5

Jan 8, 2021

1.0.4

Jan 8, 2021

1.0.3

Jan 8, 2021

1.0.2

Jan 8, 2021

1.0.2.dev0 pre-release

Jan 8, 2021

1.0.1

Dec 21, 2020

1.0.1.dev0 pre-release

Jan 8, 2021

1.0.0

Dec 21, 2020

0.0.7.dev0 pre-release

Dec 21, 2020

0.0.6

Dec 18, 2020

0.0.6.dev0 pre-release

Dec 19, 2020

0.0.5

Dec 18, 2020

0.0.4

Dec 18, 2020

0.0.3

Dec 18, 2020

0.0.2

Dec 18, 2020

0.0.1

Dec 18, 2020

Download files

Source Distribution

clarku_youtube_crawler-2.0.6.tar.gz (16.2 kB)

Uploaded Jan 25, 2022 Source

Built Distribution

clarku_youtube_crawler-2.0.6-py3-none-any.whl (18.4 kB)

Uploaded Jan 25, 2022 Python 3

File details

Details for the file clarku_youtube_crawler-2.0.6.tar.gz.

File metadata

Download URL: clarku_youtube_crawler-2.0.6.tar.gz
Upload date: Jan 25, 2022
Size: 16.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/59.2.0 requests-toolbelt/0.9.1 tqdm/4.42.1 CPython/3.8.1

File hashes

Hashes for clarku_youtube_crawler-2.0.6.tar.gz
Algorithm	Hash digest
SHA256	`4732ae97ac9f1ab21bf1fbbd989ffee02ca5379cd5c6f4a267f30d4ae8e050a4`
MD5	`5c025bab73d50a4114dd812b5a755047`
BLAKE2b-256	`55e3c06db238fff1bb9bccb97edcf6d24117e5cd3055ec315c4d2cb99cd94fbf`
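
A downloaded file can be checked against the SHA256 digest listed above with the standard hashlib module. This is a minimal sketch: sha256_of_file and matches_expected are hypothetical helper names, and the local file path is whatever location you downloaded the archive to.

```python
import hashlib

# SHA256 digest listed above for clarku_youtube_crawler-2.0.6.tar.gz
EXPECTED_SDIST_SHA256 = (
    "4732ae97ac9f1ab21bf1fbbd989ffee02ca5379cd5c6f4a267f30d4ae8e050a4"
)


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SDIST_SHA256):
    """True when the file's SHA256 digest equals the expected hex string."""
    return sha256_of_file(path) == expected.lower()


# Example (after downloading the source distribution):
# matches_expected("clarku_youtube_crawler-2.0.6.tar.gz")
```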

File details

Details for the file clarku_youtube_crawler-2.0.6-py3-none-any.whl.

File metadata

Download URL: clarku_youtube_crawler-2.0.6-py3-none-any.whl
Upload date: Jan 25, 2022
Size: 18.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/59.2.0 requests-toolbelt/0.9.1 tqdm/4.42.1 CPython/3.8.1

File hashes

Hashes for clarku_youtube_crawler-2.0.6-py3-none-any.whl
Algorithm	Hash digest
SHA256	`faf8f3b53119caa2398d611d7d89635cb9371e91b035b60b28669f2c859400fe`
MD5	`7678a2f196185f5ba02364bda1e739d9`
BLAKE2b-256	`bbae8e91e967f3f9bcf7c7d3a8d9f662cf659dd27bb012d062aaf9011f009c4c`

clarku-youtube-crawler 2.0.6
