Data collection manager
aswan
Collect and organize data into a T1 data lake and T2 tables. Named after the Aswan Dam.
Quickstart
import aswan
config = aswan.AswanConfig.default_from_dir("imdb-env")
celeb_table = config.get_prod_table("person")
movie_table = config.get_prod_table("movie")
project = aswan.Project(config) # this creates the env directories by default
@project.register_handler
class CelebHandler(aswan.UrlHandler):
    url_root = "https://www.imdb.com"

    def parse_soup(self, soup):
        return {
            "name": soup.find("h1").find("span").text.strip(),
            "dob": soup.find("div", id="name-born-info").find("time")["datetime"],
        }
@project.register_handler
class MovieHandler(aswan.UrlHandler):
    url_root = "https://www.imdb.com"

    def parse_soup(self, soup):
        for cast in soup.find("table", class_="cast_list").find_all("td", class_="primary_photo")[:3]:
            link = cast.find("a")["href"]
            self.register_link_to_handler(link, CelebHandler)
        return {
            "title": soup.find("title").text.replace(" - IMDb", "").strip(),
            "summary": soup.find("div", class_="summary_text").text.strip(),
            "year": int(soup.find("span", id="titleYear").find("a").text),
        }
# all this registering can be done simpler :)
project.register_t2_table(celeb_table)
project.register_t2_table(movie_table)
@project.register_t2_integrator
class MovieIntegrator(aswan.FlexibleDfParser):
    handlers = [MovieHandler]

    def url_parser(self, url):
        return {"id": url.split("/")[-1]}

    def get_t2_table(self):
        return movie_table

@project.register_t2_integrator
class CelebIntegrator(aswan.FlexibleDfParser):
    handlers = [CelebHandler]

    def get_t2_table(self):
        return celeb_table
def add_init_urls():
    movie_urls = [
        "https://www.imdb.com/title/tt1045772",
        "https://www.imdb.com/title/tt2543164",
    ]
    person_urls = ["https://www.imdb.com/name/nm0000190"]
    project.add_urls_to_handler(MovieHandler, movie_urls)
    project.add_urls_to_handler(CelebHandler, person_urls)

add_init_urls()
project.run(with_monitor_process=True)
2021-05-09 22:13.42 [info ] running function reset_surls env=prod function_batch=run_prep
...
2021-05-09 22:13.45 [info ] ray dashboard: http://127.0.0.1:8266
...
2021-05-09 22:13.45 [info ] monitor app at: http://localhost:6969
...
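MovieHandler passes the raw `href` of each cast link to `register_link_to_handler`. Hrefs scraped from a page are often relative paths with tracking parameters, so they usually need to be resolved against a base such as `url_root` before they identify a page. How aswan handles this internally is not shown in this README; the following is only a standard-library sketch of that kind of resolution. `resolve_link` is a hypothetical helper, not part of aswan, and the example href is made up for illustration.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

URL_ROOT = "https://www.imdb.com"  # same value as url_root in the handlers


def resolve_link(href, root=URL_ROOT):
    """Hypothetical helper: make an href absolute and drop query/fragment."""
    absolute = urljoin(root + "/", href)
    parts = urlsplit(absolute)
    # keep scheme, host and path only, so tracking parameters do not
    # turn one page into many distinct URLs
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


# illustrative href, not taken from a real page
print(resolve_link("/name/nm0000190/?ref_=tt_cl_i1"))
```

Stripping the trailing slash also keeps the last path segment non-empty, which matters for any code that derives an id from the end of the URL.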
movie_table.get_full_df()
| | title | summary | year | id |
|---|---|---|---|---|
| 0 | Arrival (2016) | A linguist works with the military to communicate with alien lifeforms after twelve mysterious spacecraft appear around the world. | 2016 | tt2543164 |
| 0 | I Love You Phillip Morris (2009) | A cop turns con man once he comes out of the closet. Once imprisoned, he meets the second love of his life, whom he'll stop at nothing to be with. | 2009 | tt1045772 |
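The `id` column comes from `MovieIntegrator.url_parser`, which takes the last `/`-separated piece of the URL. That works for the seed URLs used here, but a trailing slash or a query string would give an empty or polluted id. A slightly more defensive variant, as a sketch only (`id_from_url` is a hypothetical name, not an aswan function):

```python
from urllib.parse import urlsplit


def id_from_url(url):
    """Return the last non-empty path segment of a URL, or None."""
    path = urlsplit(url).path  # ignores query string and fragment
    segments = [seg for seg in path.split("/") if seg]
    return segments[-1] if segments else None


# compare the naive split with the defensive variant
for url in [
    "https://www.imdb.com/title/tt2543164",
    "https://www.imdb.com/title/tt2543164/",
    "https://www.imdb.com/title/tt2543164/?ref=x",
]:
    naive = url.split("/")[-1] or "<empty>"
    print(naive, "->", id_from_url(url))
```

Inside the integrator this could be used as `return {"id": id_from_url(url)}`, assuming `url_parser` receives the full page URL as it does in the quickstart.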
celeb_table.get_full_df()
| | name | dob |
|---|---|---|
| 0 | Matthew McConaughey | 1969-11-4 |
| 0 | Leslie Mann | 1972-3-26 |
| 0 | Jeremy Renner | 1971-1-7 |
| 0 | Forest Whitaker | 1961-7-15 |
| 0 | Jim Carrey | 1962-1-17 |
| 0 | Amy Adams | 1974-8-20 |
| 0 | Ewan McGregor | 1971-3-31 |
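The `dob` values are stored exactly as scraped and are not zero-padded (`1969-11-4`), so they do not sort chronologically as strings. If that matters downstream, they can be normalized after loading. A minimal standard-library sketch, where `normalize_dob` is a hypothetical helper and not part of aswan:

```python
from datetime import datetime


def normalize_dob(raw):
    """Turn a 'YYYY-M-D' string into a zero-padded ISO date string."""
    return datetime.strptime(raw, "%Y-%m-%d").date().isoformat()


print(normalize_dob("1969-11-4"))  # 1969-11-04
```

Assuming `get_full_df()` returns a pandas DataFrame, as the output above suggests, the helper could be applied with `df["dob"].map(normalize_dob)`.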
Pre v0.0.0 laundry list
will probably need to separate a few things from it:
- t2extractor
- unstructured json to tabular data automatically
- aswan.t2.extractor
- scheduler
TODO
- dvc integration
- export to dataset template
- maybe part of the dataset
- cleanup requirements
- s3, scp for push/pull
- add verified invalid output that is not parsing error
- selective push / pull
- with possible nuking of remote archive
- cleaning local obj store (when envs blow up, ide dies)
- parsing/connection error confusion
- also broken session thing
- conn session cpu requirement
- resource limits
- transferring / ignoring cookies
- lots of things with extractors
- template projects
- oddsportal
- updating thingy, based on latest match in season
- footy
- rotten
- boxoffice
- oddsportal