Doing some data things in a memory-efficient manner
How To Data
======================
1.) Split data.
2.) Create a generator that takes the data as an iterator and yields key/value pairs.
3.) Sort each list of key/value pairs by the key.
4.) Use a heap to merge the sorted lists of key/value pairs by the key.
5.) Group key/value pairs by the key.
6.) Reduce each key's grouped values to one value, yielding a single key/value pair.
In lieu of a key, you may use a key function as long as it produces the
same key throughout the map-sort-merge-group phases.
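
Steps 2 through 6 can be sketched with the standard library alone. This is a minimal illustration, not karld's own API: the function names (map_pairs, sort_shard, merge_group_reduce) and the sample rows are hypothetical, and the reduction (a sum) is just one possible choice. The same key function, itemgetter(0), is used in the sort, merge, and group phases, as the note above requires.

```python
from heapq import merge
from itertools import groupby
from operator import itemgetter

# The same key function must be used in the sort, merge, and group phases.
by_key = itemgetter(0)


def map_pairs(rows):
    """Step 2: yield (key, value) pairs from an iterator of rows."""
    for name, amount in rows:
        yield name, int(amount)


def sort_shard(rows):
    """Step 3: sort one shard's key/value pairs by key."""
    return sorted(map_pairs(rows), key=by_key)


def merge_group_reduce(sorted_shards):
    """Steps 4-6: heap-merge sorted shards, group by key, reduce each group."""
    merged = merge(*sorted_shards, key=by_key)  # heapq.merge is lazy
    for key, pairs in groupby(merged, key=by_key):
        yield key, sum(value for _, value in pairs)


# Hypothetical sample data: two small shards of (name, amount) rows.
shards = [
    [("b", "2"), ("a", "1")],
    [("a", "3"), ("c", "5")],
]
totals = list(merge_group_reduce(sort_shard(shard) for shard in shards))
print(totals)  # [('a', 4), ('b', 2), ('c', 5)]
```

In this sketch every sorted shard is held in memory at once. For data that does not fit, each sorted shard would be written back to disk and the merge would read from lazy iterators over those files, so that only one item per shard is in memory during the merge.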
Split data
----------------------
Use split_file to split up your data files::

    import os

    from karld.loadump import split_file

    big_file_names = [
        "bigfile1.csv",
        "bigfile2.csv",
        "bigfile3.csv"
    ]

    data_path = os.path.join('path', 'to', 'data', 'root')


    def main():
        for filename in big_file_names:
            # Name the directory to write the split files into.
            # I'll make it after the name of the file, removing the extension.
            out_dir = os.path.join(data_path, 'split_data', filename.replace('.csv', ''))

            # Split the file, with a default max_lines=200000 per shard of the file.
            split_file(os.path.join(data_path, filename), out_dir)


    if __name__ == "__main__":
        main()
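
To make the idea of sharding concrete, here is a rough standard-library sketch of splitting a file into pieces of at most max_lines lines. It is not how split_file is implemented. The function name split_lines, the numbered shard file names, and the return value are all assumptions made for illustration. It also splits on raw lines, so a CSV field containing an embedded newline could be cut in two.

```python
import os
from itertools import islice


def split_lines(in_path, out_dir, max_lines=200000):
    """Write in_path to numbered shard files of at most max_lines lines each.

    Hypothetical helper for illustration; returns the shard paths in order.
    """
    os.makedirs(out_dir, exist_ok=True)
    shard_paths = []
    # newline="" keeps the original line endings untouched.
    with open(in_path, newline="") as source:
        while True:
            # Read at most max_lines lines; only one shard is in memory at a time.
            lines = list(islice(source, max_lines))
            if not lines:
                break
            shard_path = os.path.join(out_dir, "{}.csv".format(len(shard_paths)))
            with open(shard_path, "w", newline="") as shard:
                shard.writelines(lines)
            shard_paths.append(shard_path)
    return shard_paths
```

Reading with islice keeps memory use bounded by the shard size rather than the size of the whole input file.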