Doing some data things in a memory-efficient manner
How To Data
======================
1.) Split the data into shards.
2.) Create a generator that takes the data as an iterator and yields key/value pairs.
3.) Sort each list of key/value pairs by the key.
4.) Use a heap to merge the sorted lists of key/value pairs by the key.
5.) Group the key/value pairs by the key.
6.) Reduce each key's grouped values to one value, yielding a single key/value pair.
In lieu of a key, you may use a key function as long as it produces the
same key throughout the map-sort-merge-group phases.
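
The steps above can be sketched with the standard library alone. This is a minimal illustration, not karld's own API: the helper names ``map_pairs``, ``sort_shard`` and ``merge_group_reduce`` are hypothetical, the shards are small in-memory lists standing in for split files, and summing is just one possible reduction. The same key function is passed to the sort, merge and group phases, as the note above requires.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def map_pairs(rows):
    """Step 2: yield (key, value) pairs from an iterable of rows."""
    for word, count in rows:
        yield word, count


def sort_shard(pairs, key=itemgetter(0)):
    """Step 3: sort one shard's pairs by key."""
    return sorted(pairs, key=key)


def merge_group_reduce(sorted_shards, key=itemgetter(0)):
    """Steps 4-6: heap-merge sorted shards, group by key, sum the values."""
    # heapq.merge is lazy: it holds one item per shard at a time.
    merged = heapq.merge(*sorted_shards, key=key)
    for k, group in groupby(merged, key=key):
        yield k, sum(value for _, value in group)


# Step 1 stand-in: two small in-memory shards instead of split files.
shards = [
    [("b", 1), ("a", 2), ("b", 3)],
    [("c", 4), ("a", 5)],
]
sorted_shards = [sort_shard(map_pairs(shard)) for shard in shards]
totals = list(merge_group_reduce(sorted_shards))
print(totals)  # [('a', 7), ('b', 4), ('c', 4)]
```

Only the per-shard sort holds a whole shard in memory. The merge, group and reduce phases stream, which is where the memory saving comes from.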
Split data
----------------------
Use split_file to split up your data files::

    import os

    from karld.loadump import split_file

    big_file_names = [
        "bigfile1.csv",
        "bigfile2.csv",
        "bigfile3.csv"
    ]

    data_path = os.path.join('path', 'to', 'data', 'root')


    def main():
        for filename in big_file_names:
            # Name the directory to write the split files into.
            # It is named after the file, with the extension removed.
            out_dir = os.path.join(data_path, 'split_data', filename.replace('.csv', ''))

            # Split the file, with a default max_lines=200000 per shard of the file.
            split_file(os.path.join(data_path, filename), out_dir)


    if __name__ == "__main__":
        main()
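
For readers who want to see what line-based splitting amounts to, here is a rough standard-library sketch. It is not karld's implementation: ``split_lines`` is a hypothetical helper, and the numbered shard file names are an assumption made for the example. Only the ``max_lines`` idea is taken from the comment in the code above.

```python
import os
import tempfile
from itertools import islice


def split_lines(in_path, out_dir, max_lines=200000):
    """Write in_path to numbered shard files of at most max_lines lines each."""
    os.makedirs(out_dir, exist_ok=True)
    shard_paths = []
    with open(in_path, "rb") as src:
        while True:
            # islice pulls at most max_lines lines without reading the whole file.
            chunk = list(islice(src, max_lines))
            if not chunk:
                break
            # Hypothetical naming scheme: 0.csv, 1.csv, ...
            shard_path = os.path.join(out_dir, "{}.csv".format(len(shard_paths)))
            with open(shard_path, "wb") as dst:
                dst.writelines(chunk)
            shard_paths.append(shard_path)
    return shard_paths


# Small demonstration on a temporary five-line file.
tmp_dir = tempfile.mkdtemp()
in_path = os.path.join(tmp_dir, "bigfile1.csv")
with open(in_path, "wb") as f:
    f.write(b"a,1\nb,2\nc,3\nd,4\ne,5\n")

paths = split_lines(in_path, os.path.join(tmp_dir, "split_data", "bigfile1"), max_lines=2)
print(len(paths))  # 3
```

At most one shard's worth of lines is held in memory at a time, so ``max_lines`` bounds memory use. Note that splitting on raw lines like this would cut a CSV record that contains an embedded newline.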