Doing some data things in a memory-efficient manner

How To Data
======================

1.) Split the data.
2.) Create a generator that takes the data as an iterator and yields key/value pairs.
3.) Sort each list of key/value pairs by the key.
4.) Use a heap to merge the sorted lists of key/value pairs by the key.
5.) Group the key/value pairs by the key.
6.) Reduce each key's grouped values to one value, yielding a single key/value pair per key.

In lieu of a key, you may use a key function as long as it produces the
same key throughout the map-sort-merge-group phases.
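The steps above can be sketched with the standard library alone. This is an illustration of the map-sort-merge-group-reduce idea, not karld's API: ``map_pairs`` and ``sort_merge_group_reduce`` are hypothetical names, the in-memory lists stand in for split files, and summing is just one possible reduction.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def map_pairs(rows):
    """Step 2: yield (key, value) pairs from an iterator of rows."""
    for name, amount in rows:
        yield name, amount


def sort_merge_group_reduce(shards, key=itemgetter(0)):
    """Hypothetical helper: run steps 3-6 over an iterable of shards."""
    # Step 3: sort each shard's pairs by the key.
    sorted_shards = [sorted(map_pairs(shard), key=key) for shard in shards]
    # Step 4: heap-merge the sorted shards into one stream ordered by key.
    merged = heapq.merge(*sorted_shards, key=key)
    # Step 5: group consecutive pairs that share a key.
    for group_key, pairs in groupby(merged, key=key):
        # Step 6: reduce each group's values to a single value (here, a sum).
        yield group_key, sum(value for _, value in pairs)


# Step 1 stand-in: two small in-memory "shards" instead of split files.
shards = [
    [("b", 2), ("a", 1)],
    [("a", 3), ("c", 4), ("b", 5)],
]
result = list(sort_merge_group_reduce(shards))
print(result)  # [('a', 4), ('b', 7), ('c', 4)]
```

The same ``key`` function is passed to ``sorted``, ``heapq.merge``, and ``groupby``, which is what keeps the key consistent across the phases. For data that does not fit in memory, each sorted shard would presumably be written back to disk and streamed into the merge rather than held in a list as it is here.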

Split data
----------------------

Use split_file to split up your data files::

    import os
    from karld.loadump import split_file

    big_file_names = [
        "bigfile1.csv",
        "bigfile2.csv",
        "bigfile3.csv"
    ]

    data_path = os.path.join('path', 'to', 'data', 'root')


    def main():
        for filename in big_file_names:
            # Name the directory to write the split files into.
            # I'll make it after the name of the file, removing the extension.
            out_dir = os.path.join(data_path, 'split_data', filename.replace('.csv', ''))

            # Split the file, with a default max_lines=200000 per shard of the file.
            split_file(os.path.join(data_path, filename), out_dir)


    if __name__ == "__main__":
        main()
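To show what splitting into shards of at most ``max_lines`` lines amounts to, here is a minimal standard-library sketch. It is not karld's ``split_file`` implementation: ``split_lines`` and the numbered shard file names are assumptions made for illustration.

```python
import os
import tempfile
from itertools import islice


def split_lines(in_path, out_dir, max_lines=200000):
    """Write consecutive chunks of at most max_lines lines to numbered files.

    Hypothetical helper for illustration; not karld's split_file.
    """
    os.makedirs(out_dir, exist_ok=True)
    shard_paths = []
    with open(in_path, "r", newline="") as source:
        while True:
            # islice reads at most max_lines lines; the rest stay unread.
            chunk = list(islice(source, max_lines))
            if not chunk:
                break
            # Assumed naming scheme: 0.csv, 1.csv, ...
            shard_path = os.path.join(out_dir, "{}.csv".format(len(shard_paths)))
            with open(shard_path, "w", newline="") as shard:
                shard.writelines(chunk)
            shard_paths.append(shard_path)
    return shard_paths


# Demo on a throwaway file: 5 lines split into shards of 2 lines each.
with tempfile.TemporaryDirectory() as tmp:
    big = os.path.join(tmp, "bigfile1.csv")
    with open(big, "w", newline="") as f:
        f.writelines("row{}\n".format(i) for i in range(5))
    shards = split_lines(big, os.path.join(tmp, "split_data", "bigfile1"), max_lines=2)
    shard_sizes = []
    for path in shards:
        with open(path, newline="") as f:
            shard_sizes.append(sum(1 for _ in f))
print(shard_sizes)  # [2, 2, 1]
```

Only one chunk of lines is held in memory at a time. A plain line-based split like this one can cut a CSV record in two if a quoted field contains a newline, so it only suits files with one record per line.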

Download files
----------------------

Source distribution: karld-0.0.7.tar.gz (9.5 kB)

Built distribution: karld-0.0.7.macosx-10.9-intel.exe (73.7 kB)
