Doing some data things in a memory-efficient manner

How To Data
======================

1.) Split the data.
2.) Create a generator that takes the data as an iterator and yields key/value pairs.
3.) Sort each list of key/value pairs by the key.
4.) Use a heap to merge the sorted lists of key/value pairs by the key.
5.) Group the key/value pairs by the key.
6.) Reduce each key's grouped values to one value, yielding a single key/value pair.
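
The six steps above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not karld's own API: the names ``split``, ``map_pairs`` and ``count_words`` are hypothetical, and the shards are kept in memory as lists, whereas a real run would write each sorted shard to disk.

```python
import heapq
from functools import reduce
from itertools import groupby, islice
from operator import itemgetter


def split(items, size):
    """Step 1: cut an iterable into lists of at most `size` items."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def map_pairs(words):
    """Step 2: a generator that yields (key, value) pairs."""
    for word in words:
        yield word, 1


def count_words(words, shard_size=3):
    """Hypothetical driver that runs steps 1 through 6 for a word count."""
    key = itemgetter(0)
    # Steps 1-3: split, map, and sort each shard by key.
    sorted_shards = [sorted(map_pairs(shard), key=key)
                     for shard in split(words, shard_size)]
    # Step 4: heap-based merge of the already sorted shards.
    merged = heapq.merge(*sorted_shards, key=key)
    # Step 5: group by key. Step 6: reduce each group to one value.
    for k, group in groupby(merged, key=key):
        yield k, reduce(lambda total, pair: total + pair[1], group, 0)


result = list(count_words(["b", "a", "c", "a", "b", "a", "c"]))
print(result)  # [('a', 3), ('b', 2), ('c', 2)]
```

Because ``heapq.merge`` and ``groupby`` are lazy, only one item per shard plus the current group needs to be held at a time once the shards are sorted, which is where the memory savings come from.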

In lieu of a key, you may use a key function as long as it produces the
same key throughout the map-sort-merge-group phases.
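
As a small sketch of that rule, the example below passes one key function to the sort, merge, and group steps. The function name ``by_lowercase_name`` and the sample rows are made up for illustration. If the phases used different key functions, equal keys would no longer be adjacent after the merge and the grouping would come out fragmented.

```python
import heapq
from itertools import groupby


def by_lowercase_name(row):
    """Hypothetical key function: the first column, lower-cased."""
    return row[0].lower()


shard_a = [("Bob", 2), ("alice", 1)]
shard_b = [("ALICE", 5), ("bob", 3)]

# The same key function drives the sort, the merge, and the grouping.
sorted_shards = [sorted(shard, key=by_lowercase_name)
                 for shard in (shard_a, shard_b)]
merged = heapq.merge(*sorted_shards, key=by_lowercase_name)
totals = {
    name: sum(value for _, value in rows)
    for name, rows in groupby(merged, key=by_lowercase_name)
}
print(totals)  # {'alice': 6, 'bob': 5}
```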

Split data
----------------------

Use split_file to split up your data files::

    import os
    from karld.loadump import split_file

    big_file_names = [
        "bigfile1.csv",
        "bigfile2.csv",
        "bigfile3.csv"
    ]

    data_path = os.path.join('path', 'to', 'data', 'root')


    def main():
        for filename in big_file_names:
            # Name the directory to write the split files into.
            # Name it after the file, with the extension removed.
            out_dir = os.path.join(data_path, 'split_data', filename.replace('.csv', ''))

            # Split the file, with a default max_lines=200000 per shard of the file.
            split_file(os.path.join(data_path, filename), out_dir)


    if __name__ == "__main__":
        main()
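
To make the idea of line-count sharding concrete, here is a rough standard-library sketch. It is not karld's implementation: ``split_lines`` and its shard naming scheme are hypothetical, and only the ``max_lines`` default of 200000 is taken from the comment in the example above. It splits on raw newlines, so a CSV record containing an embedded newline could be cut across two shards.

```python
import os
import tempfile
from itertools import islice


def split_lines(in_path, out_dir, max_lines=200000):
    """Write consecutive runs of at most max_lines lines to numbered files.

    Hypothetical helper for illustration only.
    """
    os.makedirs(out_dir, exist_ok=True)
    base = os.path.basename(in_path)
    shard_paths = []
    with open(in_path, "rb") as source:
        while True:
            # Read only one shard's worth of lines into memory at a time.
            lines = list(islice(source, max_lines))
            if not lines:
                break
            shard_path = os.path.join(
                out_dir, "{}_{}".format(len(shard_paths), base))
            with open(shard_path, "wb") as shard:
                shard.writelines(lines)
            shard_paths.append(shard_path)
    return shard_paths


# Demo on a small temporary file: 5 lines, at most 2 lines per shard.
with tempfile.TemporaryDirectory() as tmp:
    in_path = os.path.join(tmp, "bigfile1.csv")
    with open(in_path, "wb") as f:
        f.write(b"a,1\nb,2\nc,3\nd,4\ne,5\n")
    shards = split_lines(
        in_path, os.path.join(tmp, "split_data", "bigfile1"), max_lines=2)
    shard_sizes = []
    for path in shards:
        with open(path, "rb") as f:
            shard_sizes.append(len(f.readlines()))

print(shard_sizes)  # [2, 2, 1]
```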
