csv-dataset helps to read csv files and create descriptive and efficient input pipelines for deep learning in a streaming fashion
csv-dataset
CsvDataset helps to read a csv file and create descriptive and efficient input pipelines for deep learning.
CsvDataset iterates the records of the csv file in a streaming fashion, so the full dataset does not need to fit into memory.
Install
$ pip install csv-dataset
Usage
Suppose we have a csv file whose absolute path is filepath:
open_time,open,high,low,close,volume
1576771200000,7145.99,7150.0,7141.01,7142.33,21.094283
1576771260000,7142.89,7142.99,7120.7,7125.73,118.279931
1576771320000,7125.76,7134.46,7123.12,7123.12,41.03628
1576771380000,7123.74,7128.06,7117.12,7126.57,39.885367
1576771440000,7127.34,7137.84,7126.71,7134.99,25.138154
1576771500000,7134.99,7144.13,7132.84,7141.64,26.467308
...
from csv_dataset import (
    Dataset,
    CsvReader
)

dataset = Dataset(
    CsvReader(
        filepath,
        float,
        # Skip the first column (open_time) and only pick the following
        indexes=[1, 2, 3, 4, 5],
        header=True
    )
).window(3, 1).batch(2)
for element in dataset:
print(element)
The following shows the output of the first print:
[[[7145.99, 7150.0,  7141.01, 7142.33, 21.094283]
  [7142.89, 7142.99, 7120.7,  7125.73, 118.279931]
  [7125.76, 7134.46, 7123.12, 7123.12, 41.03628 ]]

 [[7142.89, 7142.99, 7120.7,  7125.73, 118.279931]
  [7125.76, 7134.46, 7123.12, 7123.12, 41.03628 ]
  [7123.74, 7128.06, 7117.12, 7126.57, 39.885367]]]
...
Dataset(reader: AbstractReader)
dataset.window(size: int, shift: int = None, stride: int = 1) -> self
Defines the window size, shift and stride.
The default window size is 1, which means the dataset is not windowed.
Parameter explanation
Suppose we have a raw data set
[ 1 2 3 4 5 6 7 8 9 ... ]
And the following is a window of (size=4, shift=3, stride=2):

       |-------------- size:4 --------------|
       |- stride:2 -|                       |
       |            |                       |
win 0: [ 1    3    5    7 ]  --------|-----
                                  shift:3
win 1: [ 4    6    8    10 ] --------|-----
                                  shift:3
win 2: [ 7    9    11   13 ]
...
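The windowing semantics above can be sketched in plain NumPy. This is a hedged illustration of the behavior described here, not the library's internal implementation, and the assumption that an unset shift defaults to the window size is ours:

```python
import numpy as np

def window(data, size, shift=None, stride=1):
    # Each window picks `size` elements spaced `stride` apart;
    # consecutive windows start `shift` elements apart.
    # Assumption: shift defaults to size when unset.
    shift = size if shift is None else shift
    span = stride * (size - 1) + 1  # raw elements covered by one window
    out = []
    start = 0
    while start + span <= len(data):
        out.append(data[start:start + span:stride])
        start += shift
    return np.array(out)

# Reproduces the diagram: windows [1 3 5 7], [4 6 8 10], [7 9 11 13]
print(window(np.arange(1, 14), size=4, shift=3, stride=2))
```

With size=4, shift=3 and stride=2, each window spans 7 raw elements and consecutive windows overlap by 4 raw elements, exactly as the diagram shows.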
dataset.batch(batch: int) -> self
Defines the batch size.
The default batch size of the dataset is 1, which means the dataset yields a single window per batch.
If batch is 2:
batch 0: [[ 1    3    5    7 ]
          [ 4    6    8    10 ]]
batch 1: [[ 7    9    11   13 ]
          [ 10   12   14   16 ]]
...
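Continuing the sketch, batching simply groups consecutive windows. Again this illustrates the documented semantics rather than the library's code; whether the real implementation drops a trailing incomplete batch is not stated here, so this sketch just drops it:

```python
def batch(windows, size):
    # Group consecutive windows into batches of `size`; a trailing
    # incomplete batch is dropped in this illustration.
    return [windows[i:i + size]
            for i in range(0, len(windows) - size + 1, size)]

wins = [[1, 3, 5, 7], [4, 6, 8, 10], [7, 9, 11, 13], [10, 12, 14, 16]]
for i, b in enumerate(batch(wins, 2)):
    print(f'batch {i}:', b)
```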
dataset.get() -> Optional[np.ndarray]
Gets the data of the next batch, or None if the dataset is exhausted
dataset.reset() -> self
Resets dataset
dataset.read(amount: int, reset_buffer: bool = False)
Reads multiple batches at a time.
- amount: the maximum amount of data the dataset will read
- reset_buffer: if True, the dataset will discard the previous window's data left in the buffer
If we reset the buffer, the next read will not reuse existing data in the buffer, and the result will have no overlap with the last read.
dataset.reset_buffer() -> None
Resets the buffer, so that the next read will have no overlap with the last one
dataset.lines_need(reads: int) -> int
Calculates and returns how many lines of the underlying data are needed for reading reads times
dataset.max_reads(max_lines: int) -> int | None
Calculates how many reads max_lines lines can afford.
dataset.max_reads() -> int | None
Calculates how many reads the current reader can afford.
If max_lines of the current reader is unset, it returns None
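The line-count arithmetic behind lines_need and max_reads can be worked through from the window and batch parameters. This is our own reconstruction for illustration; the library's actual bookkeeping (e.g. around buffered, already-read data) may differ:

```python
def lines_need(reads, size, shift, stride, batch):
    # One window spans stride * (size - 1) + 1 consecutive lines;
    # each subsequent window starts `shift` lines later, and one
    # read consumes `batch` windows.
    windows = reads * batch
    return stride * (size - 1) + 1 + shift * (windows - 1)

def max_reads(max_lines, size, shift, stride, batch):
    # Inverse of lines_need: how many full reads fit into max_lines lines.
    span = stride * (size - 1) + 1
    if max_lines < span:
        return 0
    windows = (max_lines - span) // shift + 1
    return windows // batch

# For the usage example above, window(3, 1).batch(2):
print(lines_need(1, size=3, shift=1, stride=1, batch=2))  # 4
print(max_reads(6, size=3, shift=1, stride=1, batch=2))   # 2
```

This matches the usage example: the first batch of two 3-line windows shifted by 1 consumed the first 4 data lines of the csv file.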
CsvReader(filepath, dtype, indexes, **kwargs)
- filepath str
  Absolute path of the csv file
- dtype Callable
  Data type; we should only use float or int for this argument
- indexes List[int]
  Column indexes to pick from the lines of the csv file
- kwargs
  - header bool = False
    Whether we should skip reading the header line
  - splitter str = ','
    The column splitter of the csv file
  - normalizer List[NormalizerProtocol]
    List of normalizers to normalize each column of data. A NormalizerProtocol should contain two methods: normalize(float) -> float to normalize the given datum, and restore(float) -> float to restore the normalized datum
  - max_lines int = -1
    Max lines of the csv file to be read. Defaults to -1, which means no limit
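A NormalizerProtocol can be satisfied by any plain class providing those two methods. Here is a hypothetical min-max normalizer; only the normalize/restore method names come from the docs above, while the class name and bounds are made up for illustration:

```python
class MinMaxNormalizer:
    # Hypothetical NormalizerProtocol implementation: maps values in
    # [lo, hi] to [0, 1] and restores them back.
    def __init__(self, lo: float, hi: float):
        self.lo = lo
        self.hi = hi

    def normalize(self, datum: float) -> float:
        return (datum - self.lo) / (self.hi - self.lo)

    def restore(self, datum: float) -> float:
        return datum * (self.hi - self.lo) + self.lo

n = MinMaxNormalizer(7000.0, 7200.0)
print(n.normalize(7100.0))  # 0.5
print(n.restore(0.5))       # 7100.0
```

One normalizer per picked column would be passed, e.g. normalizer=[MinMaxNormalizer(7000.0, 7200.0), ...] to match the indexes argument.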
reader.reset()
Resets the reader position
property reader.max_lines
Gets max_lines
setter reader.max_lines = lines
Changes max_lines
reader.readline() -> list
Reads the next line and returns the list of converted values
property reader.lines
Returns the number of lines that have been read
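To make the reader surface concrete, here is a minimal in-memory stand-in that mirrors the documented interface (readline, lines, reset). It is a sketch for illustration only, not the library's CsvReader:

```python
import io

class SketchReader:
    # Minimal stand-in mirroring the documented reader interface;
    # reads from a string instead of a file for self-containment.
    def __init__(self, text, dtype, indexes, header=False, splitter=','):
        self._text = text
        self._dtype = dtype
        self._indexes = indexes
        self._header = header
        self._splitter = splitter
        self.reset()

    def reset(self):
        # Rewind to the start (skipping the header line if any)
        # and zero the lines-read counter.
        self._file = io.StringIO(self._text)
        self.lines = 0
        if self._header:
            self._file.readline()
        return self

    def readline(self):
        # Return the converted values of the next line,
        # or None when the input is exhausted.
        line = self._file.readline().strip()
        if not line:
            return None
        self.lines += 1
        cols = line.split(self._splitter)
        return [self._dtype(cols[i]) for i in self._indexes]

CSV = "open_time,open,high\n1,7145.99,7150.0\n2,7142.89,7142.99\n"
reader = SketchReader(CSV, float, [1, 2], header=True)
print(reader.readline())  # [7145.99, 7150.0]
print(reader.lines)       # 1
```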