The GreenKey ASRToolkit provides tools for automatic speech recognition (ASR) file conversion and corpora organization.

GreenKey Automatic Speech Recognition (ASR) Toolkit


The GreenKey ASRToolkit provides tools for file conversion and ASR corpora organization. These tools are intended to simplify the workflow for building, customizing, and analyzing ASR models, and are aimed at scientists, engineers, and other technologists working in speech recognition.

File formats supported

Each file format has its own handler in asrtoolkit/data_handlers. The scripts convert_transcript and wer support stm, srt, vtt, txt, and GreenKey JSON-formatted transcripts. A custom HTML format is also available, but it should not be considered stable for long-term storage because it is subject to change without notice.
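As an illustration of per-format handling, the sketch below picks a format from a file's extension. The set of extensions comes from the list above; the function name `detect_format` and the dispatch-by-extension approach are assumptions made for this example, not a description of how asrtoolkit/data_handlers is actually organized.

```python
from pathlib import Path

# Extensions named in the text above. Treating the extension as the
# format selector is an assumption made for illustration.
SUPPORTED_EXTENSIONS = {"stm", "srt", "vtt", "txt", "json", "html"}


def detect_format(path):
    """Return the lowercase extension of `path` if it is a supported format."""
    extension = Path(path).suffix.lstrip(".").lower()
    if extension not in SUPPORTED_EXTENSIONS:
        raise ValueError("unsupported transcript format: {!r}".format(path))
    return extension
```

A caller could use the returned string to look up a reader or writer for that format.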

convert_transcript

usage: convert_transcript [-h] input_file output_file

convert a single transcript from one text file format to another

positional arguments:
  input_file   input file
  output_file  output file

optional arguments:
  -h, --help   show this help message and exit

This tool allows for easy conversion among the file formats listed above.

Note: Attributes of a segment object that are not present in a parsed file retain their default values.

  • For example, a segment object is created for each line of an STM file.
  • Each is initialized with the following default values, which are not encoded in STM files: formatted_text=''; confidence=1.0
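The default-value behavior can be sketched with a small stand-in class. `Segment` and `parse_stm_line` here are hypothetical names written for this example, not the toolkit's own classes. Only the defaults formatted_text='' and confidence=1.0 come from the note above; the other attributes and the STM field order (filename, channel, speaker, start, stop, optional label, text) are assumptions based on the conventional STM layout.

```python
class Segment:
    """Hypothetical stand-in for a transcript segment (not the toolkit's class)."""

    def __init__(self, filename="", channel="1", speaker="", start=0.0,
                 stop=0.0, label="", text="", formatted_text="",
                 confidence=1.0):
        self.filename = filename
        self.channel = channel
        self.speaker = speaker
        self.start = start
        self.stop = stop
        self.label = label
        self.text = text
        # STM has no fields for these two, so they keep their defaults.
        self.formatted_text = formatted_text
        self.confidence = confidence


def parse_stm_line(line):
    """Build a Segment from one STM line, assuming the conventional layout."""
    parts = line.split()
    filename, channel, speaker, start, stop = parts[:5]
    rest = parts[5:]
    label = ""
    # The label field is optional and looks like <o,f0,male>.
    if rest and rest[0].startswith("<") and rest[0].endswith(">"):
        label = rest.pop(0)
    return Segment(filename=filename, channel=channel, speaker=speaker,
                   start=float(start), stop=float(stop), label=label,
                   text=" ".join(rest))
```

Converting such a segment to a format that does carry a confidence field would therefore write 1.0 unless the source file supplied another value.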

wer

usage: wer [-h] [--char-level] [--ignore-nsns]
           reference_file transcript_file

Compares a reference and transcript file and calculates word error rate (WER)
between these two files

positional arguments:
  reference_file   reference "truth" file
  transcript_file  transcript possibly containing errors

optional arguments:
  -h, --help       show this help message and exit
  --char-level     calculate character error rate instead of word error rate
  --ignore-nsns    ignore non silence noises like um, uh, etc.

This tool allows for easy comparison of reference and hypothesis transcripts in any format listed above.
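For readers unfamiliar with the metric, word error rate is commonly defined as the word-level edit distance between the hypothesis and the reference, divided by the number of reference words. The sketch below implements that textbook definition; it is not the toolkit's code. The function names are made up for this example, and returning a fraction rather than a percentage is an assumption. The `char_level` flag mirrors the idea behind --char-level by comparing characters (spaces included) instead of words.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def error_rate(reference, hypothesis, char_level=False):
    """Return the error rate as a fraction of the reference length."""
    ref_units = list(reference) if char_level else reference.split()
    hyp_units = list(hypothesis) if char_level else hypothesis.split()
    if not ref_units:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_units, hyp_units) / float(len(ref_units))
```

Because the denominator is the reference length, a hypothesis with many insertions can produce a rate above 1.0.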

clean_formatting

usage: clean_formatting.py [-h] files [files ...]

cleans input *.txt files and outputs *_cleaned.txt

positional arguments:
  files       list of input files

optional arguments:
  -h, --help  show this help message and exit

This script standardizes how abbreviations, numbers, and other formatted text are expressed so that ASR engines can easily use these files as training or testing data. Standardized output formatting is essential for reproducible measurements of ASR accuracy.
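To make the idea of standardization concrete, here is a deliberately small sketch. The abbreviation table, the digit-by-digit number spelling, and the function names are all assumptions chosen for illustration; the actual rules applied by clean_formatting are not described in this document and are likely more thorough. Only the *_cleaned.txt naming comes from the help text above.

```python
import re
from pathlib import Path

# Tiny illustrative tables; a real cleaner would need far more coverage.
ABBREVIATIONS = {"dr.": "doctor", "mr.": "mister", "vs.": "versus"}
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]


def clean_line(line):
    """Lowercase, expand known abbreviations, spell out digits, drop punctuation."""
    words = []
    for token in line.lower().split():
        token = ABBREVIATIONS.get(token, token)
        # Digits are spelled one at a time, so "42" becomes "four two".
        token = re.sub(r"\d", lambda m: " " + DIGITS[int(m.group())] + " ", token)
        token = re.sub(r"[^a-z' ]", " ", token)
        words.extend(token.split())
    return " ".join(words)


def cleaned_path(path):
    """Map notes.txt to notes_cleaned.txt in the same directory."""
    path = Path(path)
    return path.with_name(path.stem + "_cleaned" + path.suffix)
```

Applying the same normalization to both reference and hypothesis text keeps formatting differences from being counted as recognition errors.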

split_audio_file

usage: split_audio_file [-h] [--target-dir TARGET_DIR] audio_file transcript

Split an audio file using valid segments from a transcript file. For this
utility, transcript files must contain start/stop times.

positional arguments:
  audio_file            input audio file
  transcript            transcript

optional arguments:
  -h, --help            show this help message and exit
  --target-dir TARGET_DIR
                        Path to target directory
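Splitting by start/stop times amounts to converting each time to a frame offset and copying that range of frames. The sketch below does this for WAV input with Python's standard wave module. It is an independent illustration, not the toolkit's implementation; the function name, the output naming scheme, and the decision to skip segments whose stop time is not after the start time are assumptions.

```python
import wave
from pathlib import Path


def split_wav(audio_path, segments, target_dir):
    """Write one WAV file per (start, stop) pair given in seconds."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    stem = Path(audio_path).stem
    written = []
    with wave.open(str(audio_path), "rb") as source:
        params = source.getparams()
        rate = source.getframerate()
        total = source.getnframes()
        for index, (start, stop) in enumerate(segments):
            first = max(0, int(round(start * rate)))
            last = min(total, int(round(stop * rate)))
            if last <= first:
                continue  # treat empty or reversed ranges as invalid
            source.setpos(first)
            frames = source.readframes(last - first)
            out_path = target / "{}_seg_{:05d}.wav".format(stem, index)
            with wave.open(str(out_path), "wb") as out:
                out.setparams(params)
                out.writeframes(frames)  # header frame count is fixed on close
            written.append(out_path)
    return written
```

The (start, stop) pairs would come from a transcript format that carries timing, such as STM.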

prepare_audio_corpora

usage: prepare_audio_corpora [-h] [--target-dir TARGET_DIR]
                             corpora [corpora ...]

Copy and organize specified corpora into a target directory. Training,
testing, and development sets will be created automatically if not already
defined.

positional arguments:
  corpora               Name of one or more directories in directory this
                        script is run

optional arguments:
  -h, --help            show this help message and exit
  --target-dir TARGET_DIR
                        Path to target directory

This script searches the listed directories for paired STM and SPH files. If train, test, and dev folders are present, those labels are used for the output folders. By default, a target directory named 'input-data' is created. Hyphens in filenames are converted to underscores, and audio files are converted to single-channel, 16 kHz, signed PCM format. If two channels are present, only the first is used.
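The pairing and renaming steps can be sketched as follows. All function names are hypothetical, and matching an STM file to an SPH file by shared file stem in the same folder is an assumption about what "paired" means. How the script creates train, test, and dev sets when none are defined is not described here, so the sketch only reports an existing label or None.

```python
from pathlib import Path

SPLIT_LABELS = ("train", "test", "dev")


def sanitize_name(filename):
    """Replace hyphens with underscores, as described above."""
    return filename.replace("-", "_")


def find_stm_sph_pairs(corpus_dir):
    """Return (stm_path, sph_path) pairs that share a stem in the same folder."""
    pairs = []
    for stm_path in sorted(Path(corpus_dir).rglob("*.stm")):
        sph_path = stm_path.with_suffix(".sph")
        if sph_path.exists():
            pairs.append((stm_path, sph_path))
    return pairs


def split_label(path, corpus_dir):
    """Return train, test, or dev if a folder below corpus_dir has that name."""
    relative_parts = Path(path).relative_to(corpus_dir).parts[:-1]
    for part in relative_parts:
        if part in SPLIT_LABELS:
            return part
    return None  # no predefined split; assignment is left to the caller
```

Audio conversion to single-channel 16 kHz PCM is a separate step and is not covered by this sketch.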

degrade_audio_file

usage: degrade_audio_file input_file1.wav input_file2.wav

Degrade audio files to 8 kHz format similar to G711 codec

This script reduces the quality of the input audio files so that acoustic models can learn features typical of telephony audio encoded with the G.711 codec.
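A rough idea of this kind of degradation is sketched below, assuming floating-point samples in the range -1 to 1. It halves the sample rate by averaging neighboring samples and then passes each sample through the continuous mu-law curve with 8-bit quantization. This is a simplification chosen for illustration: it is not the script's actual processing chain, and it uses the textbook mu-law formula rather than G.711's exact segmented encoding. The function names are made up for the example.

```python
import math

MU = 255  # companding constant of the mu-law curve


def mu_law_round_trip(sample):
    """Compress with mu-law, quantize to 8 bits, and expand again."""
    sample = max(-1.0, min(1.0, sample))
    compressed = math.copysign(
        math.log1p(MU * abs(sample)) / math.log1p(MU), sample)
    code = int(round((compressed + 1.0) / 2.0 * 255))  # 256 levels
    quantized = code / 255.0 * 2.0 - 1.0
    return math.copysign(
        math.expm1(abs(quantized) * math.log1p(MU)) / MU, quantized)


def degrade(samples, source_rate=16000, target_rate=8000):
    """Crudely downsample by averaging, then apply the mu-law round trip."""
    if source_rate % target_rate:
        raise ValueError("source_rate must be a multiple of target_rate")
    step = source_rate // target_rate
    decimated = []
    for i in range(0, len(samples), step):
        group = samples[i:i + step]
        decimated.append(sum(group) / float(len(group)))
    return [mu_law_round_trip(s) for s in decimated]
```

Averaging is only a crude low-pass filter; a production resampler would use a proper anti-aliasing filter before decimation.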

extract_excel_spreadsheets

Note that this tool requires pandas, which must be installed separately. This can be done via pip install pandas.

usage: extract_excel_spreadsheets.py [-h] [--input-folder INPUT_FOLDER]
                                     [--output-corpus OUTPUT_CORPUS]

convert a folder of excel spreadsheets to a corpus of text files

optional arguments:
  -h, --help            show this help message and exit
  --input-folder INPUT_FOLDER
                        input folder of excel spreadsheets ending in .xls or
                        .xlsx
  --output-corpus OUTPUT_CORPUS
                        output folder for storing text corpus
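The conversion can be approximated with pandas as sketched below. The function names, the choice to read every sheet, and the one-line-per-row output are assumptions for illustration; the script's real output layout is not described in this document. pandas is imported inside the function so that the file-listing helper works without it, and pandas itself may need an Excel engine such as openpyxl to read .xlsx files.

```python
from pathlib import Path


def find_spreadsheets(input_folder):
    """List files ending in .xls or .xlsx (case-insensitive), sorted by name."""
    folder = Path(input_folder)
    return sorted(p for p in folder.iterdir()
                  if p.suffix.lower() in (".xls", ".xlsx"))


def spreadsheet_to_text(path):
    """Return the non-empty cells of every sheet, one text line per row."""
    import pandas  # lazy import: only needed when a spreadsheet is read

    # sheet_name=None loads all sheets into a dict of DataFrames.
    sheets = pandas.read_excel(str(path), sheet_name=None,
                               header=None, dtype=str)
    lines = []
    for frame in sheets.values():
        for row in frame.fillna("").values.tolist():
            cells = [str(cell).strip() for cell in row]
            text = " ".join(cell for cell in cells if cell)
            if text:
                lines.append(text)
    return "\n".join(lines)
```

Each returned string could then be written to a .txt file in the output corpus folder.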

Requirements

  • Python >= 3.5 with pip

Contributing

Code of Conduct

Please make sure you read and observe our Code of Conduct.

Pull Request process

  1. Fork it
  2. Create your feature branch (git checkout -b feature/fooBar)
  3. Commit your changes (git commit -am 'Add some fooBar')
  4. Push to the branch (git push origin feature/fooBar)
  5. Create a new Pull Request

Authors

License

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.
