This summarizer attempts to leverage Byte Pair Encoding (BPE) tokenization and the BART vocabulary to filter text by semantic meaningfulness.

Project description

BPE Summarizer

BPE text representation is a subword level approach to tokenization which aims to efficiently reuse parts of words while retaining semantic value.

The algorithm is based on the frequency of adjacent symbol pairs: the most frequent pairs are merged repeatedly, so character sequences that occur often end up represented by larger tokens.
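
As a rough illustration of the pair-frequency idea, here is a minimal sketch of BPE merge learning on a toy corpus. It is not the BART tokenizer and not this package's code; the names `count_pairs`, `merge_pair`, and `learn_bpe` are hypothetical helpers introduced only for this example.

```python
from collections import Counter


def count_pairs(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(words, pair):
    """Rewrite every word so that each occurrence of `pair` becomes one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges from a whitespace-separated toy corpus."""
    # Start from single characters; each word maps to its corpus frequency.
    words = dict(Counter(tuple(word) for word in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words


merges, words = learn_bpe("low low low lower lowest newer newer", 3)
print(merges)
print(words)
```

In this toy run the frequent word "low" collapses into a single symbol after two merges, while rarer words stay split into smaller pieces.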

This project explores the assumption that token size correlates strongly with semantic meaningfulness. The summarizer aims to surface the most meaningful sentences by comparing token values and retaining the sentences from the original text that include meaningful tokens within a specified percentile.
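
The percentile filter described above can be sketched as follows. This is only an illustration under stated assumptions, not the package's implementation: whitespace-separated words stand in for BPE tokens, a token's "value" is assumed to be its character length, and `percentile` and `toy_summarize` are hypothetical names.

```python
import math
import re


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list, with pct in 0..100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def toy_summarize(text, pct=90):
    """Keep sentences containing at least one token at or above the percentile."""
    # Naive sentence split on terminal punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Assumption: score each whitespace token by its length, ignoring punctuation.
    scored = [
        [len(token.strip(".,;:!?()")) for token in sentence.split()]
        for sentence in sentences
    ]
    all_scores = [value for row in scored for value in row]
    if not all_scores:
        return ""
    threshold = percentile(all_scores, pct)
    kept = [
        sentence
        for sentence, row in zip(sentences, scored)
        if any(value >= threshold for value in row)
    ]
    return " ".join(kept)


text = "The cat sat. Combinatory grammars recover dependencies. It is so."
print(toy_summarize(text, pct=90))
```

Because sentences are kept in their original order and never rewritten, the output is always an extract of the input text.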

Install

pip install bpe-summarizer

Usage

from bpe_summarizer import bpe_summarize

bpe_summarize(article, percentile=99)

Examples

Human Summary

Building Deep Dependency Structures Using A Wide-Coverage CCG Parser

This paper describes a wide-coverage statistical parser that uses Combinatory Categorial Grammar (CCG) to derive dependency structures.

The parser differs from most existing wide-coverage treebank parsers in capturing the long-range dependencies inherent in constructions such as coordination, extraction, raising and control, as well as the standard local predicate-argument dependencies.

A set of dependency structures used for training and testing the parser is obtained from a treebank of CCG normal-form derivations, which have been derived (semi-) automatically from the Penn Treebank. The parser correctly recovers over 80% of labelled dependencies, and around 90% of unlabelled dependencies.

We provide examples showing how heads can fill dependency slots during a derivation, and how long-range dependencies can be recovered through unification of co-indexed head variables.

We define predicate argument structure for CCG in terms of the dependencies that hold between words with lexical functor categories and their arguments.

BPE Summary

Building Deep Dependency Structures Using A Wide-Coverage CCG Parser

This paper describes a wide-coverage statistical parser that uses Combinatory Categorial Grammar (CCG) to derive dependency structures.

The parser differs from most existing wide-coverage treebank parsers in capturing the long-range dependencies inherent in constructions such as coordination, extraction, raising and control, as well as the standard local predicate-argument dependencies.

A set of dependency structures used for training and testing the parser is obtained from a treebank of CCG normal-form derivations, which have been derived (semi-) automatically from the Penn Treebank. The parser correctly recovers over 80% of labelled dependencies, and around 90% of unlabelled dependencies. However, the dependencies are typically derived from a context-free phrase structure.

Evaluation

To evaluate the quality of the summaries, we apply a semantic similarity metric to compare auto-summarized examples with human summaries from the scisummnet dataset. Text was represented using sentence-level embeddings. Figure 1 charts the results from the BPE Summarizer as compared to widely used summarization techniques. It performed competitively and completed summarization in one one-hundredth of a second, compared to 55 seconds* over 100 samples.
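
The source does not name the embedding model or the similarity metric. As a minimal sketch, assuming cosine similarity, the comparison could look like the code below. Word counts stand in for real sentence-level embeddings, and `embed` and `cosine_similarity` are hypothetical helpers, not part of this package.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a sentence embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


human = "The parser recovers long-range dependencies."
auto = "The parser recovers dependencies from the treebank."
score = cosine_similarity(embed(human), embed(auto))
print(round(score, 3))
```

A score near 1 means the automatic summary uses much the same vocabulary as the human one. With real sentence embeddings, the same formula would compare meaning rather than word overlap.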

Figure 1. Evaluation alongside a widely used summarizer

*Performance evaluation was done on a CPU, and the competing technique was stripped down to only its summarization component before timing.
