Module for automatic summarization of text documents and HTML pages.

Project description

Simple library and command-line utility for extracting a summary from HTML pages or plain texts. The package also contains a simple evaluation framework for text summaries. Implemented summarization methods include Luhn, Edmundson, LSA, and LexRank (the methods used in the examples below).
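LexRank, one of the methods used in the examples below, ranks sentences by their centrality in a sentence-similarity graph. Here is a minimal, self-contained sketch of that idea (a degree-centrality simplification, not sumy's actual implementation):

```python
import math
import re
from collections import Counter

def cosine_sim(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    num = sum(a[w] * b[w] for w in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def rank_sentences(sentences):
    """Order sentence indices by total similarity to the other sentences
    (a degree-centrality simplification of LexRank's PageRank step)."""
    bags = [Counter(re.findall(r"\w+", s.lower())) for s in sentences]
    scores = [sum(cosine_sim(b, o) for o in bags if o is not b) for b in bags]
    return sorted(range(len(sentences)), key=scores.__getitem__, reverse=True)

sentences = [
    "The cat sat on the mat.",
    "A dog barked at the cat.",
    "Stock prices rose sharply today.",
]
order = rank_sentences(sentences)  # the off-topic sentence ranks last
```

Sentences that share vocabulary with many others score highly; the unrelated sentence about stock prices ends up last.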


Installation

Make sure you have Python 2.7/3.3+ and pip installed (Windows, Linux). Then run (the preferred way):

$ [sudo] pip install sumy

Or, for the latest development version:

$ [sudo] pip install git+https://github.com/miso-belica/sumy.git

Usage

Sumy provides a command-line utility for quick summarization of documents.

$ sumy lex-rank --length=10 --url=http://en.wikipedia.org/wiki/Automatic_summarization # what's summarization?
$ sumy luhn --language=czech --url=http://www.zdrojak.cz/clanky/automaticke-zabezpeceni/
$ sumy edmundson --language=czech --length=3% --url=http://cs.wikipedia.org/wiki/Bitva_u_Lipan
$ sumy --help # for more info

A summarization method can be evaluated against a reference summary with the commands below:

$ sumy_eval lex-rank reference_summary.txt --url=http://en.wikipedia.org/wiki/Automatic_summarization
$ sumy_eval lsa reference_summary.txt --language=czech --url=http://www.zdrojak.cz/clanky/automaticke-zabezpeceni/
$ sumy_eval edmundson reference_summary.txt --language=czech --url=http://cs.wikipedia.org/wiki/Bitva_u_Lipan
$ sumy_eval --help # for more info
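The evaluation framework compares a generated summary against a reference summary. As an illustration of the kind of metric involved, here is a minimal sketch of a ROUGE-1-style unigram overlap (precision, recall, F1); this is illustrative only, not necessarily the exact metric sumy implements:

```python
from collections import Counter

def rouge1(candidate, reference):
    """ROUGE-1-style unigram overlap between a candidate summary
    and a reference summary: returns (precision, recall, f1)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

p, r, f = rouge1("the cat sat", "the cat sat on the mat")
```

Here every candidate word appears in the reference (precision 1.0), but the candidate covers only half of the reference's words (recall 0.5).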

Python API

Alternatively, you can use sumy as a library in your project:

# -*- coding: utf-8 -*-

from __future__ import absolute_import
from __future__ import division, print_function, unicode_literals

from sumy.parsers.html import HtmlParser
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer as Summarizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words


LANGUAGE = "czech"
SENTENCES_COUNT = 10


if __name__ == "__main__":
    url = "http://www.zsstritezuct.estranky.cz/clanky/predmety/cteni/jak-naucit-dite-spravne-cist.html"
    parser = HtmlParser.from_url(url, Tokenizer(LANGUAGE))
    # or for plain text files
    # parser = PlaintextParser.from_file("document.txt", Tokenizer(LANGUAGE))
    stemmer = Stemmer(LANGUAGE)

    summarizer = Summarizer(stemmer)
    summarizer.stop_words = get_stop_words(LANGUAGE)

    for sentence in summarizer(parser.document, SENTENCES_COUNT):
        print(sentence)
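For intuition, Luhn's method (the first summarizer shown in the CLI examples) scores each sentence by the document-wide frequency of its significant words. Here is a minimal, self-contained sketch of that idea (not sumy's implementation, and with a toy stop-word list):

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "on", "at", "of", "and"}  # toy list for illustration

def luhn_style_summary(text, count=1):
    """Pick the `count` highest-scoring sentences, where a sentence's score is
    the total document frequency of its significant (non-stop) words."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    freq = Counter(w for w in re.findall(r"\w+", text.lower()) if w not in STOP_WORDS)

    def score(sentence):
        return sum(freq[w] for w in re.findall(r"\w+", sentence.lower()) if w not in STOP_WORDS)

    top = sorted(sentences, key=score, reverse=True)[:count]
    return [s for s in sentences if s in top]  # keep original document order

text = "The cat sat on the mat. A dog barked loudly. The cat chased the dog."
summary = luhn_style_summary(text, count=1)
```

The last sentence wins because both "cat" and "dog" occur twice in the document, giving it the highest significant-word score.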

Tests

Setup:

$ pip install pytest pytest-cov

Run the tests with:

$ py.test-2.7 && py.test-3.3 && py.test-3.4 && py.test-3.5
