Skip to main content

Biterm Topic Model

Project description

Biterm Topic Model

This is a simple Python implementation of the awesome Biterm Topic Model. This model is accurate in short text classification. It explicitly models the word co-occurrence patterns in the whole corpus to solve the problem of sparse word co-occurrence at document-level.

Simply install by:

pip install biterm

Load some short texts and vectorize them via sklearn.

    from sklearn.feature_extraction.text import CountVectorizer

    texts = open('./data/reuters.titles').read().splitlines()[:50]
    vec = CountVectorizer(stop_words='english')
    X = vec.fit_transform(texts).toarray()

Get the vocabulary and the biterms from the texts.

    from biterm.utility import vec_to_biterms

    vocab = np.array(vec.get_feature_names())
    biterms = vec_to_biterms(X)

Create a BTM and pass the biterms to train it.

    from biterm.cbtm import oBTM

    btm = oBTM(num_topics=20, V=vocab)
    topics = btm.fit_transform(biterms, iterations=100)

Save a topic plot using pyLDAvis and explore the results! (also see simple_btml.py)

    from biterm.btm import oBTM

    btm = oBTM(num_topics=20, V=vocab)
    topics = btm.fit_transform(biterms, iterations=100)

pyLDAvis Visualization

Inference is done with Gibbs Sampling and it's not really fast. The implementation is not meant for production. But if you have to classify a lot of texts you can try using online learning. Use the Cython version to speed up performance a bit.

import numpy as np
import pyLDAvis
from biterm.cbtm import oBTM 
from sklearn.feature_extraction.text import CountVectorizer
from biterm.utility import vec_to_biterms, topic_summuary # helper functions

if __name__ == "__main__":

    texts = open('./data/reuters.titles').read().splitlines()

    # vectorize texts
    vec = CountVectorizer(stop_words='english')
    X = vec.fit_transform(texts).toarray()

    # get vocabulary
    vocab = np.array(vec.get_feature_names())

    # get biterms
    biterms = vec_to_biterms(X)

    # create btm
    btm = oBTM(num_topics=20, V=vocab)

    print("\n\n Train Online BTM ..")
    for i in range(0, len(biterms), 100): # prozess chunk of 200 texts
        biterms_chunk = biterms[i:i + 100]
        btm.fit(biterms_chunk, iterations=50)
    topics = btm.transform(biterms)

    print("\n\n Visualize Topics ..")
    vis = pyLDAvis.prepare(btm.phi_wz.T, topics, np.count_nonzero(X, axis=1), vocab, np.sum(X, axis=0))
    pyLDAvis.save_html(vis, './vis/online_btm.html')

    print("\n\n Topic coherence ..")
    topic_summuary(btm.phi_wz.T, X, vocab, 10)

    print("\n\n Texts & Topics ..")
    for i in range(len(texts)):
        print("{} (topic: {})".format(texts[i], topics[i].argmax()))

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

biterm-0.1.5.tar.gz (79.7 kB view hashes)

Uploaded Source

Built Distribution

biterm-0.1.5-cp36-cp36m-macosx_10_7_x86_64.whl (62.3 kB view hashes)

Uploaded CPython 3.6m macOS 10.7+ x86-64

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page