Python framework for fast Vector Space Modelling

==============================================
gensim -- Topic Modelling in Python
==============================================

|Travis|_
|Wheel|_

.. |Travis| image:: https://img.shields.io/travis/RaRe-Technologies/gensim/develop.svg
.. |Wheel| image:: https://img.shields.io/pypi/wheel/gensim.svg

.. _Travis: https://travis-ci.org/RaRe-Technologies/gensim
.. _Downloads: https://pypi.python.org/pypi/gensim
.. _License: http://radimrehurek.com/gensim/about.html
.. _Wheel: https://pypi.python.org/pypi/gensim

Gensim is a Python library for *topic modelling*, *document indexing* and *similarity retrieval* with large corpora.
The target audience is the *natural language processing* (NLP) and *information retrieval* (IR) community.

Features
---------

* All algorithms are **memory-independent** w.r.t. the corpus size (can process input larger than RAM, streamed, out-of-core),
* **Intuitive interfaces**

  * easy to plug in your own input corpus/datastream (trivial streaming API)
  * easy to extend with other Vector Space algorithms (trivial transformation API)

* Efficient multicore implementations of popular algorithms, such as online **Latent Semantic Analysis (LSA/LSI/SVD)**,
  **Latent Dirichlet Allocation (LDA)**, **Random Projections (RP)**, **Hierarchical Dirichlet Process (HDP)** or **word2vec deep learning**.
* **Distributed computing**: can run *Latent Semantic Analysis* and *Latent Dirichlet Allocation* on a cluster of computers.
* Extensive `documentation and Jupyter Notebook tutorials <https://github.com/RaRe-Technologies/gensim/#documentation>`_.


If this feature list left you scratching your head, you can first read more about the `Vector
Space Model <http://en.wikipedia.org/wiki/Vector_space_model>`_ and `unsupervised
document analysis <http://en.wikipedia.org/wiki/Latent_semantic_indexing>`_ on Wikipedia.
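
As a rough illustration of the two interface ideas in the feature list, the sketch below treats a corpus as any iterable that yields sparse vectors (lists of ``(term_id, weight)`` pairs) and a transformation as an object that wraps such an iterable lazily. The names ``VOCAB``, ``stream_corpus`` and ``ScaleTransformation`` are hypothetical and exist only for this example; this is plain Python written to show the idea, not gensim's own classes.

```python
from collections import Counter

# Hypothetical toy vocabulary: maps each known token to an integer id.
VOCAB = {"human": 0, "computer": 1, "interface": 2, "graph": 3}


def stream_corpus(lines):
    """Yield one sparse bag-of-words vector per input line.

    `lines` can be any iterable (a list, an open file, a generator),
    so only one document is processed at a time.
    """
    for line in lines:
        counts = Counter(VOCAB[tok] for tok in line.lower().split() if tok in VOCAB)
        yield sorted(counts.items())


class ScaleTransformation:
    """Toy transformation: multiplies every weight by a constant factor."""

    def __init__(self, factor):
        self.factor = factor

    def __getitem__(self, corpus):
        # Return a generator so documents are transformed lazily,
        # one at a time, as the caller iterates.
        return ([(term_id, weight * self.factor) for term_id, weight in doc]
                for doc in corpus)


documents = ["Human computer interface", "graph graph computer", "unrelated words"]
doubled = ScaleTransformation(2.0)[stream_corpus(documents)]
result = list(doubled)
print(result)
```

Because both the corpus and the transformed corpus are generators, nothing is computed until ``list(doubled)`` pulls documents through the pipeline one by one.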

Installation
------------

This software depends on `NumPy and SciPy <http://www.scipy.org/Download>`_, two Python packages for scientific computing.
You must have them installed prior to installing `gensim`.
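
One way to check for the two prerequisites before installing is a short script like the one below. It is only a sketch: the helper name ``missing_packages`` is made up for this example, and it assumes Python 3.4 or later, where ``importlib.util.find_spec`` is available.

```python
import importlib.util


def missing_packages(names):
    """Return the subset of `names` that the import system cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# NumPy and SciPy are the two prerequisites named above.
missing = missing_packages(["numpy", "scipy"])
if missing:
    print("Install these first: " + ", ".join(missing))
else:
    print("NumPy and SciPy are available.")
```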

It is also recommended you install a fast BLAS library before installing NumPy. This is optional, but using an optimized BLAS such as `ATLAS <http://math-atlas.sourceforge.net/>`_ or `OpenBLAS <http://xianyi.github.io/OpenBLAS/>`_ is known to improve performance by as much as an order of magnitude. On OS X, NumPy picks up the BLAS that comes with it automatically, so you don't need to do anything special.

The simple way to install `gensim` is::

    pip install -U gensim

Or, if you have instead downloaded and unzipped the `source tar.gz <http://pypi.python.org/pypi/gensim>`_ package,
you'd run::

    python setup.py test
    python setup.py install


For alternative modes of installation (without root privileges, development
installation, optional install features), see the `documentation <http://radimrehurek.com/gensim/install.html>`_.

This version has been tested under Python 2.7, 3.5 and 3.6. Support for Python 2.6, 3.3 and 3.4 was dropped in gensim 1.0.0; install gensim 0.13.4 if you *must* use Python 2.6, 3.3 or 3.4. Support for Python 2.5 was dropped in gensim 0.10.0; install gensim 0.9.1 if you *must* use Python 2.5. Gensim's GitHub repo is hooked up to `Travis CI for automated testing <https://travis-ci.org/RaRe-Technologies/gensim>`_ on every commit push and pull request.
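
The version rules in the preceding paragraph can be written down as a small lookup. This is an illustrative sketch only: the table name ``LAST_SUPPORTING_RELEASE`` and the function ``legacy_gensim_pin`` are hypothetical, and the function returns ``None`` for any interpreter for which the text names no pinned release.

```python
import sys

# Last gensim release named above for each dropped Python version.
LAST_SUPPORTING_RELEASE = {
    (2, 5): "0.9.1",
    (2, 6): "0.13.4",
    (3, 3): "0.13.4",
    (3, 4): "0.13.4",
}


def legacy_gensim_pin(python_version):
    """Map a (major, minor, ...) version tuple to a pinned pip requirement.

    Returns None when the README names no pinned release for that interpreter.
    """
    key = tuple(python_version[:2])
    release = LAST_SUPPORTING_RELEASE.get(key)
    return None if release is None else "gensim==" + release


print(legacy_gensim_pin(sys.version_info))
```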

How come gensim is so fast and memory efficient? Isn't it pure Python, and isn't Python slow and greedy?
--------------------------------------------------------------------------------------------------------

Many scientific algorithms can be expressed in terms of large matrix operations (see the BLAS note above). Gensim taps into these low-level BLAS libraries, by means of its dependency on NumPy. So while gensim-the-top-level-code is pure Python, it actually executes highly optimized Fortran/C under the hood, including multithreading (if your BLAS is so configured).

Memory-wise, gensim makes heavy use of Python's built-in generators and iterators for streamed data processing. Memory efficiency was one of gensim's `design goals <http://radimrehurek.com/gensim/about.html>`_, and is a central feature of gensim, rather than something bolted on as an afterthought.
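
The streaming idea can be sketched in plain Python. In the example below, the class ``LineCorpus`` and the function ``document_frequencies`` are hypothetical names chosen for illustration, and the one-document-per-line file layout is an assumption; the point is only that a generator-based ``__iter__`` keeps a single document in memory at a time while still allowing repeated passes.

```python
import os
import tempfile


class LineCorpus:
    """Iterate over a text file one document (line) at a time.

    Writing __iter__ as a generator means the corpus can be traversed
    repeatedly, yet only a single line is held in memory at any moment.
    """

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                yield line.lower().split()


def document_frequencies(corpus):
    """Count, in one streaming pass, how many documents contain each token."""
    counts = {}
    for tokens in corpus:
        for token in set(tokens):
            counts[token] = counts.get(token, 0) + 1
    return counts


# Write a tiny demo file; a real corpus could be far larger than RAM.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("graph minors survey\ngraph trees\ntrees trees graph\n")

corpus = LineCorpus(tmp.name)
frequencies = document_frequencies(corpus)
second_pass = sum(1 for _ in corpus)  # the corpus is re-iterable
os.remove(tmp.name)
print(frequencies["graph"], frequencies["trees"], second_pass)
```

Memory use here depends on the vocabulary size (the ``counts`` dictionary), not on the number of documents in the file.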

Documentation
-------------
* `QuickStart`_
* `Tutorials`_
* `Tutorial Videos`_
* `Official Documentation and Walkthrough`_

Citing gensim
-------------

When `citing gensim in academic papers and theses <https://scholar.google.cz/citations?view_op=view_citation&hl=en&user=9vG_kV0AAAAJ&citation_for_view=9vG_kV0AAAAJ:u-x6o8ySG0sC>`_, please use this BibTeX entry::

    @inproceedings{rehurek_lrec,
          title = {{Software Framework for Topic Modelling with Large Corpora}},
          author = {Radim {\v R}eh{\r u}{\v r}ek and Petr Sojka},
          booktitle = {{Proceedings of the LREC 2010 Workshop on New
               Challenges for NLP Frameworks}},
          pages = {45--50},
          year = 2010,
          month = May,
          day = 22,
          publisher = {ELRA},
          address = {Valletta, Malta},
          language={English}
    }

----------------

Gensim is open source software released under the `GNU LGPLv2.1 license <http://www.gnu.org/licenses/old-licenses/lgpl-2.1.en.html>`_.
Copyright (c) 2009-now Radim Rehurek

.. _Official Documentation and Walkthrough: http://radimrehurek.com/gensim/
.. _Tutorials: https://github.com/RaRe-Technologies/gensim/blob/develop/tutorials.md#tutorials
.. _Tutorial Videos: https://github.com/RaRe-Technologies/gensim/blob/develop/tutorials.md#videos
.. _QuickStart: https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/gensim%20Quick%20Start.ipynb
