Use sentence embeddings to create naturally coherent segments of text akin to paragraphs.

cohesive

cohesive is a lightweight segmenter that uses sentence embeddings to split documents into naturally coherent segments akin to paragraphs.

Installation

You can install cohesive using pip:

pip install cohesive

Using cohesive

To start using cohesive, simply import Cohesive and create a new instance of the client:

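from cohesive import Cohesive

# By default, cohesive uses the paraphrase-MiniLM-L6-v2 model, which produces good
# results, but you can pass the name of any sentence-transformers model to the
# Cohesive constructor.
cohesive = Cohesive("msmarco-distilbert-cos-v5")

# Example input: a list of sentences, one string per sentence.
sentences = [
    "Cats are popular pets.",
    "They sleep for most of the day.",
    "The stock market fell sharply on Monday.",
    "Analysts blamed rising interest rates.",
]

# Then call the create_segments method and pass in the list of sentences.
cohesive.create_segments(sentences)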

Currently, cohesive only supports encoders from the sentence-transformers library; additional encoders will be added in the future.
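Because create_segments expects a list of sentences rather than raw text, you need to split your document first. The following is a minimal sketch using only the standard library; split_sentences is a hypothetical helper, not part of cohesive, and the regular expression is deliberately naive (it will mis-split abbreviations such as "Dr."). A dedicated sentence tokenizer is likely a better choice for real documents.

```python
import re


def split_sentences(text):
    """Naively split text into sentences (hypothetical helper, not part of cohesive).

    Splits on whitespace that follows '.', '!' or '?'.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    # Drop empty strings, which appear when the input is empty.
    return [part for part in parts if part]


sentences = split_sentences("Cats purr. Dogs bark! Do fish sing?")
# sentences can now be passed to cohesive.create_segments(sentences)
```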

Tuning cohesive

You can tune several parameters, each of which affects the final segmentation in a different way. Here is a quick summary:

  • window_size: Sets the size of the context window used when generating segments. Defaults to 4.
  • louvain_resolution: Resolution used by the Louvain community detection algorithm when partitioning sentences into segments. Defaults to 1.
  • framework: The library used to calculate similarity scores, either "scipy" or "sklearn". Defaults to "scipy".
  • show_progress_bar: If True, displays the sentence-transformers progress bar while embeddings are generated. Defaults to False.
  • balanced_window: If True, the context window is split evenly between preceding and subsequent sentences; otherwise it only covers subsequent sentences. Defaults to False.
  • exponential_scaling: If True, applies exponential scaling when calculating similarity scores. Defaults to False.
  • max_sentences_per_segment: Maximum number of sentences per segment. Defaults to None.
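To make the interaction between window_size and balanced_window more concrete, here is a minimal sketch of which neighbouring sentences would fall inside the context window for a given sentence. It is an illustration based only on the descriptions above, not cohesive's actual implementation; context_window is a hypothetical helper, and the way an odd window_size is divided is an assumption.

```python
def context_window(index, n_sentences, window_size=4, balanced_window=False):
    """Return the indices of the neighbours in the window around `index`.

    Hypothetical illustration only; cohesive's internals may differ.
    """
    if balanced_window:
        # Assumption: split the window between preceding and subsequent
        # sentences, giving any odd remainder to the subsequent side.
        before = window_size // 2
        after = window_size - before
    else:
        # Unbalanced: only look at subsequent sentences.
        before, after = 0, window_size
    start = max(0, index - before)
    end = min(n_sentences, index + after + 1)
    return [j for j in range(start, end) if j != index]


forward_only = context_window(5, 10)
balanced = context_window(5, 10, balanced_window=True)
```

With ten sentences and the default window_size of 4, sentence 5 is compared with sentences 6 to 9 in the unbalanced case, and with sentences 3, 4, 6 and 7 in the balanced case. Near the start or end of the document the window is simply truncated.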

To change a parameter, pass its name and value as a keyword argument when you call the create_segments method:

# Via create_segments
cohesive.create_segments(sentences, window_size=3, exponential_scaling=True)
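Since the parameters are plain keyword arguments, it is easy to try several combinations and compare how many segments each one produces. The sketch below assumes that get_segment_boundaries returns one entry per segment, which the documentation here does not state explicitly; compare_settings is a hypothetical helper, not part of cohesive.

```python
def compare_settings(segmenter, sentences, settings):
    """Run create_segments once per settings dict and count the segments.

    Hypothetical helper. `segmenter` is expected to be a Cohesive instance.
    Assumption: get_segment_boundaries() returns one entry per segment.
    """
    results = []
    for kwargs in settings:
        segmenter.create_segments(sentences, **kwargs)
        boundaries = segmenter.get_segment_boundaries()
        results.append((kwargs, len(boundaries)))
    return results


# Example usage (requires cohesive to be installed):
# compare_settings(
#     cohesive,
#     sentences,
#     [{"window_size": 3}, {"window_size": 3, "exponential_scaling": True}],
# )
```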

Viewing the segments

When create_segments has finished, cohesive will print a summary of the total number of segments that were created.

There are several ways to inspect the generated segments:

# View a string representation of the consolidated Segment and Sentence objects
cohesive.segments

# List that contains the content of each segment.
cohesive.get_segment_contents()

# View the start and end indices of sentences within a segment.
cohesive.get_segment_boundaries()

# Print the contents of each segment to the console or Notebook.
cohesive.print_segment_contents()
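If you want to rebuild the segment text yourself from the boundaries, something like the following sketch may help. It assumes that each boundary is a (start, end) pair of sentence indices with an inclusive end; the exact return format of get_segment_boundaries is not documented here, so check it before relying on this. slice_segments is a hypothetical helper, not part of cohesive.

```python
def slice_segments(sentences, boundaries):
    """Join the sentences of each segment into one string.

    Hypothetical helper. Assumption: each boundary is a (start, end) pair
    of sentence indices and `end` is inclusive.
    """
    return [" ".join(sentences[start:end + 1]) for start, end in boundaries]


example = slice_segments(
    ["Cats purr.", "They nap.", "Stocks fell.", "Rates rose."],
    [(0, 1), (2, 3)],
)
```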

Download files

Source Distribution

cohesive-0.1.6.tar.gz (10.4 kB)

Built Distribution

cohesive-0.1.6-py3-none-any.whl (11.9 kB, Python 3)
