Skip to main content

A library to extract the main content from html. Developed for information on LLM and for feeding data into LangChain and LlamaIndex.

Project description

main_content_extractor

Description

This library is designed for extracting only the main content from HTML.
It was developed for obtaining information related to LLM and for data input to LangChain and LlamaIndex.

Since this library contains element information and hierarchy information of HTML, it is useful when utilizing them.
For example, it can be helpful in obtaining a list of links or headers from the main content.

While trafilatura is an excellent library for main content extraction, it has issues such as missing necessary data or inability to output HTML.
To address these problems, this library exists.

The sequence of main content extraction is as follows:

image
In addition to HTML format, output in Text format and Markdown format is also supported. This is to make it easier to output data in a format that is more convenient for LLM.

The extraction of main content uses trafilatura.
Since trafilatura cannot output in HTML format, it is output in XML format containing HTML information and then converted to HTML.
The conversion from XML to HTML is irreversible and does not perfectly match the original data.

Installation

pip install MainContentExtractor

HowToUse

import requests
from main_content_extractor import MainContentExtractor

# Get HTML using requests
url = "https://developer.mozilla.org/ja/docs/Web"
response = requests.get(url)
response.encoding = 'utf-8'
content = response.text

# Get HTML with main content extracted from HTML
extracted_html = MainContentExtractor.extract(content)

# Get HTML with main content extracted from Markdown
extracted_markdown = MainContentExtractor.extract(content, output_format="markdown")

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

MainContentExtractor-0.0.4.tar.gz (5.0 kB view hashes)

Uploaded Source

Built Distribution

MainContentExtractor-0.0.4-py3-none-any.whl (5.7 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page