A library to extract the main content from html. Developed for information on LLM and for feeding data into LangChain and LlamaIndex.

These details have not been verified by PyPI

Project links

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Project description

main_content_extractor

Description

This library is designed for extracting only the main content from HTML.
It was developed for obtaining information related to LLM and for data input to LangChain and LlamaIndex.

Since this library contains element information and hierarchy information of HTML, it is useful when utilizing them.
For example, it can be helpful in obtaining a list of links or headers from the main content.

While trafilatura is an excellent library for main content extraction, it has issues such as missing necessary data or inability to output HTML.
To address these problems, this library exists.

The sequence of main content extraction is as follows:

In addition to HTML format, output in Text format and Markdown format is also supported. This is to make it easier to output data in a format that is more convenient for LLM.

The extraction of main content uses trafilatura.
Since trafilatura cannot output in HTML format, it is output in XML format containing HTML information and then converted to HTML.
The conversion from XML to HTML is irreversible and does not perfectly match the original data.

Installation

pip install MainContentExtractor

HowToUse

import requests
from main_content_extractor import MainContentExtractor

# Get HTML using requests
url = "https://developer.mozilla.org/ja/docs/Web"
response = requests.get(url)
response.encoding = 'utf-8'
content = response.text

# Get HTML with main content extracted from HTML
extracted_html = MainContentExtractor.extract(content)

# Get HTML with main content extracted from Markdown
extracted_markdown = MainContentExtractor.extract(content, output_format="markdown")

Project details

These details have not been verified by PyPI

Project links

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Release history Release notifications | RSS feed

This version

0.0.4

Dec 10, 2023

0.0.3

Dec 10, 2023

0.0.2

Dec 10, 2023

0.0.1

Dec 10, 2023

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

MainContentExtractor-0.0.4.tar.gz (5.0 kB view hashes)

Uploaded Dec 10, 2023 Source

Built Distribution

MainContentExtractor-0.0.4-py3-none-any.whl (5.7 kB view hashes)

Uploaded Dec 10, 2023 Python 3

Hashes for MainContentExtractor-0.0.4.tar.gz

Hashes for MainContentExtractor-0.0.4.tar.gz
Algorithm	Hash digest
SHA256	`697acc05909fb2f786d9cf7d4ff5bfbf14e4c3359c3a6eadc7ed4403fc2e66e5`
MD5	`7ee8d9663c17dbb737498d1421c6fd3f`
BLAKE2b-256	`01de634b620e845f48bf27cbe66816e60f0fdb12414f77c8916af60aec508b0d`

Hashes for MainContentExtractor-0.0.4-py3-none-any.whl

Hashes for MainContentExtractor-0.0.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`77684179436e28eb2e19be26657cb2bbd7c1f9213a2c3ee163a8f9dfbca64107`
MD5	`b4b41514a88112eb1b8be45980d83ed2`
BLAKE2b-256	`766232c33101b179d373d753d7c892b19f2ec22978b6c3c36d17a4a61d2169b6`