
A module to parse metadata out of documents

Project description

MetadataParser is a python module for pulling metadata out of web documents.

It requires BeautifulSoup, and was largely based on Erik River's opengraph module (https://github.com/erikriver/opengraph).

I needed something more aggressive than Erik's module, so I had to fork it.

Installation

pip install metadata_parser

Features

  • it pulls as much metadata out of a document as possible

  • you can set a 'strategy' for finding metadata (i.e. only accept OpenGraph or page attributes)

Notes

  1. This requires BeautifulSoup 4.

  2. For speed, it will instantiate a BeautifulSoup parser with lxml, and fall back to 'none' (the internal pure-Python parser) if it can't load lxml.

  • It is HIGHLY recommended that you install lxml. It is considerably faster.

You should also use a very recent version of lxml. I've had problems with segfaults on some versions < 2.3.x; I would suggest using the most recent 3.x if possible.
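The parser fallback described above can be sketched as follows. `choose_parser` is an illustrative helper, not part of the MetadataParser API; it only picks the parser name that would be handed to BeautifulSoup:

```python
import importlib.util

def choose_parser():
    """Prefer lxml for speed; fall back to Python's built-in HTML parser
    when lxml is not importable. (Illustrative sketch, not the library's
    actual internals.)"""
    if importlib.util.find_spec("lxml") is not None:
        return "lxml"
    return "html.parser"

print(choose_parser())
```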

The default 'strategy' is to look in this order:

og,dc,meta,page

  • og = OpenGraph

  • dc = DublinCore

  • meta = metadata

  • page = page elements

You can specify a strategy as a comma-separated list of the above.
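The strategy lookup can be sketched as a walk over per-source buckets of collected metadata. The `metadata` dict and `get_field` helper below are illustrative, not the library's actual internals:

```python
# Hypothetical metadata collected from a page, keyed by source.
metadata = {
    "og": {"title": "OG Title"},
    "dc": {},
    "meta": {"title": "Meta Title"},
    "page": {"title": "Page Title"},
}

def get_field(field, strategy="og,dc,meta,page"):
    """Return the first value found for `field`, trying each source named
    in the comma-separated `strategy` string in order."""
    for bucket in strategy.split(","):
        value = metadata.get(bucket, {}).get(field)
        if value is not None:
            return value
    return None

print(get_field("title"))               # first hit under the default order
print(get_field("title", "page,og"))    # 'page' wins under this order
```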

The only two page elements currently supported are:

<title>VALUE</title> -> metadata['page']['title']
<link rel="canonical" href="VALUE"> -> metadata['page']['link']

The MetadataParser object also wraps some convenience functions, which can also be used on their own, designed to turn alleged URLs into well-formed URLs.

For example, you may pull a page:

http://www.example.com/path/to/file.html

and that file indicates a canonical URL which is simply "/file.html".

This package will try to 'remount' the canonical URL to the absolute URL "http://www.example.com/file.html". It will return None if the end result is not a valid URL.

This all happens under the hood, and is honestly really useful when dealing with indexers and spiders.
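The remounting behaviour described above can be sketched with the standard library's `urljoin`. The `remount_canonical` helper name is illustrative, not part of the MetadataParser API:

```python
from urllib.parse import urljoin, urlparse

def remount_canonical(page_url, canonical):
    """Resolve a possibly-relative canonical URL against the page URL,
    returning None when the result is not an absolute http(s) URL.
    (Illustrative sketch of the behaviour described above.)"""
    candidate = urljoin(page_url, canonical)
    parsed = urlparse(candidate)
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return candidate
    return None

print(remount_canonical("http://www.example.com/path/to/file.html", "/file.html"))
# -> http://www.example.com/file.html
```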

Usage

From a URL

>>> import metadata_parser
>>> page = metadata_parser.MetadataParser(url="http://www.cnn.com")
>>> print(page.metadata)
>>> print(page.get_field('title'))
>>> print(page.get_field('title', strategy='og'))
>>> print(page.get_field('title', strategy='page,og,dc'))

From HTML

>>> HTML = """<here>"""
>>> page = metadata_parser.MetadataParser(html=HTML)
>>> print(page.metadata)
>>> print(page.get_field('title'))
>>> print(page.get_field('title', strategy='og'))
>>> print(page.get_field('title', strategy='page,og,dc'))

Project details

Source distribution: metadata_parser-0.6.0.tar.gz (8.1 kB)

Hashes for metadata_parser-0.6.0.tar.gz:

  • SHA256: 4d256ac8c2429b744c29385bbe9ae18f0ad80a4b59beb378d5d54b2823b0bf4b

  • MD5: ee168887989a70b7306574e32c02e90f

  • BLAKE2b-256: b056a3e328438335e01673d29238437dc7e1a8c9c2f8d37b3d7c4e6a83c2f80e
