
A module to parse metadata out of documents

Project description

MetadataParser is a python module for pulling metadata out of web documents.

It requires BeautifulSoup, and was largely based on Erik River's opengraph module (https://github.com/erikriver/opengraph).

I needed something more aggressive than Erik's module, so I had to fork it.

Installation

pip install metadata_parser

Features

  • it pulls as much metadata out of a document as possible

  • you can set a 'strategy' for finding metadata (i.e. only accept OpenGraph or page attributes)

Notes

  1. This requires BeautifulSoup 4.

  2. For speed, it will instantiate a BeautifulSoup parser with lxml, and fall back to 'none' (the internal pure-Python parser) if it can't load lxml.

  • It is HIGHLY recommended that you install lxml. It is considerably faster.

You should also use a very recent version of lxml. I've had problems with segfaults on some versions < 2.3.x; I would suggest using the most recent 3.x if possible.
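The parser fallback described above can be sketched as follows. `choose_parser` is an illustrative helper, not part of the MetadataParser API; it only picks the parser name that would be handed to BeautifulSoup:

```python
import importlib.util

def choose_parser():
    """Prefer lxml for speed; fall back to Python's built-in HTML parser
    when lxml is not importable. (Illustrative sketch, not the library's
    actual internals.)"""
    if importlib.util.find_spec("lxml") is not None:
        return "lxml"
    return "html.parser"

print(choose_parser())
```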

The default 'strategy' is to look in this order:

og,dc,meta,page

  • og = OpenGraph

  • dc = DublinCore

  • meta = metadata

  • page = page elements

You can specify a strategy as a comma-separated list of the above.
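The strategy lookup can be sketched as a walk over per-source buckets of collected metadata. The `metadata` dict and `get_field` helper below are illustrative, not the library's actual internals:

```python
# Hypothetical metadata collected from a page, keyed by source.
metadata = {
    "og": {"title": "OG Title"},
    "dc": {},
    "meta": {"title": "Meta Title"},
    "page": {"title": "Page Title"},
}

def get_field(field, strategy="og,dc,meta,page"):
    """Return the first value found for `field`, trying each source named
    in the comma-separated `strategy` string in order."""
    for bucket in strategy.split(","):
        value = metadata.get(bucket, {}).get(field)
        if value is not None:
            return value
    return None

print(get_field("title"))               # first hit under the default order
print(get_field("title", "page,og"))    # 'page' wins under this order
```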

The only two page elements currently supported are:

<title>VALUE</title> -> metadata['page']['title']
<link rel="canonical" href="VALUE"> -> metadata['page']['link']

The MetadataParser object also wraps some convenience functions, which can also be used on their own, designed to turn alleged URLs into well-formed URLs.

For example, you may pull a page:

http://www.example.com/path/to/file.html

and that file indicates a canonical URL which is simply "/file.html".

This package will try to 'remount' the canonical URL to the absolute URL "http://www.example.com/file.html". It will return None if the end result is not a valid URL.

This all happens under the hood, and is honestly really useful when dealing with indexers and spiders.
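The remounting behaviour described above can be sketched with the standard library's `urljoin`. The `remount_canonical` helper name is illustrative, not part of the MetadataParser API:

```python
from urllib.parse import urljoin, urlparse

def remount_canonical(page_url, canonical):
    """Resolve a possibly-relative canonical URL against the page URL,
    returning None when the result is not an absolute http(s) URL.
    (Illustrative sketch of the behaviour described above.)"""
    candidate = urljoin(page_url, canonical)
    parsed = urlparse(candidate)
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return candidate
    return None

print(remount_canonical("http://www.example.com/path/to/file.html", "/file.html"))
# -> http://www.example.com/file.html
```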

Usage

From a URL

>>> import metadata_parser
>>> page = metadata_parser.MetadataParser(url="http://www.cnn.com")
>>> print(page.metadata)
>>> print(page.get_field('title'))
>>> print(page.get_field('title', strategy='og'))
>>> print(page.get_field('title', strategy='page,og,dc'))

From HTML

>>> HTML = """<here>"""
>>> page = metadata_parser.MetadataParser(html=HTML)
>>> print(page.metadata)
>>> print(page.get_field('title'))
>>> print(page.get_field('title', strategy='og'))
>>> print(page.get_field('title', strategy='page,og,dc'))

Project details

Source distribution: metadata_parser-0.6.0.tar.gz (8.1 kB)

Hashes for metadata_parser-0.6.0.tar.gz:

  • SHA256: 4d256ac8c2429b744c29385bbe9ae18f0ad80a4b59beb378d5d54b2823b0bf4b

  • MD5: ee168887989a70b7306574e32c02e90f

  • BLAKE2b-256: b056a3e328438335e01673d29238437dc7e1a8c9c2f8d37b3d7c4e6a83c2f80e
