Skip to main content

OAIPMH harvester library for Invenio. Development only, not for production.

Project description

OARepo OAI-PMH harvester

An OAI-PMH harvesing library for Invenio 3.5+. The library provides initial transformation of OAI-PMH payload to an intermediary json representation which is later on transformed by a specific transformer to invenio records.

Due to their generic nature, these transformers are not part of this library but have to be provided by an application.

The progress and transformation errors are captured within the database.

For now, the library does not provide error notifications, but these will be added. A sentry might be used for the logging & reporting.

Installation

poetry add oarepo-oaipmh-harvester

Configuration

All configuration is inside the database model OAIHarvesterConfig. There is a command-line tool to add a new config:

invenio oaiharvester add \
  --code nusl \
  --name NUŠL \
  --url "http://invenio.nusl.cz/oai2d/" \
  --set global \
  --prefix marcxml 
  --transformer nusl_oai.transformer.NuslTransformer

This will register an oai-pmh harvester with code "nusl", its url, oai set and metadata prefix. Records from this harvester will be transformed by the NuslTransformer before they are written to the repository.

Options:

Usage: invenio oaiharvester add [OPTIONS]

Options:
  --code TEXT         OAI server code  [required]
  --name TEXT         OAI server name  [required]
  --url TEXT          OAI base url  [required]
  --set TEXT          OAI set  [required]
  --prefix TEXT       OAI metadata prefix  [required]
  --parser TEXT       OAI metadata parser. If not passed, a prefix-based default is used
  --transformer TEXT  Transformer class  [required]

Usage

Command-line

On command line, invoke

oaiharvester harvest nusl <optional list of oai identifiers to harvest>

Options:

  -a, --all-records  Re-harvest all records, not from the last timestamp
  --background       Start Harvest on background (via celery task), return immediately
  --dump-to TEXT     Do not import records, just dump (cache) them to this
                     directory (mostly for debugging)
  --load-from TEXT   Do not contact oai-pmh server but load the records from
                     this directory (created by dump-to option)

Celery task

@shared_task
def oai_harvest(
        harvester_id: str, 
        start_from: str, 
        load_from: str = None, 
        dump_to: str = None,
        on_background=False, 
        identifiers=None):
    """
    @param harvester_id: id of the harvester configuration (OAIHarvesterConfig) object
    @param start_from: datestamp (either YYYY-MM-DD or YYYY-MM-DDThh:mm:ss, 
           depends on the OAI endpoint), inclusive
    @param load_from: if set, a path to the directory on the filesystem where 
           the OAI-PMH data are present (in *.json.gz files)
    @param dump_to: if set, harvested metadata will be parsed from xml to json 
           and stored into this directory, not to the repository
    @param on_background: if True, transformation and storage will be started in celery tasks and can run in parallel.
           If false, they will run sequentially inside this task
    @param identifiers: if load_from is set, it is a list of file names within the directory. 
           If load_from is not set, these are oai identifiers for GetRecord. If not set at all, all records from 
           start_from are harvested 
    """

Harvest status

Each harvest creates a row in OAIHarvestRun database table containing first and last datestamps and harvest status (running, completed, errored, ...)

A run is split into a chunk of records and each chunk is represented in OAIHarvestRunBatch database table. It contains a chunk status (running, completed, warning, failed, ...) and a list of identifiers harvested and their status (ok, warning during harvesting the identifier, harvesting the identifier failed). The table also contains details of the warnings/errors.

Custom parsers and transformers

The input OAI xml is at first parsed via parsers into a json format.

MARC-XML and DC parsers are supported out of the box. See the section below if you need a different parser

The JSON is then transformed into an invenio record via a transformer class. As different repositories use different semantic of fields (even in MARC), this step can not be generic and implementor is required to provide his/her own transformer class.

Transformer

A simple transformer, that transforms just the title from MARC-XML input might look like:

from typing import List
from oarepo_oaipmh_harvester import OAITransformer, OAIRecord, OAIHarvestRunBatch

from my_record.proxies import current_service
from my_record.records.api import MyRecord

class NuslTransformer(OAITransformer):
    oaiidentifier_search_property = 'metadata_systemIdentifiers_identifier'
    # the name of service filter that accesses the record's OAI identifier
    oaiidentifier_search_path = ('metadata', 'systemIdentifiers', 'identifier')
    # path to the oai record identifier inside the record

    # invenio service that will be used to create/update the record
    record_service = current_service
    # invenio record for this record
    record_model = MyRecord 
    

    def transform_single(self, rec: OAIRecord):
        # add all your transformations here
        rec.transformed.update({
            'metadata': {
                'title': rec['24500a']
            }
        })

Parser

A parser is responsible for transforming the XML document into an intermediary JSON.

For implementation details see MarcxmlParser.

Project details


Release history Release notifications | RSS feed

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

oarepo-oai-pmh-harvester-3.0.5.tar.gz (17.7 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

oarepo_oai_pmh_harvester-3.0.5-py3-none-any.whl (20.3 kB view details)

Uploaded Python 3

File details

Details for the file oarepo-oai-pmh-harvester-3.0.5.tar.gz.

File metadata

  • Download URL: oarepo-oai-pmh-harvester-3.0.5.tar.gz
  • Upload date:
  • Size: 17.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.0 CPython/3.9.13

File hashes

Hashes for oarepo-oai-pmh-harvester-3.0.5.tar.gz
Algorithm Hash digest
SHA256 ba3f2ce25e237b15cb3828d6f032796aae873b110ea9ec24ace378c0d9b50329
MD5 44d375e00d2d57a08ce53cfd35067ebc
BLAKE2b-256 d1e696c2983f293b391105d92cac4b369fd91c0ef1bea6e61409752dbe113c9e

See more details on using hashes here.

File details

Details for the file oarepo_oai_pmh_harvester-3.0.5-py3-none-any.whl.

File metadata

File hashes

Hashes for oarepo_oai_pmh_harvester-3.0.5-py3-none-any.whl
Algorithm Hash digest
SHA256 b3f3ccaa5331f11e4f0e029b66cdece6c3a0495bd6b753a5dc25abb431162dc0
MD5 35af5bd2925481d7dde3944cc3783e3e
BLAKE2b-256 f5e166389368cc6f65a0b020be9c43b5d13c35025c4f3421f5a4c1335f397e4a

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page