OAIPMH harvester

OARepo OAI-PMH harvester

An OAI-PMH harvesting library for Invenio 12+. The library provides an initial transformation of the OAI-PMH payload to an intermediary JSON representation, which is later transformed by a specific transformer into the format of Invenio records.

Due to their generic nature, these transformers are not part of this library but have to be provided by an application.

The progress and transformation errors are captured within the database.

For now, the library does not provide error notifications, but these will be added. Sentry might be used for logging and reporting.

Installation

pip install oarepo-oaipmh-harvester
pip install <your transformer library>

Configuration

All configuration is inside the OAIHarvesterRecord record. There is a command-line tool to add a new record:

invenio oarepo oai harvester add nusl \
    --name "NUSL harvester" \
    --url http://invenio.nusl.cz/oai2d/ \
    --set global \
    --prefix marcxml \
    --loader sickle \
    --transformer marcxml \
    --transformer nusl \
    --writer 'service{service=nr_documents}'

This registers an OAI-PMH harvester with the code "nusl", together with its URL, OAI set and metadata prefix. Records from this harvester are loaded with the sickle loader (the default if no loader is specified). They are first transformed from MARCXML to JSON and then by the NUSL transformer into nr_documents-compatible JSON.

The JSON is then used by the service writer to create or update the target record.
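
The flow described above (load, apply the transformers in the order given, then write) can be sketched as follows. This is only an illustration of the chaining idea: run_pipeline, marcxml_to_json and nusl_to_documents are hypothetical names invented for this sketch, not part of the library, and the real loader, transformers and writer are configured through the harvester record.

```python
from typing import Any, Callable, Iterable

# Hypothetical stand-in type: any callable that maps one payload to the next form.
Transformer = Callable[[Any], Any]


def run_pipeline(
    loaded: Iterable[Any],
    transformers: list[Transformer],
    write: Callable[[Any], None],
) -> int:
    """Pass every loaded payload through the transformers in order, then write it."""
    count = 0
    for payload in loaded:
        for transform in transformers:
            payload = transform(payload)
        write(payload)
        count += 1
    return count


def marcxml_to_json(payload: str) -> dict:
    # Placeholder for the generic "marcxml" step (assumption, not the real code).
    return {"raw": payload}


def nusl_to_documents(data: dict) -> dict:
    # Placeholder for the application-specific "nusl" step (assumption).
    return {"metadata": {"source": data["raw"]}}


written: list[dict] = []
processed = run_pipeline(
    ["<record>1</record>", "<record>2</record>"],
    [marcxml_to_json, nusl_to_documents],
    written.append,  # stands in for the service writer
)
```

The order of the list mirrors the order of the --transformer options: the output of one step is the input of the next.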

Usage

Command-line

On the command line, invoke:

invenio oarepo oai harvester run nusl <optional list of oai identifiers to harvest>

Options:

  --all-records      Re-harvest all records, not from the last timestamp
  --on-background    Run harvest on background (via celery task)
  --identifier       Harvest the passed identifier/s

You can also pass the arguments of "harvester add" to override the defaults from the configuration.

Harvest status

The harvester uses two additional database tables to store the progress of the harvest:

  • Run represents a single run of the harvester.
  • OAI record is a link between a harvested record and its original. If harvesting the record fails, the OAI record is still created and its errors field is filled with the error.
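
To make the relationship between the two tables concrete, here is a rough in-memory sketch. All class and field names (RunSketch, OAIRecordSketch, harvester_code, record_id and so on) are assumptions made for illustration; only the existence of an errors field comes from the description above, and the real database models may look quite different.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class RunSketch:
    """Hypothetical stand-in for one run of a harvester."""
    harvester_code: str
    started: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    finished: Optional[datetime] = None


@dataclass
class OAIRecordSketch:
    """Hypothetical stand-in for the link between an OAI identifier and a record."""
    oai_identifier: str
    run: RunSketch
    record_id: Optional[str] = None  # may stay empty when harvesting failed
    errors: list[str] = field(default_factory=list)

    @property
    def failed(self) -> bool:
        # The entry exists even on failure; the errors field tells the two cases apart.
        return bool(self.errors)


run = RunSketch("nusl")
ok = OAIRecordSketch("oai:example:1", run, record_id="abc-123")
bad = OAIRecordSketch("oai:example:2", run, errors=["missing title"])
failed_ids = [r.oai_identifier for r in (ok, bad) if r.failed]
```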

Custom parsers and transformers

Transformer

A transformer is a piece of code that receives a StreamEntry (or a StreamBatch) instance, processes it and returns the modified StreamEntry. An example is a MarcXML transformer that takes the string with the XML representation of the entry and transforms it into a simple JSON representation {abcxy: value(s)}, where abcxy is the MARC field code.

See the oarepo_runtime.datastreams.transformers package for the StreamEntry/StreamBatch interfaces.
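
As a rough illustration of what such a transformer does, the sketch below flattens a MARCXML string into a dictionary of field codes. It deliberately does not import oarepo_runtime: EntrySketch is a hypothetical stand-in for StreamEntry, the apply method name is an assumption, and the key layout (tag, two indicators with blanks written as "_", then the subfield code) is one possible reading of "abcxy", not necessarily the one the real MarcXML transformer uses.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import Any

MARC_NS = "{http://www.loc.gov/MARC21/slim}"


@dataclass
class EntrySketch:
    """Hypothetical stand-in for StreamEntry; the real interface may differ."""
    entry: Any
    errors: list = field(default_factory=list)


class MarcXMLTransformerSketch:
    def apply(self, stream_entry: EntrySketch) -> EntrySketch:
        try:
            root = ET.fromstring(stream_entry.entry)
        except ET.ParseError as exc:
            # Record the problem on the entry instead of aborting the whole stream.
            stream_entry.errors.append(str(exc))
            return stream_entry
        result: dict[str, list[str]] = {}
        for datafield in root.iter(f"{MARC_NS}datafield"):
            tag = datafield.get("tag", "")
            ind1 = (datafield.get("ind1") or " ").replace(" ", "_")
            ind2 = (datafield.get("ind2") or " ").replace(" ", "_")
            for sub in datafield.iter(f"{MARC_NS}subfield"):
                key = f"{tag}{ind1}{ind2}{sub.get('code', '')}"
                result.setdefault(key, []).append(sub.text or "")
        stream_entry.entry = result
        return stream_entry


SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="245" ind1="1" ind2=" ">
    <subfield code="a">An example title</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">harvesting</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">metadata</subfield>
  </datafield>
</record>"""

result = MarcXMLTransformerSketch().apply(EntrySketch(SAMPLE))
broken = MarcXMLTransformerSketch().apply(EntrySketch("<broken"))
```

Repeated fields are collected into a list under the same key, which matches the "value(s)" wording above.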

The transformer needs to be registered:

# mypkg.transformers

from .impl import MyTransformer

my_transformer = {"class": MyTransformer, "params": {
       # default parameters that will go to the MyTransformer constructor
}}

And in setup.cfg:

# setup.cfg

[options.entry_points]
oarepo.oaipmh.transformers =
    my_transformer = mypkg.transformers:my_transformer

Then you can use my_transformer when creating your harvester.

Reader

A reader is responsible for fetching records and creating a stream of StreamEntry items. See oarepo_runtime.datastreams.readers for details. Register the reader in the oarepo.oaipmh.readers entry point, using the same syntax as for transformers.
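
The shape of a reader can be sketched without any network access. In the example below, InMemoryReaderSketch and EntrySketch are hypothetical stand-ins written for this illustration; the real base class, constructor arguments and StreamEntry fields are defined in oarepo_runtime and may differ. A real reader would fetch the payloads from an OAI-PMH endpoint instead of a list.

```python
from dataclasses import dataclass, field
from typing import Any, Iterable, Iterator


@dataclass
class EntrySketch:
    """Hypothetical stand-in for StreamEntry."""
    entry: Any
    context: dict = field(default_factory=dict)


class InMemoryReaderSketch:
    """Yields one entry per (identifier, payload) pair it was given."""

    def __init__(self, records: Iterable[tuple[str, str]]):
        self._records = list(records)

    def __iter__(self) -> Iterator[EntrySketch]:
        for identifier, payload in self._records:
            # Keep the OAI identifier next to the payload so later steps can report on it.
            yield EntrySketch(entry=payload, context={"oai_identifier": identifier})


reader = InMemoryReaderSketch(
    [("oai:example:1", "<record/>"), ("oai:example:2", "<record/>")]
)
identifiers = [e.context["oai_identifier"] for e in reader]
```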

Permissions

Harvester permissions

The harvester has its own permission presets called harvesters. By default, the permissions allow the repository owner to perform all operations.

If the repository owner wants to delegate the harvesting to other users, they need to set the harvestors property on the harvester record to the user ids/emails of the users who should be allowed to harvest the records. This allows these users to see the harvester settings and run the harvest.

Run permissions

The run record is created by the harvester, and its permissions cannot be changed. It is visible only to the repository owner and to the user who started the run.

OAI record permissions

The OAI record is created by the harvester, and its permissions cannot be changed. It is visible only to the repository owner and to the user who started the run.
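
The rules in this section can be summarised as two checks. The sketch below only restates the prose in code: HarvesterSketch, can_run_harvest and can_see_run_record are hypothetical names, users are represented by plain strings, and the real permissions are enforced through the library's permission presets rather than functions like these.

```python
from dataclasses import dataclass, field


@dataclass
class HarvesterSketch:
    """Hypothetical stand-in for a harvester record."""
    owner: str
    harvestors: set[str] = field(default_factory=set)  # delegated users


def can_run_harvest(user: str, harvester: HarvesterSketch) -> bool:
    # The owner may do everything; delegated users may see settings and run the harvest.
    return user == harvester.owner or user in harvester.harvestors


def can_see_run_record(user: str, harvester: HarvesterSketch, started_by: str) -> bool:
    # Run and OAI records are visible to the owner and to whoever started the run.
    return user in (harvester.owner, started_by)


harvester = HarvesterSketch(owner="owner@example.org", harvestors={"alice@example.org"})
alice_can_run = can_run_harvest("alice@example.org", harvester)
bob_can_run = can_run_harvest("bob@example.org", harvester)
bob_sees_alices_run = can_see_run_record("bob@example.org", harvester, "alice@example.org")
```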
