We are agilely building Finland's metadata repository for research datasets, which will serve the research data services of the Ministry of Education and Culture (OKM), e.g. TPAS, IDA and Etsin. This blog covers topical and often still open questions in the development work, and nothing presented here represents any official policy. Instead, we hope for feedback and open discussion.

Wednesday, 20 June 2018

Metax & OAI-PMH metadata harvesting

The soon-to-be-launched Metax is the metadata repository at the heart of the Fairdata services. Since it does not have a graphical user interface, all interactions are handled through APIs. The Metax REST API provides a set of restricted endpoints that integrated services use to manipulate the state of the repository, as well as openly available read-only access to data about datasets, data catalogs and schemas.

In order to match the capabilities of the current, soon to be deprecated Etsin service, Metax also acts as an OAI-PMH data provider. The OAI-PMH specification defines a set of actions and an XML-based container format for harvesting metadata according to different schemas. The API can be used for bulk harvesting, where all the metadata is downloaded in its entirety, or for selective harvesting based on sets and/or the modification date of the record.

Currently Metax exposes the following sets for harvesting:

  • att_dataset - datasets that consist of external/remote resources.
  • ida_datasets - datasets whose content is stored and maintained by the IDA service.
  • datasets - records from both the ATT and IDA catalogs.
  • datacatalogs - list of available data catalogs. This also includes catalogs that are populated with externally harvested content.
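As a rough illustration of the difference between bulk and selective harvesting, here is a minimal sketch that builds OAI-PMH ListRecords request URLs. The query parameters (verb, metadataPrefix, set, from, until) come from the OAI-PMH specification; the base URL is a made-up placeholder, since the real endpoint address is not given here, and the helper name is hypothetical.

```python
from urllib.parse import urlencode

# Hypothetical endpoint address, used only for illustration.
BASE_URL = "https://metax.example.org/oai/"


def list_records_url(metadata_prefix, set_spec=None, from_date=None, until_date=None):
    """Build an OAI-PMH ListRecords request URL.

    Without set/from/until this is a bulk harvest; adding any of
    them turns it into a selective harvest.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    if from_date:
        params["from"] = from_date
    if until_date:
        params["until"] = until_date
    return BASE_URL + "?" + urlencode(params)


# Bulk harvest of everything in oai_dc format.
bulk_url = list_records_url("oai_dc")

# Selective harvest: only the ida_datasets set, modified since 1 June 2018.
selective_url = list_records_url("oai_dc", set_spec="ida_datasets", from_date="2018-06-01")
```

The same helper works for any of the sets listed above by changing the set_spec argument.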

Harvested records contain a header with an identifier, a timestamp and a possible set specification, and a metadata section that conforms to the requested metadata format. There has been a lot of internal discussion about versions and identifiers in Metax (see this blog post). The OAI-PMH interface uses the metadata identifier (i.e. the metadata version identifier), as opposed to the dataset's preferred identifier, for its dataset records. Metadata identifiers are always UUIDs created internally by Metax. Some of the dataset identifiers are also URNs generated by Metax, but they can also be, for example, DOIs assigned and maintained outside Metax and the Fairdata services. Data catalogs are a purely internal concept, so a single identifier is used to refer to both the metadata and the actual catalog.
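To make the record structure concrete, the following sketch reads the header of a single harvested record with Python's standard library. The element names and namespace are those of the OAI-PMH specification; the sample record, including its UUID, timestamp and set, is invented for illustration and is not real Metax output.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the OAI-PMH 2.0 specification.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Invented sample record; the identifier is a made-up UUID.
SAMPLE_RECORD = """<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header>
    <identifier>123e4567-e89b-12d3-a456-426614174000</identifier>
    <datestamp>2018-06-20T12:00:00Z</datestamp>
    <setSpec>ida_datasets</setSpec>
  </header>
  <metadata/>
</record>"""


def parse_header(record_xml):
    """Return the identifier, datestamp and set memberships of one record."""
    record = ET.fromstring(record_xml)
    header = record.find("oai:header", OAI_NS)
    return {
        "identifier": header.findtext("oai:identifier", namespaces=OAI_NS),
        "datestamp": header.findtext("oai:datestamp", namespaces=OAI_NS),
        # setSpec is optional and repeatable, so collect it as a list.
        "sets": [s.text for s in header.findall("oai:setSpec", OAI_NS)],
    }


header_info = parse_header(SAMPLE_RECORD)
```

Note that the identifier read here would be the metadata version identifier, not the dataset's preferred identifier.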

Metax currently supports two formats for metadata output: the simple oai_dc and the more complex and more usable DataCite 4.1. The OAI-PMH output is geared towards harvesting through "standard" formats, and the full data according to Metax's internal data model is available through the REST API. There are, however, a couple of deviations from the DataCite specification. The specification only allows DOIs as the primary identifier for the dataset, but at least for now, the most prominent type of identifier is a URN that has been minted by Metax itself. Also, the content of the language element is expressed using a three-letter code instead of a two-letter one. These are hopefully small potatoes for the consumers of the data and something that can be fixed as the development of Metax moves along.
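A consumer that needs strictly conforming values could work around both deviations on its own side. The sketch below is one possible approach, assuming the three-letter values are ISO 639-3 language codes; the mapping table is a tiny hand-written sample, the function names are hypothetical, and the example identifiers are invented.

```python
# Partial, hand-written mapping from three-letter (ISO 639-3) to
# two-letter (ISO 639-1) language codes; extend as needed.
ISO639_3_TO_1 = {"fin": "fi", "swe": "sv", "eng": "en"}


def normalize_language(code):
    """Return a two-letter language code, or None if it cannot be mapped."""
    code = code.strip().lower()
    if len(code) == 2:
        return code  # already in two-letter form
    return ISO639_3_TO_1.get(code)


def identifier_type(identifier):
    """Classify a primary identifier by its prefix (a simple heuristic)."""
    value = identifier.strip().lower()
    if value.startswith("urn:"):
        return "URN"
    if value.startswith("doi:") or value.startswith("10."):
        return "DOI"
    return "unknown"


language = normalize_language("fin")
urn_kind = identifier_type("urn:nbn:fi:example-1234")  # invented URN
doi_kind = identifier_type("10.1234/example")          # invented DOI
```

Returning None for unmapped codes leaves it to the caller to decide whether to keep the original value or drop it.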

What is still missing from the implementation is the handling of deleted records. When a user removes a dataset record, Metax flags it as deleted and retains the actual record. This would allow us to implement persistent handling of deleted records in the OAI-PMH interface. The twist that complicates the implementation is that the URN resolver is going to use the OAI-PMH interface as its source data. The resolver is responsible for providing redirection from [identifier] addresses to the Fairdata Etsin URLs. Should the identifiers of deleted datasets still resolve to a page in Etsin that says that the dataset was deleted? This would be beneficial, for example, if the page contains links to newer/other versions of the deleted dataset.
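For reference, the OAI-PMH specification marks a deleted record with a status="deleted" attribute on its header and omits the metadata section. Metax does not do this yet, so the sketch below only shows how a harvester, such as the resolver, might tell live and deleted records apart once it does. The sample response and its identifiers are invented, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the OAI-PMH 2.0 specification, in ElementTree's
# "{namespace}tag" notation.
OAI = "{http://www.openarchives.org/OAI/2.0/}"

# Invented sample: one deleted record and one live record.
SAMPLE_RESPONSE = """<ListRecords xmlns="http://www.openarchives.org/OAI/2.0/">
  <record>
    <header status="deleted">
      <identifier>made-up-id-1</identifier>
      <datestamp>2018-06-20T12:00:00Z</datestamp>
    </header>
  </record>
  <record>
    <header>
      <identifier>made-up-id-2</identifier>
      <datestamp>2018-06-19T08:30:00Z</datestamp>
    </header>
    <metadata/>
  </record>
</ListRecords>"""


def split_deleted(list_records_xml):
    """Return (live_ids, deleted_ids) based on the header status attribute."""
    root = ET.fromstring(list_records_xml)
    live, deleted = [], []
    for header in root.iter(OAI + "header"):
        identifier = header.findtext(OAI + "identifier")
        if header.get("status") == "deleted":
            deleted.append(identifier)
        else:
            live.append(identifier)
    return live, deleted


live_ids, deleted_ids = split_deleted(SAMPLE_RESPONSE)
```

A resolver built on this could keep redirecting the deleted identifiers to a tombstone page instead of dropping them.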
