Using OAI-PMH protocol for Data Ingest into VIVO Instances

Author: Alexandre Rademaker (IBM Research and FGV)
Violeta Ilik (Texas A&M University)

Created: 2014-09-05 Fri 10:42


The Motivation
  • Digital libraries (repositories) are, among others, an excellent source of data for VIVO instances
  • Data is usually curated by specialists (librarians)
  • Almost all institutions are implementing digital libraries for theses, dissertations, technical reports, etc.
  • Both FGV and Texas A&M use DSpace
OAI-PMH Protocol
  • Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) version 2.0 (http://www.openarchives.org)
  • Low-barrier mechanism for repository interoperability.
  • Data Providers are repositories that expose structured metadata via OAI-PMH.
  • Service Providers make OAI-PMH service requests to harvest that metadata.
  • OAI-PMH defines six verbs (services) that are invoked over HTTP.
  • Implementations as data providers: DSpace, OJS, OCS etc.
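To make the "six verbs over HTTP" point concrete, here is a minimal sketch of how a service provider could assemble a request URL. The helper name `oai_request_url` and the base URL and set name are hypothetical, chosen only for illustration; the verb names and the `metadataPrefix`, `set` and `from` arguments come from the OAI-PMH 2.0 specification.

```python
from urllib.parse import urlencode

# The six verbs defined by OAI-PMH 2.0.
OAI_VERBS = (
    "Identify", "ListMetadataFormats", "ListSets",
    "ListIdentifiers", "ListRecords", "GetRecord",
)


def oai_request_url(base_url, verb, **params):
    """Build the GET URL for a single OAI-PMH request (hypothetical helper)."""
    if verb not in OAI_VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    query = {"verb": verb}
    for key, value in params.items():
        if value is not None:
            # 'from' is a Python keyword, so callers pass 'from_' instead.
            query[key.rstrip("_")] = value
    return f"{base_url}?{urlencode(query)}"


# Hypothetical repository endpoint and set name, for illustration only.
BASE = "http://repository.example.org/oai/request"
print(oai_request_url(BASE, "ListRecords", metadataPrefix="oai_dc", set="col_123_456"))
```

The function only builds the URL; fetching it and following any resumptionToken returned by the repository is left to the harvester.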
Metadata Formats
  • Each case will have its own problems and benefits
  • METS doesn't expose language tags. See here and here
  • DIM exposes language tags. See here.
  • OAI_DC doesn't expose the qualified fields (e.g. contributor.advisor and contributor.other are merged into contributor)
Our Cases
Data Quality

Some examples:

  • Incomplete qualifiers. See here.
  • Subjects, topics and controlled vocabularies. See again here.
  • Lack of identifiers for entities. A person is only identified by a name.
Retrieving theses from DSpace
  • Decide which metadata format to use.
    1. If we use the rdf metadata format, we just need to transform RDF/XML to RDF-VIVO using SPARQL.
      insert
      { graph <http://www.fgv.br/vivo/import/>
        { [ vivo:relates ?thesis ;
            vivo:relates ?person ;
            a vivo:Authorship ;
            vivo:rank "1"^^xsd:int ] .
        } }
      where { ?thesis dc:author ?person ;
                rdf:type ow:Publication . }
    2. If we use another format such as METS, we need to transform XML to RDF-VIVO using XSLT. See the input and the transformation.
Current XSLT mets2vivo Limitations

METS differences in DSpace versions

  <mods:title>A DSP embedded optical naviagtion system</mods:title>


  <mods:titleInfo>A internacionalização de uma
    empresa brasileira de serviços de saúde na década
    de 1990: estudo de caso sobre AMIL</mods:titleInfo>
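One way to cope with the two shapes shown above is to look for the title in either element. The sketch below does this in Python rather than XSLT, purely for illustration; it is not the mets2vivo stylesheet. The function name `extract_title` and the wrapping `mods:mods` element in the samples are assumptions; the MODS namespace URI is the standard one.

```python
import xml.etree.ElementTree as ET

MODS_NS = {"mods": "http://www.loc.gov/mods/v3"}


def extract_title(mods_root):
    """Return the title text from mods:title, or from mods:titleInfo if it holds the text itself."""
    for path in (".//mods:title", ".//mods:titleInfo"):
        for node in mods_root.iterfind(path, MODS_NS):
            # Collapse line breaks and runs of spaces into single spaces.
            text = " ".join((node.text or "").split())
            if text:
                return text
    return None


# The two shapes from the slide, wrapped in an assumed mods:mods root element.
shape_a = ET.fromstring(
    '<mods:mods xmlns:mods="http://www.loc.gov/mods/v3">'
    "<mods:title>A DSP embedded optical naviagtion system</mods:title>"
    "</mods:mods>"
)
shape_b = ET.fromstring(
    """<mods:mods xmlns:mods="http://www.loc.gov/mods/v3">
  <mods:titleInfo>A internacionalização de uma
    empresa brasileira de serviços de saúde na década
    de 1990: estudo de caso sobre AMIL</mods:titleInfo>
</mods:mods>"""
)

print(extract_title(shape_a))
print(extract_title(shape_b))
```

The same fallback order could be expressed in XSLT by matching `mods:title` first and `mods:titleInfo` text second.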
Next Steps
  • Retrieve sets (collections)
  • Apply the transformation
  • Deduplicate and validate the data in an external triple store
  • Ingest data into VIVO
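Since people are identified only by name in the harvested metadata, the deduplication step might start from a normalized name key. The following is a rough sketch under that assumption, not the approach used in the oai-client code; `name_key` and `group_duplicates` are hypothetical helpers.

```python
import unicodedata
from collections import defaultdict


def name_key(name):
    """Normalize a personal name into a key that ignores case, accents and name order."""
    # Turn "Family, Given" into "Given Family".
    if "," in name:
        family, given = name.split(",", 1)
        name = f"{given} {family}"
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    unaccented = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Drop periods, lowercase, and collapse whitespace.
    return " ".join(unaccented.replace(".", " ").lower().split())


def group_duplicates(names):
    """Group names that share the same key; return only groups with more than one member."""
    groups = defaultdict(list)
    for name in names:
        groups[name_key(name)].append(name)
    return [group for group in groups.values() if len(group) > 1]


print(group_duplicates(["Rademaker, Alexandre", "Alexandre Rademaker", "Violeta Ilik"]))
```

Matching on names alone can merge distinct people who share a name, so the groups are best treated as candidates for review rather than confirmed duplicates.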
Data ingest into VIVO
  • Two options: (1) SPARQL Update API; (2) filegraph directory.
  • To clear the vitro-kb-2 graph before a new ingest:
update=clear graph <http://vitro.mannlib.cornell.edu/default/vitro-kb-2>
  • To ingest new-data.rdf:
update=LOAD <http://nlp.emap.fgv.br/new-data.rdf> into 
 graph <http://vitro.mannlib.cornell.edu/default/vitro-kb-2>
  • The command line:
curl -i -d 'email=MYUSER' -d 'password=MYPASS' -d '@FILE.sparql' 
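The same POST can be prepared from Python. This is a minimal sketch: the form fields `email`, `password` and `update` are the ones used in the curl command and update statements above, while the endpoint URL and the helper name `build_update_request` are assumptions for illustration (the host is a placeholder).

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_update_request(endpoint, email, password, sparql_update):
    """Prepare a form-encoded POST carrying one SPARQL Update statement (hypothetical helper)."""
    body = urlencode(
        {"email": email, "password": password, "update": sparql_update}
    ).encode("utf-8")
    return Request(endpoint, data=body, method="POST")


# Assumed endpoint on a placeholder host; replace with the real VIVO instance.
ENDPOINT = "http://vivo.example.org/api/sparqlUpdate"
CLEAR = "CLEAR GRAPH <http://vitro.mannlib.cornell.edu/default/vitro-kb-2>"

req = build_update_request(ENDPOINT, "MYUSER", "MYPASS", CLEAR)
print(req.get_method(), req.full_url)
```

The request is only constructed here; passing `req` to `urllib.request.urlopen` would actually send it.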

Thank you!


Slides will be available at http://arademaker.github.com. Our VIVO instance is VIVO@FGV. The code is at https://github.com/arademaker/oai-client