VIVO Apps and Tools Webinar

Introduction

Yesterday I presented to the Apps and Tools working group my workflow for preparing data to be inserted into the FGV VIVO instance. Since some people asked me to share the links and the file that I used to guide the presentation, I wrote this post.

This page is generated from an org file that I export to HTML and further process with Jekyll. The process of using org files with Jekyll is outlined here. I plan to improve this workflow, but it works for my personal website and for the websites of the courses that I teach at FGV.

The Toolset

This is a non-comprehensive list of the tools that I use; I am listing the main ones that come to mind and that should be of interest to others.

Data Sources

We have two main sources of data: (1) the FGV researchers’ curricula vitae from the Lattes Platform; and (2) the FGV Digital Library.

During the webinar I shared my screen and presented the web interface that CNPq provides for researchers to update their resumes. I also showed my own curriculum vitae and discussed why I do not consider Brazilian researchers’ curricula vitae open data, given the captcha that blocks crawlers from getting XML files from the Lattes website in batch mode.

I forgot to mention during the webinar that one of my old dreams is to convince CNPq to add RDFa or https://schema.org markup to the HTML pages of the curricula vitae. This would not only allow crawlers to process the data more easily but could also facilitate the maintenance of the system. My current XSLT transformation of Lattes XML to RDF could provide the starting point for the RDFa embedding.

The FGV Digital Library runs DSpace. The publication and collection metadata are easily collected from DSpace using the OAI-PMH protocol (for instance, with ListRecords requests using the oai_dc metadata prefix).

Lattes XML Files

As I said in the previous section, the curricula vitae of Brazilian researchers are not open data, that is, they are not publicly available in a structured format. They are only publicly available as HTML pages on the Lattes website, with a limited search interface. The only way for universities and research institutions to get the curricula vitae of their researchers in a structured format is to sign an agreement with CNPq. FGV has signed this agreement and we have one server authorized to access the CNPq web server to retrieve the XML files of all curricula that report any professional activity with FGV.

To transform the Lattes files to RDF, I use an XSLT transformation that I developed a few years ago. The XSLT is freely available on GitHub in the Semantic Lattes repository.

In this repository I also made available the DTD, which I try hard to keep up to date. Unfortunately, until recently CNPq did not publicly announce changes in the structure of the XML files that they produce, so I had to adapt the DTD whenever I identified changes. I have just found, at the end of this page, that CNPq finally realized the importance of making the updated DTD available. Nevertheless, the DTD linked from that page is outdated: using it to validate the 489 FGV curricula I got more than 100 errors, whereas using my DTD to validate the same files I got only 2 errors. Considering that those 489 files were produced by CNPq itself, we have two options: (1) the DTD is outdated; or (2) the code that produces the XML files has bugs. The two errors that I found using my DTD occur only in two curricula that have not been updated in the last two years.

After downloading the XML files, the general idea for processing them and producing the RDF files is outlined in the code below:

# ROOT points to the data directory and REPO to the Semantic Lattes checkout
for f in "$ROOT"/ontos/xml/*.xml; do
    ID=$(basename "$f" .xml)
    echo "Processing $ID"
    # validate against the maintained DTD, logging any problems
    xmllint --noout --dtdvalid "$REPO/LMPLCurriculo.DTD" "$f" 2>> error.log
    # transform the curriculum XML into RDF, passing the Lattes ID as a parameter
    xsltproc --stringparam ID "$ID" "$REPO/lattes.xsl" "$f" > "$ID.rdf"
done

After that, I import the RDF files into AllegroGraph, making each curriculum a separate named graph so I can easily identify the provenance of each triple. The import is done using the agload utility. The load takes approximately 2 minutes:

Load finished 487 sources in 00:02:03 (123.02 seconds).  
Triples added: 1,690,538 
Average Rate: 13742.00 tps
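
Since each curriculum ends up in its own named graph, a simple sanity check after loading is to list every graph with the number of triples it contributes. A query along these lines, run against the repository’s SPARQL endpoint, does the job:

select ?g (count(*) as ?triples)
where { graph ?g { ?s ?p ?o } }
group by ?g
order by desc(?triples)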

Data Deduplication

I briefly commented on the deduplication of records during the webinar. I have to take care of removing duplicated resources that describe the same entity. Consider a thesis defended by a student at FGV whose advisor is a professor at FGV. I will have metadata (triples) about this thesis from three different sources: (1) the RDF produced from the advisor’s Lattes curriculum XML; (2) the RDF produced from the student’s Lattes curriculum XML; and (3) the RDF obtained from the FGV Digital Library.
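
One way to spot such duplicates in the triple store is to group candidate resources by a shared key, for example a title, and count how many graphs describe each one. The class and property names below are only illustrative; the actual vocabulary depends on what my XSLT and the Digital Library export produce:

prefix bibo:    <http://purl.org/ontology/bibo/>
prefix dcterms: <http://purl.org/dc/terms/>

select ?title (count(distinct ?g) as ?graphs)
where { graph ?g { ?thesis a bibo:Thesis ; dcterms:title ?title . } }
group by ?title
having (count(distinct ?g) > 1)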

The current code that I use to identify duplicated resources is a Common Lisp library that is easily used if placed inside the local-projects directory of a Quicklisp installation.
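
The matching logic itself lives in the Lisp code, but its general shape can be sketched as a SPARQL update that links resources of the same type sharing a distinguishing property, again using illustrative names:

prefix owl:     <http://www.w3.org/2002/07/owl#>
prefix bibo:    <http://purl.org/ontology/bibo/>
prefix dcterms: <http://purl.org/dc/terms/>

insert { ?a owl:sameAs ?b . }
where
{ graph ?g1 { ?a a bibo:Thesis ; dcterms:title ?title . }
  graph ?g2 { ?b a bibo:Thesis ; dcterms:title ?title . }
  # link each pair only once (assumes the resources are identified by IRIs)
  filter( ?g1 != ?g2 && str(?a) < str(?b) )
}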

I could write an entire article just about deduplication in RDF. I am still thinking hard about this problem and would really like to find better alternatives. Note that deduplication of nodes in an RDF graph should not be done type by type as I am doing now: the rules that identify resources as referring to the same entity can depend on each other. That is, the deduplication of instances of foaf:Person can trigger the rule that deduplicates instances of bibo:Article and vice versa. It would be better to have a kind of fixed-point transformation on the RDF graph that keeps clustering nodes until nothing more can be done. As a logician, I am very interested in approaching this problem in a more declarative and deductive way.

I also have to note that owl:sameAs semantics does not help here. I do use owl:sameAs to mark the nodes that should be merged, but I have to merge the nodes myself after all owl:sameAs triples are produced. I do this with two SPARQL DELETE/INSERT update queries:

prefix owl: <http://www.w3.org/2002/07/owl#>

delete { ?s1 ?p ?o . }
insert { ?s2 ?p ?o . }
where {
  ?s1 owl:sameAs ?s2 .
  ?s1 ?p ?o .
  filter( !sameTerm(?p, owl:sameAs) )
} ;

delete { ?x ?p ?o1 . }
insert { ?x ?p ?o2 . }
where {
  ?o1 owl:sameAs ?o2 .
  ?x ?p ?o1 .
  filter( !sameTerm(?p, owl:sameAs) )
}

Note that the filters block the propagation of the owl:sameAs triples.
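
If the owl:sameAs links are not needed once the merge is done, a possible cleanup step (not part of the two queries above) is to drop them with a one-line update:

prefix owl: <http://www.w3.org/2002/07/owl#>
delete where { ?s owl:sameAs ?o . }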

Mapping Lattes RDF to VIVO RDF

To map the Lattes RDF model produced by my XSLT to the expected VIVO RDF model, I have to look carefully at each instance of the data. This mapping is not complete, but at this point I have already mapped most of the data about people, publications, research areas and departments.

To work on the rules and queries that transform the data, I used the query and data browsing tools developed by Franz: Gruff and AllegroGraph WebView. During the webinar I presented both systems.

The mapping is developed as rules that are easily tested with CWM. One example of a rule is:

@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix vivo: <http://vivoweb.org/ontology/core#> .

{ ?dept foaf:member ?person ;
        rdf:type foaf:Group . } =>
{ [ vivo:relates ?dept ;
    vivo:relates ?person ;
    a vivo:FacultyPosition ;
    rdfs:label "Professor Adjunto"@pt ] . } .

Rules like the one above are placed in an N3 file and executed by CWM, which receives the rule file and the data file and produces the output data file. Unfortunately, CWM does not have good performance and I have not even tried to use it with all the data; I develop the rules and test them with only one curriculum vitae file.

Once I finish testing the rules, I rewrite them as SPARQL queries. The one above becomes:

prefix vivo: <http://vivoweb.org/ontology/core#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix foaf: <http://xmlns.com/foaf/0.1/>
prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

insert
{ graph <http://www.fgv.br/vivo/import/>
  {
   [ vivo:relates ?dept ;
     vivo:relates ?person ;
     a vivo:FacultyPosition ;
     rdfs:label "Professor Adjunto"@pt ] .
  }
}
where
{ ?dept foaf:member ?person ;
        rdf:type foaf:Group .
}

Note that: (1) the query produces blank nodes that need to be transformed into regular nodes (skolemized) before being loaded into VIVO; (2) all created triples are placed in a separate graph; and (3) if this query is executed twice it will generate duplicated and dispensable triples. This last point is the most important limitation of SPARQL for me: CWM only executes a rule when necessary, and the rules do not have to explicitly declare any condition to avoid the unnecessary creation of triples.

It is still not clear to me whether all SPARQL queries can be rewritten to prevent the unnecessary creation of triples. Moreover, I do not want to maintain overly complicated SPARQL queries.
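
For this particular query, a FILTER NOT EXISTS guard seems to be enough to make it safe to run more than once, although it already makes the query heavier to read. This is a sketch of the idea, not the query I currently run:

prefix vivo: <http://vivoweb.org/ontology/core#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix foaf: <http://xmlns.com/foaf/0.1/>
prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

insert
{ graph <http://www.fgv.br/vivo/import/>
  {
   [ vivo:relates ?dept ;
     vivo:relates ?person ;
     a vivo:FacultyPosition ;
     rdfs:label "Professor Adjunto"@pt ] .
  }
}
where
{ ?dept foaf:member ?person ;
        rdf:type foaf:Group .
  # skip department/person pairs that already have a position in the import graph
  filter not exists
  { graph <http://www.fgv.br/vivo/import/>
    { ?pos a vivo:FacultyPosition ;
           vivo:relates ?dept ;
           vivo:relates ?person . }
  }
}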

More in the next post.