How to merge multiple RDF files into a single RDF file? The first idea
would be to convert each RDF file in ntriples and just concatenate
them using unix cat
utility, right? No, it doesn’t work with blank
nodes (or BNodes)! BNodes from different files with the same ID would
be merged as a single resource and this is not the expected semantics,
BNodes from different files are different resources, even if they have
the same id.
The rapper
is a utility from the package Redland. Below I am
presented the files and the number of triples on each one.
$ rapper -c -i ntriples wordnet-en-fixed.nt rapper: Parsing returned 3517504 triples $ rapper -c -i ntriples own-pt-fixed.nt rapper: Parsing returned 824916 triples
The oldest tool to support merging of RDF files is CWM. CWM is written in python and its performance is really bad. The command below hasn’t finish after 5 minutes.
/usr/local/cwm-1.2.1/cwm --ntriples own-pt-fixed.nt wordnet-en-fixed.nt > tudo-cwm.nt
Next tool that I tried was $RDFpro$. The performance was excellent,
only 11 seconds! But we must add a parameter -w
to force BNodes in
input files to be renamed to avoid possible clashes. Actually, it
doesn’t make sense to me why this is not the default behaviour.
$ rdfpro @r -w own-pt-fixed.nt wordnet-en-fixed.nt @w tudo-pro.nt 14:45:53(I) 4342420 triples read (377077 tr/s avg) 14:45:53(I) 4342420 triples written (377077 tr/s avg) 14:45:53(I) Done in 11 s
Next tool, riot
from the Jena library. The performance was not bad,
it took twice the time of $RDFpro$ but it finished. The only
problem is that it complained about some IRI that no other tool
complained.
$ time riot own-pt-fixed.nt wordnet-en-fixed.nt > tudo-riot.nt 14:51:14 WARN riot :: [line: 282756, col: 1 ] Bad IRI: <https://w3id.org/own-pt/wn30-pt/instances/word-IJsselmeer> Code: 47/NOT_NFKC in PATH: The IRI is not in Unicode Normal Form KC. 14:51:14 WARN riot :: [line: 282756, col: 1 ] Bad IRI: <https://w3id.org/own-pt/wn30-pt/instances/word-IJsselmeer> Code: 56/COMPATIBILITY_CHARACTER in PATH: Bad character ... real 0m27.398s user 0m29.905s sys 0m1.751s
I don’t like warnnings so I tried the safe path. I converted the
ntriple file with these strange IRIs to RDF/XML and called riot
again. No warnnings this time, good!
$ rapper -i ntriples -o rdfxml own-pt-fixed.nt > own-pt-fixed.rdf rapper: Serializing with serializer rdfxml rapper: Parsing returned 824916 triples $ riot --time own-pt-fixed.rdf wordnet-en-fixed.nt > tudo-riot.nt own-pt-fixed.rdf : 14.84 sec 824,916 triples 55,602.32 TPS wordnet-en-fixed.nt : 21.82 sec 3,517,504 triples 161,175.95 TPS Total : 36.66 sec 4,342,420 triples 118,451.17 TPS
But the output produced does have some errors! The IRIs are not encoded as the way the ntriples specification requires.
$ rapper -c -i ntriples tudo-riot.nt rapper: Parsing URI tudo-riot.nt with parser ntriples rapper: Error - URI tudo-riot.nt:117668 column 55 - Non-printable ASCII character 195 (0xC3) found. rapper: Error - URI tudo-riot.nt:117668 column 56 - Non-printable ASCII character 162 (0xA2) found.
By the way, for the future, I will use $RDFpro$.