Merging RDF files

How can we merge multiple RDF files into a single RDF file? The first idea would be to convert each file to N-Triples and just concatenate them with the Unix cat utility, right? No, that doesn’t work with blank nodes (BNodes)! BNodes from different files that share the same ID would be merged into a single resource, and that is not the expected semantics: BNodes from different files are different resources, even if they have the same ID.
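To make the clash concrete, here is a minimal Python sketch of what a merge tool has to do: give the BNodes of each file their own namespace before concatenating. The function name rename_bnodes and the prefixes f1/f2 are made up for this illustration, and the parsing is deliberately simplified (one well-formed triple per line), so treat it as a sketch rather than a replacement for a real tool.

```python
def rename_bnodes(lines, prefix):
    """Prefix every blank node label found in subject or object position.

    Simplified: assumes one well-formed N-Triples statement per line.
    """
    out = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Subject and predicate never contain whitespace; the rest is the
        # object followed by the terminating '.'
        subj, pred, rest = line.split(None, 2)
        obj = rest.rstrip()[:-1].rstrip()
        if subj.startswith("_:"):
            subj = "_:" + prefix + subj[2:]
        if obj.startswith("_:"):  # literals start with '"', so they are untouched
            obj = "_:" + prefix + obj[2:]
        out.append(f"{subj} {pred} {obj} .")
    return out


# Two hypothetical files that happen to reuse the label _:b1
file_a = ['_:b1 <http://example.org/name> "Alice" .']
file_b = ['_:b1 <http://example.org/name> "Bob" .']

naive = file_a + file_b  # what cat does: both triples now describe one node
merged = rename_bnodes(file_a, "f1") + rename_bnodes(file_b, "f2")

print({t.split()[0] for t in naive})   # a single subject label
print({t.split()[0] for t in merged})  # two distinct subject labels
```

With plain concatenation the two names end up attached to the same resource; after renaming, each file keeps its own BNode.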

rapper is a utility from the Redland package. Below are the files I used and the number of triples in each one.

$ rapper -c -i ntriples wordnet-en-fixed.nt
rapper: Parsing returned 3517504 triples

$ rapper -c -i ntriples own-pt-fixed.nt
rapper: Parsing returned  824916 triples

The oldest tool that supports merging RDF files is CWM. CWM is written in Python and its performance is really bad: the command below hadn’t finished after 5 minutes.

$ /usr/local/cwm-1.2.1/cwm --ntriples own-pt-fixed.nt wordnet-en-fixed.nt > tudo-cwm.nt

The next tool I tried was RDFpro. The performance was excellent, only 11 seconds! But we must add the -w parameter to force BNodes in the input files to be renamed, which avoids possible clashes. Actually, it doesn’t make sense to me why this is not the default behaviour.

$ rdfpro @r -w own-pt-fixed.nt wordnet-en-fixed.nt @w tudo-pro.nt
14:45:53(I) 4342420 triples read (377077 tr/s avg)
14:45:53(I) 4342420 triples written (377077 tr/s avg)
14:45:53(I) Done in 11 s

The next tool was riot, from the Jena library. The performance was not bad: it took more than twice the time of RDFpro, but it finished. The only problem is that it complained about an IRI that no other tool complained about.

$ time riot own-pt-fixed.nt wordnet-en-fixed.nt > tudo-riot.nt
14:51:14 WARN riot :: [line: 282756, col: 1 ] Bad IRI: <Ĳsselmeer> Code: 47/NOT_NFKC in PATH: The IRI is not in Unicode Normal Form KC.
14:51:14 WARN riot :: [line: 282756, col: 1 ] Bad IRI: <Ĳsselmeer> Code: 56/COMPATIBILITY_CHARACTER in PATH: Bad character

real	0m27.398s
user	0m29.905s
sys	0m1.751s
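The two warnings can be reproduced outside Jena. The sketch below uses only Python's standard unicodedata module and assumes that the IRI in the warning starts with the single ligature character Ĳ (U+0132) rather than the two letters I and J. The helper name nfkc_report is made up for this example.

```python
import unicodedata


def nfkc_report(text):
    """Return (is_already_nfkc, nfkc_form) for a string."""
    normalized = unicodedata.normalize("NFKC", text)
    return normalized == text, normalized


# Assumption: the place name is spelled with the ligature U+0132
iri_path = "\u0132sselmeer"

ok, suggestion = nfkc_report(iri_path)
print(ok)          # whether the string is already in NFKC
print(suggestion)  # the NFKC form, with the ligature expanded
print(unicodedata.name(iri_path[0]))
```

The ligature is a compatibility character, so NFKC normalization replaces it with plain letters, which is consistent with both the NOT_NFKC and the COMPATIBILITY_CHARACTER messages.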

I don’t like warnings, so I tried the safe path: I converted the N-Triples file with these strange IRIs to RDF/XML and called riot again. No warnings this time, good!

$ rapper -i ntriples -o rdfxml own-pt-fixed.nt  > own-pt-fixed.rdf
rapper: Serializing with serializer rdfxml
rapper: Parsing returned 824916 triples

$ riot --time own-pt-fixed.rdf wordnet-en-fixed.nt > tudo-riot.nt
own-pt-fixed.rdf : 14.84 sec  824,916 triples  55,602.32 TPS
wordnet-en-fixed.nt : 21.82 sec  3,517,504 triples  161,175.95 TPS
Total : 36.66 sec  4,342,420 triples  118,451.17 TPS

But the output does have some errors! The IRIs are not encoded the way the N-Triples specification requires.

$ rapper -c -i ntriples tudo-riot.nt

rapper: Parsing URI tudo-riot.nt with parser ntriples
rapper: Error - URI tudo-riot.nt:117668 column 55 - Non-printable ASCII character 195 (0xC3) found.
rapper: Error - URI tudo-riot.nt:117668 column 56 - Non-printable ASCII character 162 (0xA2) found.
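The bytes 0xC3 0xA2 reported by rapper are the UTF-8 encoding of a single character, â (U+00E2). The original N-Triples grammar is ASCII-only and expects such characters to be written as \uXXXX escapes, while riot apparently wrote raw UTF-8. As a rough illustration, the Python sketch below escapes every non-ASCII character in a line. The function name escape_non_ascii and the example IRI are made up here, and this is not how any of the tools above does it internally.

```python
def escape_non_ascii(text):
    """Replace each non-ASCII character with an N-Triples style escape."""
    parts = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            parts.append(ch)                  # plain ASCII is kept as-is
        elif cp <= 0xFFFF:
            parts.append("\\u%04X" % cp)      # 4 hex digits for the BMP
        else:
            parts.append("\\U%08X" % cp)      # 8 hex digits beyond the BMP
    return "".join(parts)


# The two bytes from the rapper error decode to one character
print(bytes([0xC3, 0xA2]).decode("utf-8"))

# Hypothetical IRI containing that character
print(escape_non_ascii("<http://example.org/\u00e2>"))
```

Running every line of the riot output through a function like this should give a file that an ASCII-only N-Triples parser accepts, assuming the non-ASCII characters are the only problem.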

So, for the future, I will use RDFpro.