Wolf in Sheep's Clothing (Part II)

A while back I experimented with the Java client API of AllegroGraph to talk to a triple store.

The latest release (V3.1.1) also sports a Python client, which immediately piqued my interest, for several reasons:

  • It uses a new JSON-based HTTP protocol to talk to the AllegroGraph server.
  • Its API follows that of Sesame.

The following simply goes through the basic motions, as described in the Python tutorial.

Installation

Setting things up under Debian Linux is trivial as the requirements are already packaged:

apt-get install python-cjson python-pycurl

For Mac OS X I chose the path via macports.org.

port install python_select
python_select python25
port install py25-curl
port install py25-cjson

Once you have unpacked the agraph distribution, you will find a python directory holding the client code. To make this effective, make sure that your python interpreter picks that up:

export PYTHONPATH=/where/ever/agraph-fje-3.1.1/python

For some reason under OS X the Python code is kept under DISTFJE/python/.

After that, the pydoc command works, too:

pydoc franz.openrdf.repository.repository

Fill in the dots

To start the server the documentation instructs you to use (one line):

./AllegroGraphServer --new-http-port 8080
                     --new-http-auth sacklpicker:catbert
                     --new-http-catalog /tmp/scratch/

That works, but only under Linux. Under OS X the program will loudly complain that it does not understand the --new-http-catalog option. With some guesswork it turns out that the option there is --new-http-db, as ./AllegroGraphServer --help cryptically insinuates:

--new-http-db directory
  :: .....................

The HTTP authentication can also be dropped. I cannot imagine that anyone will seriously consider exposing the server to the open Internet.

Another caveat concerns the use of ~ in the invocation. A

./AllegroGraphServer --new-http-catalog ~/tmp/scratch/

is internally translated into

./AllegroGraphServer --new-http-catalog /tmp/scratch/

which caused some head, uhm, scratches.
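One way to sidestep the surprise is to let the shell do the expansion itself, for instance via $HOME, so the server only ever sees an absolute path. A minimal sketch (the catalog path is just the one from the example above):

```shell
# The shell expands $HOME before the server sees the argument,
# so the resulting path is already absolute.
CATALOG="$HOME/tmp/scratch/"
echo "$CATALOG"
# then: ./AllegroGraphServer --new-http-catalog "$CATALOG"
```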

Open Sesame, Open Sesame

From now on everything is downhill: first you get a server object, just to investigate the catalogs that are available there:

from franz.openrdf.sail.allegrographserver import AllegroGraphServer

server = AllegroGraphServer("localhost", port=8080)
print server.listCatalogs()

Amazingly enough, this always works, even without any authentication. Only when you want to open a particular catalog do you definitely need proper authorization:

server.username = "sacklpicker"
server.password = "catbert"

catalog = server.openCatalog('scratch')
print catalog.listRepositories()

As each catalog can hold several repositories (models in RDF-speak), you will have to narrow in on one first. That can be a newly created one or an existing one. In the AllegroGraph tradition, all this is controlled with proper constants:

from franz.openrdf.repository.repository   import Repository

r = Repository(catalog, "catlitter", Repository.RENEW)
r.initialize()

Initialization seems to be important and necessary; it is not packaged into the constructor.

Nota bene: The Repository.RENEW fails under the OS X version with "There is already a store named 'catlitter'". Looks like a bug to me, as it works flawlessly under Linux.

Factory

From the repository one can obtain a factory, obviously an artificial construct which allows you to mint RDF objects:

f           = r.getValueFactory()
subject     = f.createURI("http://cata.log/sacklpicker")

If you want to have more explicit control over the namespace handling, then a slightly cumbersome

ns = "http://cata.log/"
sacklpicker = f.createURI(namespace = ns, localname="sacklpicker")
Cat         = f.createURI(namespace = ns, localname="Cat")

is the way to go. I just wonder whether the namespace handling could have been moved into the factory object. I mean, if we already have that.
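To illustrate what such factory-level namespace handling might look like: a small wrapper that binds the namespace once and then mints URIs by local name only. This is purely hypothetical (neither NamespaceFactory nor StubFactory is part of the client API; the stub merely stands in for the real value factory):

```python
# Hypothetical convenience wrapper, not part of the AllegroGraph client:
# bind the namespace once, then create URIs by local name alone.
class NamespaceFactory(object):
    def __init__(self, factory, namespace):
        self.factory = factory
        self.namespace = namespace

    def createURI(self, localname):
        return self.factory.createURI(namespace=self.namespace,
                                      localname=localname)

# Stub standing in for the real value factory, just to show the idea.
class StubFactory(object):
    def createURI(self, namespace, localname):
        return namespace + localname

f = NamespaceFactory(StubFactory(), "http://cata.log/")
print(f.createURI("sacklpicker"))  # http://cata.log/sacklpicker
```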

Transacting with the Store

From the repository object you also have to generate a connection object first, at least if you want to modify or query the repository.

c = r.getConnection()

The code reveals that there can actually only be one per repository:

def getConnection(self):
    if not self.connection:
        self.connection = RepositoryConnection(self)
    return self.connection

So this is just following Sesame conventions.

Once you have gotten hold of the connection, you can insert triples into the store:

from franz.openrdf.vocabulary.rdf import RDF
c.add (sacklpicker, RDF.TYPE, Cat)

hates  = f.createURI(namespace = ns, localname="hates")
tomcat = f.createURI(namespace = ns, localname="tomcat")
c.add (sacklpicker, hates, tomcat)
c.add (tomcat, RDF.TYPE, Cat)

And you can ask for the current repository (uhm, connection, whatever) size:

print "Triple count: ", c.size()

And via that connection you can launch your queries:

from franz.openrdf.query.query import QueryLanguage
q = c.prepareTupleQuery(QueryLanguage.SPARQL,
                        """
                        PREFIX c: <http://cata.log/>
                        SELECT ?cat WHERE {?cat a c:Cat .}
                        """)

try:
    ts = q.evaluate()
    for t in ts:
        print t.getValue("cat")
finally:
    ts.close()

The documentation also advises you to close that connection object at the end. So we will do exactly that:

c.close()

Bulk Loading

Of course it is also possible to load triples from a file (e.g. in RDF/XML or N3 format) and send them to the server:

path    = "/tmp/geo.nt"
baseURI = "http://rho.whatever/"
from franz.openrdf.rio.rdfformat import RDFFormat
c.add(path, base=baseURI, format=RDFFormat.NTRIPLES, contexts=None)

print "Triple count: ", c.size()

A protocol trace with Wireshark shows that the file is parsed locally on the client, its content encoded into JSON, and the result sent to the HTTP server. Quite surprisingly, that is not as slow as one would suspect.
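Roughly, the client-side step amounts to splitting each N-Triples line into its three terms and JSON-encoding the batch. The sketch below is an assumption about the general shape of that step, not the actual wire format (and the naive split would break on literals containing spaces):

```python
import json  # the client uses cjson; the stdlib json module is equivalent

def encode_ntriples(lines):
    """Sketch: turn N-Triples lines into a JSON array of [s, p, o] terms.

    Naive splitting only; a real N-Triples parser must handle
    quoted literals, language tags, and datatypes.
    """
    triples = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        s, p, o = line.rstrip(" .").split(None, 2)
        triples.append([s, p, o])
    return json.dumps(triples)

payload = encode_ntriples([
    '<http://cata.log/sacklpicker> <http://cata.log/hates> <http://cata.log/tomcat> .',
])
print(payload)
```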

Still, you will run into problems once you hit a certain size limit, in my case a meager 500,000 triples:

pycurl.error: (52, 'Empty reply from server')

Obviously, larger files will have to be chunked by the application for the time being, until this is handled by the AllegroGraph Python client.
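Since N-Triples is line-oriented, any line boundary is a safe place to split a file, so the chunking itself is trivial. A minimal sketch (the batch size and the upload() helper are assumptions for illustration):

```python
def chunked(lines, size=100000):
    """Yield successive lists of at most `size` lines.

    N-Triples is line-oriented, so any line boundary is a safe
    split point for uploading a large file in pieces.
    """
    batch = []
    for line in lines:
        batch.append(line)
        if len(batch) >= size:
            yield batch
            batch = []
    if batch:
        yield batch

# Hypothetical usage: write each batch to a temporary file and feed
# that to the connection's add() method, one chunk at a time.
# for batch in chunked(open("/tmp/geo.nt")):
#     upload(batch)   # placeholder for the per-chunk c.add() call
```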

Also notable is that the server grows quite drastically in memory size:

  PID USER    VIRT  RES   COMMAND
13889 rho     730m  367m  ./AllegroGraphServer ....

One problem I ran into was that from then on the store behaved strangely. As soon as I tried to RENEW it with

r = Repository(catalog, "catlitter", Repository.RENEW)

I always received an error:

500 Record number 1118021 is too large for this store. Store size is 1118020.

The only way to get rid of that was to remove the repository manually from the file system, restart the server and repopulate the content. Not pretty.

Bulk Export

I had no problems getting the triples out of the store:

from franz.openrdf.rio.rdfwriter import NTriplesWriter
c.exportStatements(None, RDF.TYPE, Cat, False, NTriplesWriter(None))

So What Now?

Well, my main motivation to track the progress of AllegroGraph is to find a large-scale backend store for my geosemantic information, that is, time series of environmental observations and derived values (virtual sensors).

For that I would need to encode geospatial information. That is covered for Lisp clients and for the Java client. I still know too little about AllegroGraph to see how this can be done in Python, and then ultimately reproduced in a Perl client.
