TMQL Overview (31.3.2008)

Robert Barta


  

Abstract

TMQL, the query language for Topic Map oriented information, is slowly inching towards standardization. This document provides an overview of the language features of the current draft (2007-07-13).


Table of Contents

Introduction
Setting Off
Controlling What is Returned
Path Expressions in SELECT
Default Values
Path Expressions for Ordering
Identifying Things
Predefined Datatypes and Functions
Association Predicates
Association Predicates and Path Expressions
Using Exist and All Quantification
Standalone Path Expressions
Chocolate, Vanilla, Caramel

Introduction

Work on TMQL began a while back, kicked off by a number of proposals. Quite some background work has been conducted since, including a reference implementation and a lot of formal research regarding optimization. The language is now considered stable, at least when it comes to its scope and most of its features.

From here onwards, we expect that you are familiar with Topic Maps per se and with how to create maps, whether via the classic notation XTM or others. As TMQL can also be used to generate Topic Map content, there is obviously some functional overlap with CTM, the compact notation for Topic Maps.

In this document we will stick to music as the running use case and assume that we have a TM data store which contains information about various albums, musicians and music bands (which are all artists and have persons as members). All these topics are connected by various associations, such as is-produced-by or is-part-of.

Setting Off

If you have used SQL before, then you will not be completely puzzled by the following query:

select $album
where
   is-produced-by (production: $album, producer: tom-waits)

This query (technically a query expression) will return all albums where Tom Waits is known to be a producer. tom-waits is a topic identifier which we happen to know uniquely pinpoints the topic about that person in the map we query. The query processor will try to find an association of type is-produced-by and will check whether the topic tom-waits is playing the role producer there. If so, it will bind the variable $album to the topic playing the role production in that very same association. During this one matching step, it will keep that binding between the variable and the value.

In this way it works through the whole map and collects all these variable bindings. Because we asked for it in the select clause, the query processor will return a list of the bindings for that variable. What exactly is returned to the application will be discussed a bit later.
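
The matching step described above can be sketched in Python. This is only an illustration under simplifying assumptions, not how any TMQL processor is implemented: the in-memory associations list, the match helper and the topic identifiers album-1, album-2, album-3 and someone-else are made up for this example, and subtyping is ignored.

```python
# Hypothetical in-memory model: an association is a type plus a
# mapping from role type to the topic playing that role.
associations = [
    {"type": "is-produced-by",
     "roles": {"production": "album-1", "producer": "tom-waits"}},
    {"type": "is-produced-by",
     "roles": {"production": "album-2", "producer": "tom-waits"}},
    {"type": "is-produced-by",
     "roles": {"production": "album-3", "producer": "someone-else"}},
    {"type": "is-part-of",
     "roles": {"part": "tom-waits", "whole": "some-band"}},
]


def match(assoc_type, pattern, assocs):
    """Yield one binding dict per association that matches the pattern.

    Pattern values starting with '$' are variables; any other value is a
    constant topic identifier that must equal the role player.
    """
    for assoc in assocs:
        if assoc["type"] != assoc_type:
            continue
        if set(assoc["roles"]) != set(pattern):
            continue
        binding = {}
        for role, wanted in pattern.items():
            player = assoc["roles"][role]
            if wanted.startswith("$"):
                binding[wanted] = player   # bind the variable
            elif wanted != player:
                break                      # constant does not match
        else:
            yield binding


bindings = list(match("is-produced-by",
                      {"production": "$album", "producer": "tom-waits"},
                      associations))
# What "select $album" would hand back in this toy model.
albums = [b["$album"] for b in bindings]
```

Each yielded dictionary corresponds to one set of variable bindings; the final list comprehension plays the part of the select clause.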

If we wanted to make the query more watertight so that it returns albums only (and not anything else that was produced), we would have to add another constraint to the WHERE clause:

select $album
where
   is-produced-by (production: $album, producer: tom-waits)
 & $album isa album

This time, we have ANDed (&) a second boolean expression, using the special binary predicate isa to additionally check whether the thing we have bound to $album is also an instance of the class album, at least according to the map we query. It is worth noting that isa honors the (transitive) subclass-superclass relationship: if one album were an instance of compilation, and compilation, in turn, were a (direct or indirect) subclass of album, then these topics would also be successfully bound.
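
The transitive behaviour of isa can be illustrated with a small Python sketch. The dictionaries instance_of and subclass_of, the helper names and the topic identifiers are assumptions made for this example only; a real map would hold this information in associations.

```python
# Hypothetical sample data: direct types of topics and direct superclasses.
instance_of = {
    "best-of-1": {"compilation"},
    "album-1": {"album"},
    "tom-waits": {"person"},
}
subclass_of = {
    "compilation": {"album"},
}


def superclasses(cls):
    """Return cls together with all its direct and indirect superclasses."""
    seen = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(subclass_of.get(current, ()))
    return seen


def isa(topic, cls):
    """True if topic is a direct or indirect instance of cls."""
    return any(cls in superclasses(t) for t in instance_of.get(topic, ()))
```

With this data, isa("best-of-1", "album") holds even though the topic is only declared to be a compilation.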

If we were not fixated on Tom Waits and instead wanted a list of all albums together with their producers, we could extend the wishlist in the SELECT clause:

select $album, $producer
where
   is-produced-by (production: $album, producer: $producer)
 & $album isa album

Again, the processor will walk through the whole map, find all associations of the given type and bind the role-playing topics to their respective variables. One particular set of bindings now consists of a pair (a tuple of two components) of bindings, one for each variable; all these pairs are collected in a list which is eventually returned.

Controlling What is Returned

In the SELECT clauses so far, we asked for whole topics. Any query processor will hand over a complete topic data structure to the application; implementations will most probably choose structures according to TMDM. If an application were interested only in the name of such a topic, it would have to use some API outside the scope of TMQL to extract that name.

Path Expressions in SELECT

To make the TMQL processor perform this step on its own, we can tweak the SELECT clause by adding a path expression:

select $album / name
where
   $album isa album

Now the processor will do the name extraction for us: with / name we asked it to find all the names for whatever is bound to $album. Here it also does something not immediately obvious. Not only does it find the name (i.e. the structure holding the string value for the name together with a scope and a type), it also reduces such a structure to its string value, which is exactly what the calling application will receive.

This process of converting a characteristic (be it a name or an occurrence) to a string is called atomification, because literals such as numbers, strings, etc. are referred to as atoms in the specification.

In most cases this is convenient; in others it is not wanted. To see this, let us look at a topic with a number of names, all with different types or scopes. For such a topic we would get a whole list of names for each individual album. If we were not interested in all names, but only in those in, say, English (en), then we want to filter the characteristics by scope first and only then atomify the name:

select $album / name [ @ en ]
where
   $album isa album

Now, if the processor naively atomified all names to strings immediately at / name, the scope information in the name would be lost. Consequently, filtering afterwards by the scope (or even the name type) would not be possible. The language therefore stipulates that the processor postpones atomification until the end of the path expression. Later we will see that we can also suppress this behavior completely when it does not suit our purposes.
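
Why the order matters can be shown with a short Python sketch. The Name class, the helper functions and the sample values are invented for this illustration, and the scope is simplified to a single topic identifier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Name:
    value: str   # the string an application eventually receives
    scope: str   # simplified: one scoping topic, e.g. "en"


# Hypothetical names of one album topic.
names = {
    "album-1": [Name("Title (en)", "en"), Name("Titel (de)", "de")],
}


def names_of(topic):
    """Models '/ name': returns name items, not yet strings."""
    return names.get(topic, [])


def in_scope(items, scope):
    """Models '[ @ scope ]': keeps only items in the given scope."""
    return [item for item in items if item.scope == scope]


def atomify(items):
    """Reduces name items to their string values."""
    return [item.value for item in items]


# Postponed atomification: filter first, convert to strings last.
english = atomify(in_scope(names_of("album-1"), "en"))

# Eager atomification yields bare strings; no scope is left to filter on.
eager = atomify(names_of("album-1"))
```

After eager atomification the list holds plain strings only, so a scope filter has nothing to inspect.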

Default Values

One problem with selecting names or occurrences is that these may or may not exist for one particular topic. Consider the query expression

select $p / name, $p / homepage
  where
    $p isa person

In this case we are iterating over all persons in the map and are listing their names and their homepage occurrence(s). If a person had no name at all, or no homepage URL at all, then this person's entry would not show up in the results.

In some cases this is exactly what you need; in others you want to make sure that there is always something:

select $p / name || "no name", $p / homepage || "no URL"
  where
    $p isa person

The operator || symbolizes the short-circuit OR, so the second expression inside the SELECT clause is equivalent to the long-winded: if $p / homepage then $p / homepage else "no URL".
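
The effect of a default value can be sketched in Python. The sketch assumes that result rows are formed as combinations of the values in each column, which is one way to explain why a person without a homepage disappears from the result; the persons dictionary and the helper or_default are hypothetical.

```python
from itertools import product

# Hypothetical data: every person has a list of names and of homepages.
persons = {
    "p1": {"name": ["Alice"], "homepage": ["http://example.org/alice"]},
    "p2": {"name": ["Bob"], "homepage": []},
}


def or_default(values, default):
    """Models 'expr || default': use the default only if expr is empty."""
    return values if values else [default]


# Without defaults: an empty column yields no rows for that person.
without_default = [
    row
    for p in persons.values()
    for row in product(p["name"], p["homepage"])
]

# With defaults: every person contributes at least one row.
with_default = [
    row
    for p in persons.values()
    for row in product(or_default(p["name"], "no name"),
                       or_default(p["homepage"], "no URL"))
]
```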

Path Expressions for Ordering

Path expressions are also a convenient way to impose a sorting order on the list of tuples we return:

select $album, $producer
where
   is-produced-by (production: $album, producer: $producer)
order by
   $album / name [ @ en ]

That way we get albums and their producers, but the whole list becomes a sequence sorted according to the English album title, if such a title exists.

The ordering can also include more than one criterion, as in

select $album, $producer
where
   is-produced-by (production: $album, producer: $producer)
order by
   $producer / name [ @ en ] desc, $album

Here we first sort the list of topic pairs according to the name of the producer. For the sake of demonstration only, we choose descending order. More importantly though, for one specific producer name (in the English scope) we sort the sublist containing different albums according to the album's identifier. This may not by itself be overly useful, but at least it ensures that the whole returned list always appears in the same order if we keep repeating the same query.
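
Sorting by several criteria with mixed directions can be mimicked in Python with two stable sorts, applied from the least significant criterion to the most significant one. The sample pairs of album identifier and producer name are invented for this sketch.

```python
# Hypothetical rows: (album identifier, English producer name).
pairs = [
    ("album-b", "Zed"),
    ("album-a", "Zed"),
    ("album-c", "Abe"),
]

# Python's sort is stable, so sort by the secondary criterion first ...
ordered = sorted(pairs, key=lambda pair: pair[0])
# ... and then by the primary criterion, here in descending order.
ordered = sorted(ordered, key=lambda pair: pair[1], reverse=True)
```

Rows with the same producer name keep the album order established by the first sort, which is what makes the overall result deterministic.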

As you would expect, TMQL enables us to make the list of returned tuples unique and to select only slices of the whole result set. That could look like this:

select $album
order by
   $album / name
unique offset 10 limit 20

Counting starts from 0.
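
A rough Python analogue of unique, offset and limit follows. It assumes that duplicates are removed before the slice is taken, which is an assumption of this sketch and not something stated above; the function name and sample rows are hypothetical.

```python
def unique_offset_limit(rows, offset, limit):
    """Drop duplicate rows (keeping first occurrences), then slice.

    The offset is counted from 0, as in the text above.
    """
    seen = set()
    distinct = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            distinct.append(row)
    return distinct[offset:offset + limit]


rows = ["a", "a", "b", "c", "d"]
page = unique_offset_limit(rows, offset=1, limit=2)
```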

Identifying Things

You may correctly argue that identifying topics by their (internal) map identifier (the TMDM model calls them item identifiers) is not an immensely robust idea if that identifier may change any time you load a map. Less brittle solutions involve using either subject locators to directly address a subject, or subject identifiers (indicators) for indirect identification.

If we know one subject identifier of the Irish pop group U2, http://www.u2.com/, we can use that instead of an internal identifier:

select $album
where
   is-produced-by (production: $album, producer: <http://www.u2.com/> ~ )

The tilde ~ after the URI tells a TMQL processor to find a topic which uses that URI as subject identifier. A similar notation allows us to specify that some URL is supposed to be interpreted as subject locator:

select $album
where
   is-produced-by (production: $album, producer: $producer) &
   covers         (theme  : $producer, document: <http://www.u2.com/> =)

This time we use the URL of the U2 web site to identify a topic in the queried map which is using that URL as subject locator. Via an association of type covers we connect that to one (or more) producer(s) and these with the albums we are looking for.

Using subject identifiers will likely be the most common way to pinpoint a topic. For this purpose, TMQL allows such topic references to be abbreviated. We can find U2's albums, for instance, if their Wikipedia entry is used as subject identifier in our map:

select $album
where
   is-composed-by (opus: $album, creator: http://en.wikipedia.org/wiki/U2 )

Of course, subject identification also works with QNames, not only full URIs:

select $album
where
   is-composed-by (opus: $album, creator: wp:U2 )

The only prerequisite is that the prefix wp is bound to http://en.wikipedia.org/wiki/. This has to do with namespaces, or more generally with ontologies which we cover later.

Predefined Datatypes and Functions

Since the latest TMDM includes features to store arbitrary data (and not just text as in the original XTM-based standard drafts), TMQL has provisions for data literals. The first thing it supports is denoting constants. Here TMQL borrows (stealing is such an ugly word) from RDF/S notations:

select $person / name
where
   $person / age > "18"^^xsd:integer

This query will select names for everyone older than 18.

Writing integers, floats, or even dates and strings in this explicit form is clumsy. Therefore TMQL has adopted a few of these primitive types, specifically integers, booleans, decimals, dates, URIs (actually IRIs) and of course strings, all imported from the XSD (XML Schema datatypes) namespace. Constants of these types can be written without the explicit typing fluff:

select $person / name
where
       $person / age  >  1                 &  # integer
       $person / salary <= 100000.0        &  # decimal
       $person / born >= 1962-06-06T15:30  &  # datetime
   not ( $person / name     == "Bill Gates" | # string
         $person / homepage == "http://he.is.so.good.to.us/" ) # URL

As a consequence, some necessary functions for these data types have been imported into TMQL and are part of the predefined environment. This also includes the comparison functions which we have used above. Actually, when writing a comparison like $person / age > 18, several things happen: first we extract the age occurrence from a person. This characteristic will then be silently atomified to its value. If that value is already an integer, then it will be used as-is. Otherwise TMQL will try to convert it, which may work only if the value is a string with only digits in it. Then the two integers will be compared using the predefined function op:numeric-greater-than.

There is another detail worth mentioning here: you may have noted that a path expression like $person / born may return more than one characteristic, or none, depending on how many born occurrences actually exist for the given topic. The interpretation TMQL chooses here is exists semantics, i.e. comparisons such as the ones above are true when at least one matching value combination can be found.
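
Exists semantics can be expressed compactly in Python. The helper exists_compare is a name invented for this sketch; it treats both sides of a comparison as lists of already atomified values.

```python
import operator
from itertools import product


def exists_compare(left, right, op):
    """True if at least one pair (a, b) from left x right satisfies op."""
    return any(op(a, b) for a, b in product(left, right))


# A person with two age values, compared against the constant 18.
adult = exists_compare([17, 42], [18], operator.gt)

# No age occurrence at all: no combination exists, so the test fails.
unknown = exists_compare([], [18], operator.gt)
```

A consequence worth keeping in mind is that a topic with no value at all never satisfies such a comparison.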

Association Predicates

In the queries so far we made use of association predicates inside the WHERE clause. Such an association predicate

   is-produced-by (production : $album, producer: $producer)

makes the processor try to find matching associations in the queried map, i.e. associations which are of type is-produced-by and have exactly two roles, one for production and one for producer. If an association in the map has a third role, say, location to capture where an album has been produced, then such an association would never match the predicate.

To allow for such associations with additional roles to match, TMQL provides the ellipsis:

select $album
where
   is-produced-by (production : $album, producer: $whoever, ...)
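
The difference between the strict form and the form with an ellipsis comes down to comparing sets of role types. The following Python sketch uses a hypothetical helper roles_match and ignores subtyping of roles.

```python
def roles_match(assoc_roles, wanted_roles, open_ended=False):
    """Compare the role types of an association with those in a predicate.

    open_ended=False models the strict form (exactly these roles);
    open_ended=True models the form ending in '...' (at least these roles).
    """
    have = set(assoc_roles)
    want = set(wanted_roles)
    return want <= have if open_ended else want == have


# Hypothetical association with a third role, location.
three_roles = {"production": "album-1",
               "producer": "tom-waits",
               "location": "some-studio"}

strict = roles_match(three_roles, ["production", "producer"])
relaxed = roles_match(three_roles, ["production", "producer"],
                      open_ended=True)
```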

Association predicates actually have more implicit meaning than is obvious at first sight. If, for example, the map contained an association of type is-remastered-by which also connects an album with a producer, and is-remastered-by is a subtype of is-produced-by, then such associations would also match the predicate.

Honoring subclassing also applies to roles. If our queried map had an association of type is-remastered-by in which the role (type) for the album is not production but a subclass of it, remastering, such an association would also match the association predicate.

If you do not care about the role type itself, you can use a wildcard:

select $album
where
   is-produced-by (production : $album, * : $whoever, ...)

The * is the shorthand for tm:subject, which stands for anything which is in a map.

Association Predicates and Path Expressions

Writing these association predicates to test for the connection between topics is precise, but often overkill. Sometimes it is easier to simplify an association predicate into a path expression.

The association predicate

select $album
where
   is-produced-by (production : $album, * : $whoever, ...)

can also be rephrased as

select $album
where
   $album <- production [ ^ is-produced-by ] -> producer == $whoever

Here we used $album as the starting point and looked for its involvements in associations where it plays the role production. All these associations are then filtered by their type because we are only interested in those of type is-produced-by. Once we have this list of associations we find all producers and then compare each of these with whatever happens to be bound to $whoever.

But since we never actually were interested in the producer in the first place, without loss of precision we can write:

select $album
where
   $album <- production [ ^ is-produced-by ]

Further, if you do not insist on the association type itself, or if you know that the role production can only appear in associations of this type, then the whole filtering is unnecessary:

select $album
where
   $album <- production

Using Exist and All Quantification

On some occasions you will have to test whether particular things exist in a map or whether all things in a certain set have a particular property. For illustration, let us ask for all music groups in our map which have at least one female group member:

select $group
where
   $group isa group
 & some $person in $group <- whole -> member
         satisfies
             $person isa female

While we iterate over all groups in the map, we find for each such group all members using the path expression $group <- whole -> member. If at least one of them satisfies the condition that it is an instance of female, then the existential SOME clause is satisfied.

Conversely, we might be interested to find all boy groups, well, at least those groups where all members are male:

select $group
where
   $group isa group
 & every $person in $group <- whole -> member
         satisfies
             $person isa male
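
Both quantifiers have close relatives in Python's any and all. The sketch below uses made-up group and person identifiers and a plain dictionary in place of the path expression $group <- whole -> member.

```python
# Hypothetical data: members per group and direct types per person.
members = {
    "group-1": ["p1", "p2"],
    "group-2": ["p3"],
}
types = {
    "p1": {"female"},
    "p2": {"male"},
    "p3": {"male"},
}

# some $person in ... satisfies $person isa female
with_female = [
    group for group, persons in members.items()
    if any("female" in types[p] for p in persons)
]

# every $person in ... satisfies $person isa male
all_male = [
    group for group, persons in members.items()
    if all("male" in types[p] for p in persons)
]
```

Note that Python's all is true for an empty list; whether a group without any members satisfies a TMQL every clause in the same way is not covered here.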

Standalone Path Expressions

The textual overhead of the SQLish style which we have used so far may not be convenient if queries are trivial. Especially for web applications, where pages have to be filled with lots of content from a TM backend, a much shorter notation is more adequate.

To return all albums from the currently queried map we can simply write

// album

If we needed the English names only, then

// album / name [ @ en ]

will do just fine.

We can also, for instance, use a path expression to find only the English titles of Tom Waits albums (line wrapping just for presentation):

// album [ 
           . <- production 
                   [ ^ is-produced-by ] 
                   -> producer == tom-waits 
         ]
           / name [ @ en ]

The processor will again start off with all albums and will subject each of them to the test provided by the filter condition in the first [] group. That will effectively test whether Tom Waits is one of the producers. This is accomplished by taking each album (the dot . symbolizes the current item), checking out all associations where this item is playing the role production, and filtering these associations so that only those of type is-produced-by remain; when following the role producer from these associations we end up with a list of topics playing that role. This list will be compared with the list on the right side of the == symbol. Since in our case there is only one topic on the right side, and it is a constant anyway, this comparison is trivial and returns true if one of the producers happens to be the same topic as the one with the identifier tom-waits.

Only those albums for which the condition is satisfied will be postprocessed, in that the English name is selected from them. At the end, the name items are converted into their string values.
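
The evaluation order just described can be traced in a small Python sketch. All data and helper names are invented for the illustration, subtyping is ignored, and names are simplified to pairs of string value and scope.

```python
# Hypothetical map content.
albums = ["album-1", "album-2"]          # result of '// album'
associations = [
    {"type": "is-produced-by",
     "roles": {"production": "album-1", "producer": "tom-waits"}},
    {"type": "is-produced-by",
     "roles": {"production": "album-2", "producer": "someone-else"}},
]
names = {
    "album-1": [("Title One", "en"), ("Titel Eins", "de")],
    "album-2": [("Title Two", "en")],
}


def producers_of(album):
    """Models '. <- production [ ^ is-produced-by ] -> producer'."""
    return [
        assoc["roles"]["producer"]
        for assoc in associations
        if assoc["type"] == "is-produced-by"
        and assoc["roles"].get("production") == album
        and "producer" in assoc["roles"]
    ]


result = [
    value
    for album in albums
    if "tom-waits" in producers_of(album)   # the filter in the first []
    for value, scope in names[album]        # '/ name'
    if scope == "en"                        # '[ @ en ]'
]
# Only the string values remain, mirroring the final atomification.
```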

Chocolate, Vanilla, Caramel

As we have already seen, the different language flavours, select and path expressions, can be mixed. Not so obvious is the fact that both styles are (almost) equivalent in terms of expressivity; every select query expression can in fact be transformed into an equivalent path expression. As path expressions cannot introduce new variables, they can become quite contrived when they get more complex. It is up to the developer to choose the most appropriate combination of styles on a case by case basis.

Both these styles return sequences of tuples of things to the application. This may be exactly along your line of thinking, but it does not help if you want to embed the query results into an XML application server or a batch-oriented content pipeline. So that developers do not have to write their own template engines, TMQL allows XML content to be created using a third flavour, which (you may have guessed it) is otherwise equivalent to the other styles. This flavour, FLWR, is inspired by XQuery and uses RETURN clauses to specify the output:

return
   <albums>{
     for $a in // album return
         <album>{$a / name [@ en ]}</album>
   }
   </albums>

The RETURN clause above will create one XML fragment, here with a root element <albums>. Nested into that will be all albums in the map, so that an overall result may look like this:

   <albums>
         <album>Colorblind</album>
         <album>Even_The_Waves</album>
         <album>Undertow</album>
         ...
   </albums>

The way this is achieved in the query expression is by iterating over albums in a FOR loop. It uses a path expression // album to compute first all instances of albums in our map. Each such album is bound, one by one, to the iteration variable $a and with such a new binding, the body of the loop is evaluated.

Such a body itself is defined by a nested RETURN clause. It contains an element <album> which wraps text content which we specify with the embedded TMQL path expression $a / name [@ en ]. As in XQuery, XML content and query text are separated using {} brackets.
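
For comparison, the same nesting of a loop inside an enclosing element can be written by hand in Python with the standard xml.etree.ElementTree module. The list album_names stands in for the result of $a / name [@ en ] over all albums and reuses the sample titles shown above; this is what an application would have to do without the FLWR flavour, not a description of how a TMQL processor works.

```python
import xml.etree.ElementTree as ET

# Stand-in for the English names of all albums in the map.
album_names = ["Colorblind", "Even_The_Waves", "Undertow"]

root = ET.Element("albums")                   # the outer <albums> element
for name in album_names:                      # the FOR loop over albums
    ET.SubElement(root, "album").text = name  # the nested RETURN body

xml_text = ET.tostring(root, encoding="unicode")
```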

With TMQL we can also generate topic maps, either completely from scratch or by using information from the queried map. This allows us to transform information from one vocabulary into another (ontology mediation).

In the following example we iterate first over all artists. In the inner loop we choose to use a path expression to compute all albums a particular artist has produced. The RETURN clause then generates for each such album a Topic Map fragment.

for $artist in // artist
  for $album in $artist <- producer -> production
  return """

   *a
   title  : { $album / name }
   artist : { $artist / name }

   has-label (label: evil-rhocords, album: *a)

"""

The template part between the """ delimiters first generates a topic with a title and an artist occurrence. The *a is not a TMQL feature but a CTM one: it makes sure that the new topic gets a new item identifier. The reason we use a named wildcard rather than a plain wildcard * is that we need the very same identifier again in the association below.

Directly after the identifier part we add the two occurrences to this topic: one is of type title and has the current album name as value; the other is of type artist and has the artist's name as value.

Also, for each such album we add a new association: it is of type has-label and connects that very album with a fictitious record label.

As this template is expanded for every album, the individual expansions are all merged into one, possibly big, topic map. As we did not say much about the record label, we might want to do so now and combine this information with our result map:

return """

   evil-rhocords
   - name: Evil ρcords
   address: 666, Hell Road, Sixth Circle City

{
 # here our original query goes
}
"""