Abstract
TMQL, the language for querying Topic Map oriented information, is slowly inching towards standardization. This document provides an overview of the language features of the current draft (2007-07-13).
Work on TMQL began a while back, kicked off by a number of proposals. Since then quite some background work has been conducted, notably a reference implementation and a lot of formal research regarding optimization. The language is now considered stable, at least when it comes to its scope and most of its features.
From here onwards, we expect that you are familiar with Topic Maps per se and with how to create maps, whether via the classic XTM notation or others. As TMQL can also be used to generate Topic Map content, there is obviously some functional overlap with CTM, the compact notation for Topic Maps.
In this document we will stick to music as the running use case and
assume that we have a TM data store which contains information about various albums, musicians
and music bands (which are all artists and have persons as members). All these topics are
connected by various associations, such as is-produced-by or
is-part-of.
If you have used SQL before, then you will not be completely puzzled by the following query:
select $album where is-produced-by (production: $album, producer: tom-waits)
This query (technically a query expression) will return all albums
where Tom Waits is known to be a
producer. tom-waits is an identifier of a topic which we happen to know
to uniquely pinpoint that topic about that person in the map we query. The query processor
will try to find an association of type is-produced-by and will check
whether the topic tom-waits is playing the role
producer there. If so, it will bind the variable
$album to the topic playing the role production in
that very same association. During this one matching step, it will keep that binding between
the variable and the value.
It will work through the whole map in this way and collect all these variable bindings. Because we asked for it in the select clause, the query processor will return a list of the bindings for that variable. What exactly is returned to the application will be discussed a bit later.
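The matching and binding steps described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how a TMQL processor has to work: the list-of-tuples map representation, the function name select_albums and all topic identifiers other than tom-waits are assumptions made for this example.

```python
# Hypothetical in-memory stand-in for a topic map: each association is a
# (type, {role: player}) pair. All identifiers except tom-waits are made up.
ASSOCIATIONS = [
    ("is-produced-by", {"production": "album-1", "producer": "tom-waits"}),
    ("is-produced-by", {"production": "album-2", "producer": "someone-else"}),
    ("is-part-of", {"part": "tom-waits", "whole": "some-band"}),
]


def select_albums(associations, producer):
    """Return the list of $album bindings, one per matching association."""
    bindings = []
    for assoc_type, roles in associations:
        if assoc_type != "is-produced-by":
            continue  # wrong association type
        if roles.get("producer") != producer:
            continue  # the given topic does not play the producer role here
        if "production" in roles:
            bindings.append(roles["production"])  # bind $album for this match
    return bindings


print(select_albums(ASSOCIATIONS, "tom-waits"))
```

Each pass through the loop corresponds to one matching step; the returned list is the collection of bindings for the single variable named in the select clause.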
If we wanted to make the query more watertight so that it returns albums only (and not something else that was produced), we have to add another constraint to the WHERE clause:
select $album where is-produced-by (production: $album, producer: tom-waits) & $album isa album
This time, we have ANDed (&) a second boolean expression, using the
special binary predicate isa to check additionally whether the thing
we have bound to $album is also an instance of the class
album, at least according to the map we query. It is worth noting that
isa honors the (transitive) subclass-superclass relationship, i.e. if one
album were an instance of compilation, and
compilation, in turn, were a (direct or indirect) subclass of
album, then these topics would also be bound successfully.
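To make the transitive behaviour of isa concrete, here is a small Python sketch. It assumes, purely for brevity, that every topic has at most one direct type and every class at most one direct superclass; the dictionaries and the identifiers album-1 and album-2 are made up for the example.

```python
# Hypothetical miniature map: direct instance-of and direct subclass-of facts.
INSTANCE_OF = {"album-1": "compilation", "album-2": "album", "tom-waits": "person"}
SUBCLASS_OF = {"compilation": "album"}  # compilation is a subclass of album


def isa(topic, cls):
    """True if topic is an instance of cls or of a direct or indirect subclass of cls."""
    current = INSTANCE_OF.get(topic)
    seen = set()  # guards against cycles in the subclass chain
    while current is not None and current not in seen:
        if current == cls:
            return True
        seen.add(current)
        current = SUBCLASS_OF.get(current)  # climb one level up
    return False


print(isa("album-1", "album"), isa("tom-waits", "album"))
```

The topic album-1 is only directly typed as compilation, yet it passes the album test because the check climbs the subclass chain.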
If we were not fixated on Tom Waits and instead wanted a list of all albums together with their producers, we could extend the wishlist in the SELECT clause:
select $album, $producer where is-produced-by (production: $album, producer: $producer) & $album isa album
Again, the processor will walk through the whole map, find all associations of the given type and bind the role-playing topics to their respective variables. One particular set of bindings now consists of a pair (a tuple of two components) of bindings, one for each variable; all these pairs are collected in a list which is eventually returned.
In the SELECT clauses so far, we asked for whole topics. Any query processor will hand over a complete topic data structure to the application; implementations will most probably choose structures according to TMDM. If an application were interested only in the name of such a topic, it would have to use some API outside the scope of TMQL to extract that name.
To make the TMQL processor perform this step on its own, we can tweak the SELECT clause by adding a path expression:
select $album / name where $album isa album
Now the processor will do the name extraction for us, as we requested with /
name to find all the names for whatever is bound to
$album. Here it also does something not immediately obvious: not only
does it find the name (i.e. the structures holding the string value
for the name together with a scope and a type) but it also reduces such a structure to its
string value, which is exactly what the calling application will receive.
This process of converting a characteristic (be it a name or an occurrence) to a string is called atomification, as literals such as numbers, strings, etc. are referred to as atoms in the specification.
In most cases this is convenient, in others it is not wanted. To see this, let us look at
a topic with a number of names, all with a different type or scope. Here we would actually
get a whole list of names for each individual album. If we were not interested in
all names, but only in those, say, in English (en), then we want to
filter the characteristics by scope first and only then atomify the name:
select $album / name [ @ en ] where $album isa album
Now, if the processor naively atomified all names to strings immediately with
/ name, then the scope information in the name would be
lost. Consequently, filtering afterwards by the scope (or even the name type) would not be
possible. The language therefore stipulates that the processor postpones the
atomification until the end of the path expression. Later we will see that we can
also suppress this behavior completely when it does not suit our purposes.
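Why the order matters can be seen in a short Python sketch. The Name class and the function names_in_scope are hypothetical stand-ins for whatever structures an implementation uses, and the name values are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Name:
    """Hypothetical stand-in for a name characteristic."""
    value: str  # the string value
    scope: str  # e.g. "en" or "de"


def names_in_scope(names, scope):
    """Filter by scope while the items are still structured, atomify last."""
    kept = [n for n in names if n.scope == scope]  # the [ @ en ] filter
    return [n.value for n in kept]  # atomification postponed to the end


album_names = [Name("English title", "en"), Name("Deutscher Titel", "de")]
print(names_in_scope(album_names, "en"))
```

Had we reduced the names to plain strings first, the scope attribute would no longer be available for the filter.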
One problem with selecting names or occurrences is that these may or may not exist for one particular topic. Consider the query expression
select $p / name, $p / homepage
where
$p isa person
In this case we are iterating over all persons in the map and are
listing their names and their homepage occurrence(s). If a person did
not have at least one name or at least one homepage URL, then this person's entry would not show
up in the results at all.
In some cases this is exactly what you need, in others you would want to make sure that there is always something:
select $p / name || "no name", $p / homepage || "no URL"
where
$p isa person
The operator || symbolizes the short-circuit OR, so
the second expression inside the SELECT clause is equivalent to the long-winded:
if $p / homepage then $p / homepage else "no URL".
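The effect of || on a possibly empty list of values can be mimicked in Python as follows. The helper or_else and the sample data are assumptions made for this illustration only.

```python
def or_else(values, fallback):
    """Mimic `values || fallback`: use the fallback only if values is empty."""
    return values if values else [fallback]


# Hypothetical homepage occurrences per person; the second person has none.
HOMEPAGES = {"person-1": ["http://example.org/p1"], "person-2": []}

for person, urls in HOMEPAGES.items():
    print(person, or_else(urls, "no URL"))
```

With the fallback in place every person contributes a row to the result, even when no homepage is known.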
Path expressions are also a convenient way to impose a sorting order on the list of tuples we return:
select $album, $producer where is-produced-by (production: $album, producer: $producer) order by $album / name [ @ en ]
That way we get albums and their producers, but the
whole list becomes a sequence sorted according to the English album title, if one exists.
The ordering can also include more than one criterion, as in
select $album, $producer where is-produced-by (production: $album, producer: $producer) order by $producer / name [ @ en ] desc, $album
Here we first sort the list of topic pairs according to the name of the producer. For the sake of demonstration only, we choose descending order. More importantly though, for one specific producer name (in the English scope) we sort the sublist containing different albums according to the album's identifier. This may not be overly useful by itself, but at least it ensures that the whole returned list always appears in the same order if we keep repeating the same query.
As you would expect, TMQL enables us to make the list of returned tuples unique and to select only slices of the whole result set. That could look like this:
select $album order by $album / name unique offset 10 limit 20
Counting starts from 0.
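As a rough model of what such a query does to the result list, consider the following Python sketch. It assumes that the processor sorts first, then removes duplicates and finally cuts out the slice; this order is our reading of the clause order and should be checked against the draft. The function name slice_results is made up.

```python
def slice_results(rows, offset, limit):
    """Sort, drop duplicates, then return `limit` rows starting at `offset`."""
    ordered = sorted(rows)
    unique = []
    for row in ordered:
        if not unique or unique[-1] != row:  # duplicates are adjacent after sorting
            unique.append(row)
    return unique[offset:offset + limit]  # offset counts from 0


print(slice_results(["b", "a", "a", "c", "d"], offset=1, limit=2))
```

Under these assumptions, offset 10 limit 20 would return the rows at positions 10 to 29 of the sorted, duplicate-free list.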
You may correctly argue that identifying topics by their (internal) map identifier (TMDM calls them item identifiers) is not an immensely robust idea if that identifier may change any time you load a map. Less brittle solutions involve using either subject locators to address a subject directly, or subject identifiers (indicators) for indirect identification.
In the case that we know one subject identifier of the Irish pop group
U2, http://www.u2.com/, you can use that instead of an internal
identifier:
select $album where is-produced-by (production: $album, producer: <http://www.u2.com/> ~ )
The tilde ~ after the URI tells a TMQL processor to find a topic which
uses that URI as subject identifier. A similar notation allows us to specify that some URL is
supposed to be interpreted as subject locator:
select $album where is-produced-by (production: $album, producer: $producer) & covers (theme : $producer, document: <http://www.u2.com/> =)
This time we use the URL of the U2 web site to identify a topic in the queried map which is
using that URL as subject locator. Via an association of type covers we
connect that to one (or more) producer(s) and these with the albums we are looking for.
Using subject identifiers will likely be the most common way to pinpoint a topic. For this purpose, TMQL allows such topic references to be abbreviated to item references. We can find U2's albums, for instance, if their Wikipedia entry is used as subject identifier in our map:
select $album where is-composed-by (opus: $album, creator: http://en.wikipedia.org/wiki/U2 )
Of course, subject identification also works with QNames, not only full URIs:
select $album where is-composed-by (opus: $album, creator: wp:U2 )
The only prerequisite is that the prefix wp is bound to
http://en.wikipedia.org/wiki/. This has to do with namespaces, or more
generally with ontologies which we cover later.
Since the latest TMDM includes features to store arbitrary data (and not just text as in the original XTM-based standard drafts), TMQL has provisions for data literals. The first thing it supports is denoting constants. Here TMQL borrows (stealing is such an ugly word) from RDF/S notations:
select $person / name where $person / age > "18"^^xsd:integer
This query will select names for everyone older than 18.
Writing integers, floats, or even dates and strings in this explicit form is clumsy. Therefore TMQL has adopted a few of these primitive types, specifically integers, booleans, decimals, dates, URIs (actually IRIs) and of course strings, all imported from the XSD (XML Schema data types) namespace. Constants of these types can be written without the explicit typing fluff:
select $person / name
where
$person / age > 1 & # integer
$person / salary <= 100000.0 & # decimal
$person / born >= 1962-06-06T15:30 & # datetime
not ( $person / name == "Bill Gates" | # string
$person / homepage == "http://he.is.so.good.to.us/" ) # URL
As a consequence, some necessary functions for these data types have been imported into TMQL
and are part of the predefined environment. This also includes the comparison functions
which we have used above. Actually, when writing a comparison like $person / age >
18, several things happen: first we extract the age occurrence
from a person. This characteristic will then be silently atomified to its
value. If that value is already an integer, then it will be used as-is. Otherwise TMQL will
try to convert it, which may work only if the value is a string with only digits in it. Then
the two integers will be compared using the predefined function op:numeric-greater-than.
There is another detail worth mentioning here: you may have noticed that a path
expression like $person / born may return more than one characteristic,
or none, depending on how many born occurrences actually exist for the given
topic. The interpretation TMQL chooses here is exists semantics,
i.e. comparisons such as the ones above are true when at least one value combination
can be found that satisfies them.
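Exists semantics is easy to state in Python: a comparison between two lists of values holds as soon as one pair of values satisfies it. The function name exists_greater is made up for this sketch, and the conversion step described above is left out.

```python
def exists_greater(left_values, right_values):
    """True if at least one combination (l, r) satisfies l > r."""
    return any(l > r for l in left_values for r in right_values)


print(exists_greater([17, 42], [18]))  # one age occurrence is large enough
print(exists_greater([], [18]))        # no occurrence at all, so no match
```

Note that a topic without any matching occurrence can never satisfy such a comparison, since there is no value combination to test.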
In the queries so far we made use of association predicates inside the WHERE clause. Such an association predicate
is-produced-by (production : $album, producer: $producer)
makes the processor try to find matching associations in the queried map, i.e. associations
which are of type is-produced-by and have exactly two roles, one for
production and one for producer. If an association in
the map has a third role, say, location to capture where an album has
been produced, then such an association would never match the predicate.
To allow for such associations with additional roles to match, TMQL provides the ellipsis:
select $album where is-produced-by (production : $album, producer: $whoever, ...)
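The difference between the strict form and the form with the ellipsis comes down to set equality versus set inclusion on the role types. The following Python sketch shows only that aspect; the function roles_match and its open_ended flag are hypothetical, and subtypes of association or role types are ignored here.

```python
def roles_match(assoc_roles, wanted_roles, open_ended=False):
    """Compare the role types of an association with those in the predicate."""
    have = set(assoc_roles)
    want = set(wanted_roles)
    if open_ended:           # predicate written with a trailing ...
        return want <= have  # additional roles are tolerated
    return want == have      # exactly the listed roles, nothing more


three_roles = ["production", "producer", "location"]
print(roles_match(three_roles, ["production", "producer"]))
print(roles_match(three_roles, ["production", "producer"], open_ended=True))
```

The association with the extra location role is rejected by the strict comparison and accepted by the open-ended one.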
Association predicates actually have more implicit meaning than is obvious at first
sight. If, for example, the map contained an association of type
is-remastered-by which also connects an album with a producer, and
is-remastered-by were a subtype of is-produced-by, then
such associations would also match the predicate.
Honoring subclassing also applies to roles. If our queried map had an association of type
is-remastered-by in which the role (type) for the album is not
production, but a subclass remastering, such an
association would also match the association predicate.
If you do not care about the role type itself, you can use a wildcard:
select $album where is-produced-by (production : $album, * : $whoever, ...)
The * is the shorthand for tm:subject, which stands
for anything which is in a map.
Writing these association predicates to test for the connection between topics is precise, but often overkill. Sometimes it is easier to simplify an association predicate into a path expression.
The association predicate
select $album where is-produced-by (production : $album, * : $whoever, ...)
can also be rephrased as
select $album where $album <- production [ ^ is-produced-by ] -> producer == $whoever
Here we used $album as the starting point and looked for its involvements in
associations where it plays the role production. All these associations are then
filtered by their type because we are only interested in those of type
is-produced-by. Once we have this list of associations we find all
producers and then compare each of these with whatever happens to be bound to
$whoever.
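The three navigation steps of this path expression map naturally onto three list operations. The Python sketch below is one way to picture them; the map representation and all identifiers except tom-waits are assumptions for the example.

```python
# Hypothetical map: each association is a (type, {role: player}) pair.
ASSOCS = [
    ("is-produced-by", {"production": "album-1", "producer": "tom-waits"}),
    ("is-reviewed-by", {"production": "album-1", "reviewer": "critic-1"}),
    ("is-produced-by", {"production": "album-2", "producer": "someone-else"}),
]


def producers_of(associations, album):
    """Follow album <- production [ ^ is-produced-by ] -> producer."""
    # <- production: associations in which the album plays the role production
    involved = [a for a in associations if a[1].get("production") == album]
    # [ ^ is-produced-by ]: keep only associations of that type
    typed = [a for a in involved if a[0] == "is-produced-by"]
    # -> producer: collect the players of the role producer
    return [a[1]["producer"] for a in typed if "producer" in a[1]]


print(producers_of(ASSOCS, "album-1"))
```

The final comparison with $whoever then only has to check whether the bound topic occurs in the returned list.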
But since we never actually were interested in the producer in the first place, without loss of precision we can write:
select $album where $album <- production [ ^ is-produced-by ]
Further, if you do not insist on the association type itself, or if you
know that the role production can only appear in
associations of this type, then the whole filtering is unnecessary:
select $album where $album <- production
On some occasions you will have to test whether particular things exist in a map or whether all things in a certain set have a particular property. For illustration, let us ask for all music groups in our map which have at least one female group member:
select $group
where
$group isa group
& some $person in $group <- whole -> member
satisfies
$person isa female
While we iterate over all groups in the map, we find for each such group all members using
the path expression $group <- whole -> member. If at least one of them satisfies
the condition that it is an instance of female, then the existential SOME
clause is satisfied.
Conversely, we might be interested to find all boy groups, well, at least those groups where all members are male:
select $group
where
$group isa group
& every $person in $group <- whole -> member
satisfies
$person isa male
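The two quantifiers correspond closely to Python's any() and all(). The following sketch uses made-up group and person identifiers and a flat lookup table in place of the isa test.

```python
# Hypothetical data: group members and the class each person is an instance of.
MEMBERS = {"group-1": ["p1", "p2"], "group-2": ["p3"]}
CLASS_OF = {"p1": "female", "p2": "male", "p3": "male"}


def groups_with_some(cls):
    """Groups where at least one member is an instance of cls (SOME)."""
    return [g for g, ps in MEMBERS.items() if any(CLASS_OF[p] == cls for p in ps)]


def groups_with_every(cls):
    """Groups where all members are instances of cls (EVERY)."""
    return [g for g, ps in MEMBERS.items() if all(CLASS_OF[p] == cls for p in ps)]


print(groups_with_some("female"), groups_with_every("male"))
```

One caveat: Python's all() is true for an empty member list. Whether a TMQL processor treats a group without any members the same way is not stated here and would have to be checked against the draft.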
The textual overhead of the SQLish style which we have used so far may not be convenient if queries are trivial. Especially for web applications, where pages have to be filled with lots of content from a TM backend, a much shorter notation is more adequate.
To return all albums from the currently queried map we can simply write
// album
If we needed the English names only, then
// album / name [ @ en ]
will do just fine.
We can also, for instance, reformulate as a path expression the earlier query which returned only the English titles of Tom Waits albums (line wrapping just for presentation):
// album [
. <- production
[ ^ is-produced-by ]
-> producer == tom-waits
]
/ name [ @ en ]
The processor will again start off with all albums and will subject each of them to the test
provided by the filter condition in the first [] group. That will
effectively test whether Tom Waits is one of the producers. This is accomplished by taking
each album (the dot . symbolizes the current item), finding all
associations where this item plays the role production, and filtering
these associations so that only those remain which are of type
is-produced-by; when following the role producer from
these associations we end up with a list of topics playing that role. This list will be
compared with the list on the right side of the == symbol. Since in our
case that list has only one element, and a constant one at that, the comparison is trivial and returns
true if one of the producers happens to be the same topic as the one with the identifier
tom-waits.
Only those albums for which the condition is satisfied will be postprocessed, in that the English name is selected from them. At the end, the name items are converted into their string values.
As we have already seen, the different language flavours, select and path expressions, can be mixed. Not so obvious is the fact that both styles are (almost) equivalent in terms of expressivity; every select query expression can in fact be transformed into an equivalent path expression. As path expressions cannot introduce new variables, they can become quite contrived when they get more complex. It is up to the developer to choose the most appropriate combination of styles on a case-by-case basis.
Both these styles allow us to return sequences of tuples of things to the application. This may be exactly along your line of thinking, but it does not help if you want to embed the query results into an XML application server or a batch-oriented content pipeline. To spare developers from writing their own template engines, TMQL allows us to create XML content using a third flavour which, as you may have guessed, is otherwise equivalent to the other styles. This flavour, FLWR, is inspired by XQuery and uses RETURN clauses to specify the output:
return
<albums>{
for $a in // album return
<album>{$a / name [@ en ]}</album>
}
</albums>
The RETURN clause above will create one XML fragment, here with a root element
<albums>. Nested into that will be all albums in the map, so that an
overall result may look like this:
<albums>
<album>Colorblind</album>
<album>Even_The_Waves</album>
<album>Undertow</album>
...
</albums>
The way this is achieved in the query expression is by iterating over albums in a FOR loop.
It uses a path expression // album to compute first all instances of
albums in our map. Each such album is bound, one by one, to the iteration variable
$a and with such a new binding, the body of the loop is evaluated.
Such a body itself is defined by a nested RETURN clause. It contains an element
<album> which wraps text content which we specify with the
embedded TMQL path expression $a / name [@ en ]. Like in XQuery, XML
content and query text are separated using {} brackets.
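For readers who think more easily in a general-purpose language, the loop and the nested RETURN clause behave roughly like the following Python sketch. The function albums_xml is made up, and the escaping of special characters is an assumption on our part; how a TMQL processor serializes such characters is not covered here.

```python
from xml.sax.saxutils import escape


def albums_xml(names):
    """Wrap each album name in <album> and the whole list in <albums>."""
    # inner RETURN clause: one <album> element per loop iteration
    items = "".join(f"<album>{escape(name)}</album>" for name in names)
    # outer RETURN clause: the root element around all iterations
    return f"<albums>{items}</albums>"


print(albums_xml(["Colorblind", "Undertow"]))
```

The generator expression plays the role of the FOR loop, with each name taking the place of the evaluated path expression $a / name [@ en ].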
With TMQL we can also generate topic maps, either completely from scratch or by using information from the queried map. This allows us to transform information from one vocabulary into another (ontology mediation).
In the following example we iterate first over all artists. In the inner loop we choose to use a path expression to compute all albums a particular artist has produced. The RETURN clause then generates for each such album a Topic Map fragment.
for $artist in // artist
for $album in $artist <- producer -> production
return """
*a
title : { $album / name }
artist : { $artist / name }
has-label (label: evil-rhocords, album: *a)
"""
The template part within the """'s first generates a topic with a title and an
artist occurrence. The *a is not a TMQL feature, but rather a CTM feature: it
makes sure that the new topic gets a new item identifier. The reason that we do not use a
wildcard *, but rather a named wildcard, is that we
need the very same identifier again in the association below.
Directly after the identifier part we add the two occurrences to this topic: one is of type
title and has the current album name as value; the other is of type
artist and has the artist's name as value.
For each such album we also add a new association: it is of type
has-label and connects that very album with a fictitious record label.
As this template is expanded for every album, the individual expansions are all merged into one, possibly big, topic map. As we did not say much about the record label we might want to do so, and combine this information with our result map:
return """
evil-rhocords
- name: Evil ρcords
address: 666, Hell Road, Sixth Circle City
{
# here our original query goes
}
"""