Abstract
This document covers most of the relevant TMQL language features and their design rationale. THIS IS WORK IN PROGRESS.
TMQL is an expression language for extracting content from topic maps or, more generally, from any backend store which is organized along the Topic Maps paradigm.
Like any other query language, TMQL has two components: one for detecting certain patterns in
the data and a second one in which the result is computed. For the detection part TMQL offers not
only pure pattern matching, such as finding all the people who are involved in an
is-married-with association, but also navigation axes along
which further information can be explored.
On the output side you not only can produce tabular data ― TMQL calls these tuple sequences ― but also XML content and Topic Maps content. Because of the latter TMQL and CTM have quite some functional overlap and care has been taken that the two languages can cooperate nicely.
Implementations will of course provide an API for applications to invoke queries on a particular map (or a set of maps). For demonstration only, here is an example code fragment showing how a Perl implementation might offer TMQL functionality:
# first get hold of some map
use TM;
my $map = new TM (file => 'here.ctm');
# then produce a query object and later eval it
use TM::QL;
my $query = new TM::QL ('select $p from ... $shoesize ...');
my $results = $query->eval ('%_' => $map, '$shoesize' => 42);
A query object would be built by handing in a valid TMQL expression to the constructor. At this stage processors are free to perform in-depth analysis of the expression, say, for optimization; the standard remains silent on that part.
If the query expression contains syntactic or semantic errors which can be detected without executing the expression, then this is the right moment for implementations to report them to the application. The standard defines what precisely these errors are, but it refrains from specifying a list of error messages and from saying how these errors are reported, say, as exceptions or as error codes.
Once a map has been established (here it is read from a file in CTM notation), it can be
passed into the query processor. Additional parameters can be used to import values into the
query expression. %_ just so happens to be a variable understood by the
processor to be bound to the map to be queried. There is actually no obligation to hand in any
map, in the same way as there is no upper limit on the number of maps to be handed in. This
set of variable bindings is the initial binding context during the query
execution.
Apart from that and the expression itself, only an ur-environment is conveyed to the query processor. It holds ontological commitments such as 'what is a function', 'what is a prefix', 'what is a data type' and 'what primitive data types are there'. This ur-environment can be constant, even hardcoded. Implementations are allowed to extend it by, say, offering more data types and/or more functions and predicates.
On the outgoing side a TMQL expression always creates a tuple sequence which it returns to the calling application. In the case of the Perl solution above, that tuple sequence can then be iterated over, whereby each tuple is simply a list of primitive values:
my $results = $query->eval ('%_' => $map, '$shoesize' => 42);
foreach my $tuple (@$results) {
    foreach my $value (@$tuple) {
        print $value;
    }
    print "\n";
}
If the query is used to build an XML fragment, then implementations may choose to return the individual fragments inside a tuple sequence, each fragment in one tuple, or alternatively to return everything already merged into one single XML fragment. The standard does not mandate either.
The case is similar when a topic map is generated within a TMQL expression. Here, too, the implementation can choose to return content in small pieces or in one chunk, depending on memory usage, speed or architectural decisions.
The aspect that a TMQL processor evaluates an expression to a final result is quite important, from a theoretical as well as from a pragmatic viewpoint. First, it makes the formal definition of the language rather straightforward once there is a proper (and simple) formal model of what a topic map is. While this may appear a purely academic exercise it opens the avenue to use term rewriting techniques to perform significant optimizations just by pre-evaluating subexpressions or even pruning complete subexpressions.
Once there is a precise, formal meaning for each expression more analysis can reveal how particular expressions can be rewritten into equivalent expressions with lower computational cost. This option will become even more attractive when implementations can leverage knowledge about the application domain, such as one provided by TMCL, OWL or any serious ontology language.
When the query is evaluated, implementations may also choose to leverage parallel architectures, such as grids, clusters or even online processor resources from Google or Amazon. This is only possible if the evaluation of an expression depends only on the results of its subexpressions, and not on any other context.
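The pre-evaluation of subexpressions mentioned above can be illustrated with a small sketch. It is not TMQL and not taken from the standard: the expression classes Num, Var and BinOp and the function fold are hypothetical names, and Python is used only because it is easy to run. The sketch rewrites an expression tree so that every subexpression consisting solely of constants is replaced by its value, which is only safe because each node depends on nothing but its own subexpressions.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Num:
    value: float


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class BinOp:
    op: str
    left: "Expr"
    right: "Expr"


Expr = Union[Num, Var, BinOp]

# Only two operators are supported in this toy example.
OPS = {"+": lambda a, b: a + b, "*": lambda a, b: a * b}


def fold(expr: Expr) -> Expr:
    """Return an equivalent expression with constant subexpressions pre-evaluated."""
    if isinstance(expr, BinOp):
        left, right = fold(expr.left), fold(expr.right)
        if isinstance(left, Num) and isinstance(right, Num):
            return Num(OPS[expr.op](left.value, right.value))
        return BinOp(expr.op, left, right)
    return expr  # constants and variables are left untouched


# shoesize + (2 * 21): the right operand does not depend on any variable.
query = BinOp("+", Var("shoesize"), BinOp("*", Num(2), Num(21)))
folded = fold(query)
```

Here folded is the tree for shoesize + 42; the variable part is left for evaluation time.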
TMQL has 3 'surface syntaxes' to adapt to different usage scenarios.
The more conventional syntax flavor leans towards SQL, which is familiar to most developers, industrial or academic. Accordingly, the developer has to warp his brain a bit to determine first what he needs to be delivered, only to write down later how the things he gets are related:
select $p / name, $p / shoesize
where
$p isa person
& $p / shoesize > 42
In contrast to this, the FLWR syntax is more constructive, in that a developer constructs the result bottom-up:
for $p in // Person
where
$p / shoesize > 42
return
( $p / name, $p / shoesize )
One advantage is that all variables are clearly defined, and their scope is evident. The flow of thought may also better reflect what many (procedural and functional) programmers are used to, namely iterating over lists, filtering for interesting parts and returning a newly constructed list.
One of the real benefits of the FLWR syntax is that generating XML content is also more obvious, especially as this looks almost the same as in XQuery:
return
<persons>{
for $p in // Person
where
$p / shoesize > 42
return
<person id="{$p}">
<name>{ $p / name }</name>
<shoesize>{ $p / shoesize }</shoesize>
</person>
}</persons>
The mere fact that the content inside the RETURN clause starts with <
indicates the intention to return XML fragments and not tuple sequences or Topic Map
content. Otherwise the semantics is similar: each fragment is completed with the current
value of $p to form an XML fragment. All such fragments are concatenated and
embedded into the <persons> element.
The FLWR syntax can also naturally be used to generate TM content within the RETURN clause. We will look at that later.
The third syntax flavour, path expression, is convenient when the query is actually very short and the result is quite simple. To get, for instance, the name(s) of all persons with a large footprint we can write
// Person [ . / shoesize > 42 ] / name
Path expressions (PEs) follow a natural left-to-right flow: first all instances of class Person are computed from the queried map, then each is tested whether it has a shoesize larger than 42, and then only the 'surviving' person instances are used to compute their name(s).
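The left-to-right flow can be mimicked outside TMQL. The following sketch uses Python and a toy in-memory map; the data, the field names and the helper functions instances_of and characteristic are all hypothetical. It also assumes that a comparison succeeds as soon as one of possibly several shoesize values satisfies it, which is an assumption of the sketch and not a statement about the standard.

```python
# Toy map: each topic is a dict; ids, types and values are made up.
topics = [
    {"id": "alice", "types": {"Person"}, "name": ["Alice"], "shoesize": [44]},
    {"id": "bob", "types": {"Person"}, "name": ["Bob", "Robert"], "shoesize": [40]},
    {"id": "vienna", "types": {"City"}, "name": ["Vienna"]},
]


def instances_of(all_topics, type_id):
    """Step 1 (// Person): all topics which are an instance of type_id."""
    return [t for t in all_topics if type_id in t["types"]]


def characteristic(topic, key):
    """All values of one characteristic of a topic (possibly none)."""
    return topic.get(key, [])


# Step 1: compute all instances of Person.
persons = instances_of(topics, "Person")

# Step 2 ([ . / shoesize > 42 ]): keep only those passing the filter.
large = [p for p in persons if any(s > 42 for s in characteristic(p, "shoesize"))]

# Step 3 (/ name): collect the names of the surviving instances.
names = [n for p in large for n in characteristic(p, "name")]
```

Each step consumes only the output of the previous one, which is what makes the chain easy to read for short queries.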
In terms of expressivity PEs are equal to SELECT and FLWR expressions. This is underlined by the fact that all surface syntaxes can be mapped into one and the same primitive path expression language. The readability of PEs, though, quickly suffers once things get a bit more convoluted. And they share with SELECT expressions the limitation that only tuples can be produced as results.
All numbered productions in the TMQL specification form the core syntax. It covers the complete TMQL functionality, that is, all the surface syntaxes. Writing practical TMQL expressions in this canonical form is highly inconvenient, though. To find all instances of a concept Person one would have to write the path expression
%_ [ @_[0] >> classes == Person ]
%_ stands for the currently queried map, or here for all things (topics
and associations) inside this map. Each of these items is then taken individually
(symbolized by the first and only column in the current tuple @_) and is
used as starting point of a navigation along the types axis. Accordingly, all classes (their
direct and indirect superclasses) are computed and are compared with the topic denoted via
Person. If there is at least one overlap, then this item will survive
the filtering process.
As the first column of @_ is needed quite often, @_[0]
can be abridged to a single dot (.). For >> classes == Person TMQL
also allows one to write ^ Person, so that the expression collapses to
%_ [ ^ Person ]
That can be further reduced to
%_ // Person
and then to
// Person
as %_ is the default map anyway.
Many such shortcuts have been devised, although it is still open how many will be in the finalized language. In any case, all these shortcuts do not change ANYTHING semantically. They are all defined in terms of syntactic transformations and can be completely resolved at parse time.
TMQL has adopted a handful of native types, notably strings, integers and decimals, to name the usual suspects. While the complete list will be aligned with CTM, the compact notation, it is expected that IRIs and dates will be included.
Adopting a data type implies that a TMQL processor will have to take care of the syntax of the string representation of each type and of how such a string is deserialized into an object of the data type and the other way round; it will also have to provide a set of functions for manipulating such objects.
For numeric types (integers and decimals) the rules are what one is used to from other programming languages:
42 # this is the answer to all things
3.14 # this is as exact as it gets
# some random operators
42 + 3.14
42 / 3.14
- 3.14
42 > 3.14
There will not be many operations on decimals, not only to keep the implementation costs low for TMQL developers, but also because any implementor can offer additional functions via an extension mechanism.
Strings will be fairly conventional, too. There will not be many operations on them either, except the ubiquitous concatenation and the odd regular expression operator which matches strings against a regexp pattern.
"The End is Not Near"
# some random operators
"The End is " + "Near"
# I'd wish, but the Java mafia will not allow this
$string =~ /End.+Near/
Probably the most interesting addition will be that of dates as first-class types in the language. This not only allows dates to be written down in one of the 100000 available formats, it also will allow modest date operations, including those to compare dates:
2005-10-16T10:29Z
$today > 2005-10-16
One of the most central parts of Topic Maps is about addressing topics (and the subjects they stand for). On the one hand there are the two methods which TMDM mandates, that of subject indication (indirect addressing) and that of subject locators (direct addressing), where the subject itself has an IRI associated with it. But there are also other means: one of them uses the 'internal' item identifier; the other uses a property-oriented approach, namely that you know something about the topic which makes it 'unique enough' to identify the topic in question.
Obviously, the most reliable way to address the topic of your choice is to use the subject locator, provided that the subject actually has one, for instance because it is a document. In that case it is also quite likely that this subject locator is used in the queried map. Accordingly, the following
<http://www.johnlennon.com/> =
will uniquely pinpoint the topic for this web site.
For all other subjects, which may 'only' have subject indicators, we can use something like
<http://en.wikipedia.org/wiki/John_Lennon> ~
The problem with that is that there may be many choices for appropriate subject identifiers. The likelihood that the information in the queried map and that in the query itself will remain in sync may be low. This is where ad hoc ontologies may come in handy.
Another problem with using IRIs as subject identifiers is that they clutter query texts, making everything hard to read and write. One usual escape route is to use prefixes. They have to be defined in some environment map (more about that later) and can then be used throughout the query text to form QNames:
<wp:John_Lennon> ~
Hereby we assume that wp is associated with the Wikipedia vocabulary
(namespace).
It can be expected that subject identifiers will be the predominant way of identifying topics. For this reason TMQL allows them to be written more briefly as an item reference:
wp:John_Lennon
That looks almost like the above, but is actually quite different. Using a QName or URI
standalone directly identifies the topic. If there is none with that subject identifier,
then an error will be raised. Using the navigation <wp:John_Lennon>
~ first takes a URI literal and then tries to find all items which are using it as
subject identifier. That list may be empty and no error is generated.
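The difference between the two forms can be sketched in a few lines. The sketch is in Python rather than TMQL, and the index, the function names and the exception class are hypothetical; it only mirrors the behaviour described above, where the navigation yields a possibly empty list while the standalone item reference insists on finding a topic.

```python
class UnknownTopicError(LookupError):
    """Raised when an item reference does not resolve to any topic (hypothetical name)."""


# Hypothetical index: subject identifier -> ids of topics using it.
subject_identifiers = {
    "http://en.wikipedia.org/wiki/John_Lennon": ["john-lennon"],
}


def navigate_by_identifier(iri):
    """Like <iri> ~ : all topics using iri as subject identifier, possibly none."""
    return list(subject_identifiers.get(iri, []))


def item_reference(iri):
    """Like a standalone item reference: exactly the topic, or an error."""
    matches = navigate_by_identifier(iri)
    if not matches:
        raise UnknownTopicError(iri)
    return matches[0]


found = item_reference("http://en.wikipedia.org/wiki/John_Lennon")
nothing = navigate_by_identifier("http://example.org/no-such-subject")
```

Which of the two is preferable depends on whether a missing topic should abort the query or merely produce an empty result.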
One cheapskate way is to work with the topic item identifier (sometimes confusingly referred to as 'source locator' in older literature). It is an identifier which the topic is assigned when it is created via an API or when it is loaded from an external resource (such as a CTM file). Should this identifier be known, then it can be used within a TMQL expression, such as in
john-lennon
These identifiers are conveniently short, but they suffer from the fact that some infrastructures have no reliable and robust way to keep them stable over a longer period of time.
A smaller problem with using topic item identifiers is that they can collide with TMQL keywords. While there are not many potential collision points, they do exist, such as in
for $i in $p >> characteristic where ....
Since a topic identifier can follow a characteristic axis, the
where keyword will mistakenly be interpreted as topic identifier. As the
following code, though, will not make any sense, this error can be detected at compilation
time.
In such situations one can always use * to allow TMQL to greedily consume
this as topic identifier.
Apart from this more or less strong form of identifying a subject, names usually also offer a rather good approach. As names are strings, there are two steps involved: first the string has to be converted into a topic name item, and then this name item has to be affiliated with the topic in question. All this is achieved with
"John Lennon" \ name
Accordingly, any name can be used as well, just to improve our chances to find the appropriate topic:
"John Winston Lennon" \ fullname
All this works provided that fullname is a subtype of name.
This weak form of identification not only works with names. It also works with any property put into occurrences:
1940-10-09 \ birthdate
Of course we have to expect also other people as results which happen to have the same birthdate as John.
Depending on your original data you may or may not have reliable identification data. As queries are usually kept separate from the instance data in the topic map, it is advantageous to formulate queries as robustly as possible.
If you had several subject identifiers, you can actually combine these to improve your chances that there is a match with the map you query (shortcut ORing):
<http://guess1.com/> ~ || <http://guess2.com/> ~ || <http://guess3.com/> ~
If there existed no topic with the first URI as subject indicator, the processor would attempt to resolve the second URI as subject indicator in the map, and so forth.
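The short-circuit behaviour can be sketched as follows. This is Python, not TMQL; the index, the helper by_identifier and the function first_non_empty are hypothetical, and the list attempted exists only to make visible which lookups were actually performed.

```python
def first_non_empty(*alternatives):
    """Call zero-argument lookups left to right; return the first non-empty result."""
    for alternative in alternatives:
        result = alternative()
        if result:
            return result
    return []


# Hypothetical index from subject identifier to topic ids.
identifier_index = {"http://guess2.com/": ["topic-42"]}
attempted = []  # records which identifiers were actually looked up


def by_identifier(iri):
    """Wrap a lookup so that it is only performed when called."""
    def lookup():
        attempted.append(iri)
        return identifier_index.get(iri, [])
    return lookup


match = first_non_empty(
    by_identifier("http://guess1.com/"),
    by_identifier("http://guess2.com/"),
    by_identifier("http://guess3.com/"),
)
```

In this toy run the third identifier is never consulted, because the second one already produced a match.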
If, on the other hand, your identifying characteristics are so weak that many topics in the map are likely to match one of them, you can use the intersection:
1940-10-09 \ birthdate == "Yoko Ono" \ name <- woman -> man
The == operator in TMQL acts as intersection here, so even if there were
several people with the same birthdate and even if Yoko had remarried, the overall
result would most likely contain only john-lennon.
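As a rough illustration of why intersecting two weak criteria narrows the result, consider the following Python sketch. The two input lists stand for the results of the two navigations; their contents and the function name intersect are made up for the example.

```python
def intersect(first, *others):
    """Keep the items of `first` which occur in every other sequence, preserving order."""
    return [item for item in first if all(item in other for other in others)]


# Hypothetical intermediate results, given as topic ids.
born_on_1940_10_09 = ["john-lennon", "unrelated-person"]
husbands_of_yoko_ono = ["other-husband", "john-lennon"]

candidates = intersect(born_on_1940_10_09, husbands_of_yoko_ono)
```

Neither list identifies the topic on its own, but only one id appears in both.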
There are a few predefined subjects TMQL itself assumes to exist. They should have been detailed in TMDM, but are now in the TMDM/TMRM mapping which provides the model semantics for TMDM.
One of the more prominent ones is tm:subject which stands for
everything which is in the map. It only includes the topic map items
for that matter, so not any literals. It can be abridged with *.
One primary purpose of every query language is to extract content from the backend store. That can be done in two different ways. One is to describe declaratively what the result should look like and then to match this pattern against the instance data; the bindings generated in this match are then returned to the application. SQL and other pattern matching languages follow this paradigm.
Alternatively, things of interest in the instance data can also be found by choosing a known starting point and then navigating along also known axes to the information to be identified. One representative of this approach is XPath.
As the two approaches complement each other, TMQL adopts both, at least on the user level, where convenience is the primary directive. Both approaches can be used within one query expression. That allows the user to choose the most convenient and therefore the most maintainable way. Effectively both approaches are equivalent and indeed the formal TMQL semantics maps everything into path expressions.
Obviously the navigation axes provided by TMQL are those suggested by the TMDM (Topic Maps Data Model). If you have a particular topic then you can follow one axis leading to all the topic's names. Another axis would lead to all its occurrences, another axis would deliver all topics which are the type of our original topic, yet another axis would connect to all instances. There are several of these axes, all listed in the specification (section 4.4). Some are only meant to be useful when starting with a topic item, others are meant for associations or occurrences, such as "finding the scope(s) of an association, or occurrence, or name", or "find all roles of a certain role type in an association", or "find all players of a certain role type in an association". Other axes are related to addressing (subject identification and locating), some to following reification, or to converting topic map items into values and back.
As example, let us consider the topic john-lennon. If we follow the name
axis from there
john-lennon >> characteristics name
then we can expect to get all name items of that topic. If there were a
biography occurrence, then
john-lennon >> characteristics biography
would render these occurrence items.
It is worth noting here, that TMQL does not make too much distinction between names and occurrences and subsumes them both under characteristics. Characteristics here do not include involvements in any associations (as was once historically the case).
To find the type of john-lennon we can follow the
types axis:
john-lennon >> types
and expect to see a list containing topic items, such as probably person.
When we have an association in focus, we can ask for its type or its scope
assoc-2352633773 >> scope
although most of the time their item identifier is elusive.
For some axes it is important to specify a topic to control what is actually asked for, as in
the occurrence example above. For other axes the control topic does not matter, as for
finding all types of a topic. Such a control topic, if provided, is always interpreted by
honoring any subclass(es). If our john-lennon topic contained not only a
biography but also an early-years occurrence
john-lennon isa person
biography : http://....
early-years: He was a child when he was young.
and that is known to be a subclass of biography
early-years iko biography
then the navigation step
john-lennon >> characteristics biography
will return both occurrences. That feature can be useful to find all occurrences:
john-lennon >> characteristics tm:occurrence
That works because tm:occurrence is predefined by TMQL as a concept from
TMDM and proper implementations have to know that all occurrences in a map are an instance of
tm:occurrence.
In the same vein, we can ask for all names via
john-lennon >> characteristics tm:name
relying on the fact that tm:name is predefined as well.
All characteristics can also be retrieved:
john-lennon >> characteristics tm:subject
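How a control topic is interpreted with its subclasses can be sketched as follows. The sketch is Python, not TMQL. The SUPERTYPE table is a toy hierarchy in which each type has at most one direct supertype and which places the predefined concepts above the user-defined ones; this layout, the placeholder values and the function names are assumptions made for the example.

```python
# Hypothetical toy hierarchy: type -> direct supertype (single inheritance, acyclic).
SUPERTYPE = {
    "early-years": "biography",
    "biography": "tm:occurrence",
    "name": "tm:name",
    "tm:occurrence": "tm:subject",
    "tm:name": "tm:subject",
}

# Characteristics of one topic as (type, value) pairs; the values are placeholders.
JOHN = [
    ("name", "John Lennon"),
    ("biography", "http://example.org/lennon-bio"),
    ("early-years", "He was a child when he was young."),
]


def is_a(type_id, control):
    """True if type_id equals control or is a direct or indirect subclass of it."""
    while type_id is not None:
        if type_id == control:
            return True
        type_id = SUPERTYPE.get(type_id)
    return False


def characteristics(items, control):
    """Values of all characteristics whose type honours the control topic."""
    return [value for type_id, value in items if is_a(type_id, control)]


biographies = characteristics(JOHN, "biography")
occurrences = characteristics(JOHN, "tm:occurrence")
names = characteristics(JOHN, "tm:name")
everything = characteristics(JOHN, "tm:subject")
```

Asking for biography also returns the early-years value, and the more general the control topic, the more characteristics are returned.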
All axes can be travelled upon in two directions, a forward and a
backward direction; it is quite arbitrary, though, which is which. To
find all instances of a concept we navigate along the types axis, but
this time in reverse:
person << types
To find all subclasses of person we simply reverse the
supertypes axis:
person << supertypes
Reversing is also useful to find all involvements of a certain topic in associations. The expression
john-lennon << players victim
would find all associations where john-lennon plays the victim (or any
subclass thereof).
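The two directions of an axis can be pictured as two lookups over the same relation. The following Python sketch uses a made-up table of direct types and hypothetical function names; it ignores subclassing and is meant only to show that forward and backward navigation read the same data from opposite ends.

```python
# Hypothetical relation: topic id -> set of its direct types.
types_of = {
    "john-lennon": {"person"},
    "yoko-ono": {"person"},
    "liverpool": {"city"},
}


def types_forward(topic):
    """Like topic >> types: from a topic to its types."""
    return set(types_of.get(topic, ()))


def types_backward(type_id):
    """Like type << types: from a type to all of its instances."""
    return {topic for topic, its_types in types_of.items() if type_id in its_types}


lennon_types = types_forward("john-lennon")
all_persons = types_backward("person")
```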
Different directions also make sense when navigating between topics on one side, and subject locators and subject identifiers on the other. If you had a topic in your hand, then in forward direction the locators axis would render all subject locators (or just the one). In backward direction you would start with a URL and would get all topics which use that URL as subject locator. Similar for the subject identifiers.
Many of these navigation movements have shortcuts, so for instance
john-lennon ^
is expanded to the canonical
john-lennon >> types
and therefore returns all types of the topic john-lennon. If we had to
extract the homepage occurrence from John, then the shorter
john-lennon / homepage
will expand into
john-lennon >> characteristics homepage >> atomify
Following the players axis into and out of associations can be done much more conveniently with
john-lennon <- victim -> aggressor
than with
john-lennon << players victim >> players aggressor
While there are no shortcuts for extracting subject locators and subject identifiers from a topic, some exist for the other direction: To address a topic via one of its subject identifiers a simple
<http://en.wikipedia.org/wiki/John_Lennon> ~
is sufficient; and if addressing has to happen via a subject locator, then
<http://www.johnlennon.com/> =
can be used to reach the topic which represents this website.
An atom in terms of TMQL is a data value with no further internal structure, at least as far as TMQL is concerned. Atoms are all integers, strings, dates, so any value for the predefined TMQL data types; or for any type which a particular TMQL installation additionally provides.
The problem with Topic Maps is that (a) according to the TMDM, values can only be string representations (loosely affiliated with a URI indicating the data type) and (b) values are always part of an occurrence item, which contains not only the value but also the scope and occurrence type. In this sense, characteristics, that is, both names and occurrences, are experienced in an ambivalent way: either as the complete item, or only as the value in it.
TMQL resolves this ambiguity by making each of the involved steps explicit: first there is the extraction of the characteristics item from a topic item; then there is the conversion of this characteristics item into an atom. This last process is named atomification and is also managed by an axis:
john-lennon >> characteristics birthdate
>> atomify
The obvious advantage of this explicit chain of navigation steps is that before the atomification process other intermediary steps can take place. One example would be filtering according to the scope
john-lennon >> characteristics birthdate
[ @ wikipedia ]
>> atomify
or
john-lennon >> characteristics birthdate
[ 0 ]
>> atomify
if we were only interested in a single birthdate characteristic.
Ultimately, it is the user who has to decide whether atomification has to be done, or not. Still, TMQL offers a shortcut for the ― much more frequent ― situation that atomification should be done automatically. More about that later.
The atomify axis can also be used in reverse. Accordingly, the situation is inverse in that we start off with a literal value and end up with all occurrences (or names) where this literal is used as value.
One example use is to find all characteristics where the string John
Lennon is used:
"John Lennon" << atomify
Now, this is not extremely helpful until we continue the chain and compute those topic items to which those characteristics are attached with a certain type:
"John Lennon" << atomify << characteristics tm:name
Before we see how this can be abridged, let us use the same principle with
birthdate occurrences:
1940-10-09 << atomify << characteristics birthdate
This time we navigate to all people who are born on Oct 9, 1940; also for this a shorter version exists.
Atomification not only applies to characteristic items, but to other items as well. In terms of the TMQL specification, atomifying a particular topic is a 'null-operation', i.e. it leaves the item untouched. Implementations are free to redefine this process and use their own atomification rules.
Consider the query expression
select $p where $p isa person
By default, the querying application would get topic items of type person
as-is, i.e. as items according to TMDM. If the query were modified to
select $p >> atomify where $p isa person
then a TMQL processor will be asked to trigger atomification for all these
person topics before they are returned to the application. Still, by
default, this will be a null-operation, so nothing is changed by adding the atomification step.
If our application, however, had defined various object classes, such as
PERSON, and if it had configured the TMQL processor to populate
object instances of these classes automatically, then it would get PERSON
objects without any further ado. Here is how this could look in Perl:
package PERSON;
# here are the methods and constructor
1;
my $q = new TM::QL ('select $p ... ');
$q->register ( serializers => { 'person' => 'PERSON' });
my $results = $q->eval (...);
The first line would create a query object. Before that query is evaluated, the application
tells the processor that it wants to associate the class PERSON in the
Perl program with the type person in the map. Sometimes these objects are
referred to as business objects.
Of course, this is all outside the TMQL specification, as is the question of how de-atomification would work in this case.
While it may be intellectually satisfying to use axes as conceptual framework to navigate through a Topic Map instance, in everyday situations they are just too cumbersome to write. For this purpose TMQL introduces a number of shorthand notations to alleviate the pain.
To find all biographies of John Lennon it is actually sufficient to write
john-lennon / biography
because TMQL processors will expand this to the canonical form
john-lennon >> characteristics biography >> atomify
Similarly, the shorthand for getting names
john-lennon / name
can be used instead of
john-lennon >> characteristics tm:name >> atomify
There is, unfortunately, a small complication: if a TMQL processor atomified these characteristics immediately, then something like this
john-lennon / name [ @ family ]
would not work: The filter would only 'see' the literals; and these could not be filtered according to the scope, or anything else for that matter.
The rule is that an 'atomify' navigation step does not immediately create literal values. It only schedules the characteristic for atomification, postponing the process until the literal is actually needed.
That is the case only in well-defined situations: the characteristic is passed as a parameter into some other function, be it for arithmetic operations or for comparison; the characteristic is passed back into the application; or a value is needed because we want to de-atomify it. Here is a slightly artificial example:
john-lennon / birthdate << atomify
First we figure out when John Lennon was born, and then we use the date literal to get all characteristics where it is used as value. There is also the opposite shortcut, which does de-atomification and navigates to the involved topics. The expression
john-lennon / birthdate \ birthdate
will return all topics (most likely people) who were born on the same day as John. The
\ birthdate expands to << atomify << characteristics
birthdate.
The shorthand for the reverse atomification also works for names. This can come quite handy if one happens to know a name of a certain type and has reasons to believe that it is sufficiently unique within the map to identify the topic in question:
"Ringo" \ nickname
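The postponed atomification described in this section can be approximated in a few lines of Python. The class Characteristic, its fields and the two helper functions are hypothetical and only model the idea: the complete item is passed along the chain, so that a scope filter still has something to inspect, and the bare value is extracted only at the very end.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Characteristic:
    """A name or occurrence item: type, value and scope travel together."""
    type: str
    value: str
    scope: frozenset = frozenset()


def atomify(item):
    """Reduce a characteristic item to its bare value."""
    return item.value


def by_scope(items, theme):
    """Filter on the scope; this needs the full item, not just the value."""
    return [item for item in items if theme in item.scope]


# Hypothetical names of one topic.
names = [
    Characteristic("name", "Lennon", frozenset({"family"})),
    Characteristic("name", "John Lennon"),
]

# The items stay intact through the filter and are atomified only when
# the values are handed back to the caller.
family_names = [atomify(item) for item in by_scope(names, "family")]
```

Had the values been extracted before the filter, the scope information would already have been lost.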
According to TMDM you can reify certain things in a map, specifically characteristics, associations and complete maps. In a sense, moving from a topic to the thing it reifies is a bit like zooming.
Let us assume that the queried map contains an association
marriage (man: john-lennon, woman: yoko-ono)
and that this association is reified by a topic two-in-bed which contains
more details about this relationship.
If this topic is used as starting point, then the underlying association can be easily found:
two-in-bed ~~> -> man
~~> is the shortcut for following the reifier axis forward.
If characteristics, that is, names or occurrences, were reified, this would work exactly the same way. Maps can also be reified, something we will have to revisit later. But as soon as you have a topic which reifies a map, you can zoom into this map and get all its items. The following gets all names of operas in the opera map:
opera-map ~~> // opera / name
Variables in TMQL are unlike variables in procedural, state-oriented programming languages, where they are storage for values and where values are assigned and reassigned during the course of the execution. Instead, TMQL follows the tradition of functional programming languages (yes, there is a deeper reason behind this): variables there are bound to values at some point and then are married until death tears them apart; death here meaning the end of the variable's scope.
The scope of a variable is a lexical range within a query expression. It is normally quite obvious, especially for the FLWR style where variables are explicitly declared with a FOR clause. In that case their scope extends until the end of the enclosing FLWR expression. Similarly, for variables declared with SOME or EVERY clauses, the scope extends to the end of that clause.
For path expressions the situation is quite simple: you cannot declare any variables, so
scope is not an issue. This is, of course, a double-edged sword. It makes simple queries
very simple, but does not scale well with complexity. The only variables of use in path
expressions are %_ for the whole map, i.e. all items in it, the
@_ for the whole incoming tuple and $0,
$1, etc. for the individual components of the incoming tuple. All other
variables are treated effectively as constants.
Using the SELECT style, variables are not declared explicitly, but are handled implicitly depending on how a query expression is used; more about that below.
One way variables get their values is that they are specified to range over a sequence of (computed or constant) values. This is the way FOR, SOME and EVERY clauses work:
for $p in // person ...
In this case the variable is explicitly quantified.
Alternatively, a query author can keep the variable unquantified, leaving it to the processor to implicitly let the variable range over all possible values:
select $p where $p isa person
Here $p iterates over all items in the currently queried map and
effectively each value is then tested in the WHERE clause against the condition therein.
For practical reasons this set of all possible values must be finite; TMQL defines this as
the 'map', i.e. all topic, name, occurrence and association items.
In this sense, TMQL operates on a closed world assumption, i.e. it does not assume that any information exists outside its universe.
When using the SELECT flavour in TMQL then some care has to be taken when variables are involved. Consider for instance
select $p
where
$p isa person
Here it is obvious that the variable $p should range over all instances
of person and that each of these topics should be returned. More
precisely, $p is constrained by the condition in
the WHERE clause. If we added a new variable $x to the SELECT
clause
select $p, $x where $p isa person
then $x is NOT constrained.
While one could take an orthodox position and interpret this as saying "find all instances
of person and pair each of these with ANYTHING within the queried map",
this is most likely not what the author had intended. More likely than not, queries such
as these are more the result of a typo, an error, a misunderstanding, or a combination of
all of the above. Allowing them to be valid may result in very expensive queries.
To avoid such unintentional dramatic consequences, the above expression is ruled to be invalid: a TMQL processor expects all variables appearing in a SELECT clause to be (a) either bound to a value as provided by the context, or (b) to be mentioned explicitly within the WHERE clause.
Should the user really want all things, then he has to say so:
select $thing where $thing isa tm:subject
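The validity rule above can be approximated in a few lines of Python. This is only a sketch under stated assumptions: the regular expression for variable names and the function name unconstrained_variables are made up for illustration, and a real processor would inspect the parsed query rather than raw strings.

```python
import re

# Assumption: a variable is "$" followed by word characters and optional primes.
VARIABLE = re.compile(r"\$\w+'*")

def unconstrained_variables(select_clause, where_clause, context=()):
    """Return the SELECT variables which are neither bound by the context
    nor mentioned in the WHERE clause."""
    selected = set(VARIABLE.findall(select_clause))
    constrained = set(VARIABLE.findall(where_clause)) | set(context)
    return selected - constrained

# "select $p, $x where $p isa person" leaves $x unconstrained.
offending = unconstrained_variables("$p, $x", "$p isa person")
```

A processor following the rule would reject the query whenever the returned set is not empty.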
TMQL knows about a few special variables, although most of the time you might not be interested in them. What is peculiar about them is that some are read-only, so that they cannot be redefined with new values, and some are write-only, so that when values are assigned to them, these values cannot be retrieved.
One such special variable is $_. It can be used inside a WHERE clause
when a variable is necessary for a successful match, but where one is not interested
in its actual value (read: "do not care"):
select $p / name
where
is-leader-of (organisation: $_, leader: $p, ...)
& $p isa person
Here we try to identify all leaders, i.e. persons who have been
leading an organisation at some point. We do not overly care here which organisation it
is, so we use $_ to signal this to the TMQL processor, and equally
important, to the human reader.
Several instances of $_ within the same scope are regarded as
independent from each other. So in the query
select $p / name where is-leader-of (organisation: $_, leader: $p, ...) & $p isa person & is-part-of (whole: mafia, part: $_)
the use of $_ will not have the intended effect, namely that any
matched organisation will also be part of the mafia. This also implies that the variable
$_ can never be 'read', i.e. used in a SELECT or RETURN clause. The
following is invalid
select $_ / name where $_ isa person
Another special variable is @_. It refers to 'the current tuple' as a whole,
so the (@_) projection in
john-lennon ( . / name, . / birthdate ) (@_)
is completely redundant.
More interesting are the individual tuple components, which are bound from left to
right to $0, $1,
$2, etc. To flip two columns, you could use
john-lennon ( . / name, . / birthdate ) ($1, $0)
but the variables can certainly be used as starting points for any navigation or computation:
john-lennon / shoesize ( $0 + 10, $0 \ shoesize )
That path expression would first compute John Lennon's shoesize and would then create a pair in which the first component is the shoesize increased by 10 and the second component is all topics which have John Lennon's shoesize.
Tuple variables which are NOT bound to values lead to an error, so
(1, 2) ( $2 )
will make the processor terminate.
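The behaviour of the positional variables can be mimicked with ordinary Python tuples. The following is a minimal sketch, not how a TMQL processor is required to work; the function name project and the placeholder values are assumptions made for this illustration.

```python
def project(incoming, *positions):
    """Build a new tuple from the components $0, $1, ... of the incoming tuple."""
    for position in positions:
        if position >= len(incoming):
            # Corresponds to using a tuple variable which is not bound.
            raise LookupError(f"tuple variable ${position} is not bound")
    return tuple(incoming[position] for position in positions)

# Flipping two columns, as with ($1, $0).
flipped = project(("some name", "some birthdate"), 1, 0)
```

Calling project((1, 2), 2) raises the error, mirroring the (1, 2) ( $2 ) case.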
Like any other query language, TMQL has to deal with two phases: one in which values in the queried map are identified and a second in which that content is used to produce new content.
To convey these values from the incoming to the outgoing phase, values are bound to variables. At any particular point in time a query processor will look at a variable binding set, i.e. a set of variables and the values bound to them.
Since variables are scoped, i.e. they are only visible in well-defined parts of the query expression, some variables will get their values from outer expressions, while variables defined only in a nested scope will get their values there. Effectively, a processor maintains a stack of such variable bindings. Whenever a nested query expression introduces a new variable (implicitly or explicitly), such a binding will be put onto that stack. Once the nested subexpression is completely evaluated, that binding will disappear from the stack.
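One way to picture such a stack of scoped bindings is Python's collections.ChainMap, which looks names up from the innermost mapping outwards. This is merely an analogy; the variable names and values below are invented for the illustration.

```python
from collections import ChainMap

# Outermost binding set, as it might be provided by the application.
outer = ChainMap({"%_": "the-context-map", "$shoesize": 42})

# Entering a nested expression which introduces $p pushes a new binding.
inner = outer.new_child({"$p": "some-person"})

# Inside the nested scope both the new and the outer variables are visible.
visible_in_nested_scope = (inner["$p"], inner["$shoesize"])

# Leaving the nested expression removes the innermost binding again.
after_nested_scope = inner.parents
```

After the nested scope has been left, $p is no longer visible, while $shoesize still is.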
One other variable is %_; it stands for the currently queried map, or
to be more precise, for all the items in the context map. When an
application invokes a TMQL processor it may pass the map to be queried into the query
process simply by associating it to %_ in the first place. Then it is
not necessary to mention the map inside the query expression any further as in
select $p / name
where
....
because the default (also in path expressions and using FLWR) is that
%_ is used anyway:
select $p / name
from %_
where
....
You can also explicitly name the map which should be queried, say, one which has been
affiliated with the variable %mymap at some point:
select $p / name
from %mymap
where
....
Doing that will implicitly create a new variable %_ which binds the
contents of %mymap, so that throughout the rest of the query that is
used as context map.
Another way to change the map to be queried is to follow the reification axis,
i.e. zoom into the map. For this let us assume we had a topic, say,
opera which is defined within the map (or in the environment map).
If that topic reifies a map, then the following query is evaluated against the reified map:
select $p / name
from opera ~~>
where
....
How the TMQL processor and the underlying Topic Map infrastructure handle map reification is outside the scope of TMQL.
We can even go so far that we do not need an explicit topic which stands for a complete map. A URL, interpreted as subject identifier, can also serve this purpose:
select $p / name
from <file:/home/user/mymap.xtm> ~ ~~>
where
....
When the TMQL processor evaluates the FROM clause, it does the usual thing: it first finds a URL
literal, followed by ~. This indicates a topic with that URL as
subject identifier. TMQL processors can temporarily assume such a topic in the environment
map. More importantly, the reification step leads from the topic to the map, and so
asks the processor to consume the map and make it the current context map
%_.
NOTE: This is an experimental feature.
The map to be queried is not necessarily one which is statically stored in a file or in a database, although that might be the most common case. It is also possible to compute a map before it is queried; a map is just a set of items anyway.
As one example we again query the map bound to the variable %mymap, but
this time we add to it an opera ontology which happens to be stored
on a remote web server:
select $opera / name
from %mymap ++ <http://far.aw.ay/opera.ctm> ~ ~~>
where
....
'Adding' here means, of course, merging. Whether a TMQL processor will try to download this ontology over and over again or will cache it locally, we do not care; these are all operational details.
Whenever a query is evaluated, it is done in the context of a map. In many cases such a
map will be passed in as parameter from the application, but it is also possible to pass
in several maps and ― for a particular subexpression ― select one of them to
make that the context map. If it becomes necessary, one can access this context map via
a special variable, %_; but all operations by default are referring
to it.
In any case there is also another map which is always present: the environment map. It contains everything the TMQL processor has to know, starting from the concept of a data type and that of a function (or predicate) up to all the data types the implementation offers, together with their operators and functions.
The environment is not necessarily fixed. For every (sub)expression new environmental information can be added. In the most primitive case such local environments will define prefixes together with a namespace URI, so that in that subexpression these prefixes can be used to form QNames in that namespace. Semantically speaking, though, such prefixes are nothing else than a shorthand to address an ontology. In the simplest case, as for PSI sets, that ontology will simply contain a list of subject definitions. Or, more generally, such an ontology might include a whole taxonomy, so not only the subject definitions, but also a type system for them. Or, even more generally, the ontology might contain predicates, constraints or functions to describe the structure of the domain in question.
TMQL neither forbids nor mandates any of this. Its only expectation is that all the ontological information comes in the form of a topic map: prefixes for namespaces (vocabularies) are simply topics representing the whole ontology with the namespace URI as subject indicator. Functions are topics of a certain type carrying the function body with them, and so are predicates (constraints).
Once the environment map is known for a particular subexpression, it will be 'seen
together' with the current context map, or ― in Topic Maps speak ― both will
be merged for the duration of that expression. If you ever wanted to access it, you can
use the variable %%, which is bound to it.
As we have seen above, variables can be bound to values. Used naively, this can lead to incorrect queries. As an example let us find all pairs of albums which share the same producer. In a first attempt we write:
select $album1, $album2 where is-produced-by ($album1: production, $producer: producer) & is-produced-by ($album2: production, $producer: producer)
If you have worked with declarative languages before, you may immediately spot the
problem: For the TMQL processor $album1 and $album2
are completely different variables; they might be bound to the same or to
different values for the same producer, the processor does not care.
This does not work for us if we want different albums. The usual escape hatch is to have something like this:
select $album1, $album2 where is-produced-by ($album1: production, $producer: producer) & is-produced-by ($album2: production, $producer: producer) & not $album1 == $album2
Not only is this ugly as hell, in 100% - ε of all cases developers will forget to add this (I know I will). And it does not actually reflect the developer's intention; and it does not look too elegant if you have to compare three or more such variables.
TMQL has a rather eccentric way to fine-control when variables are allowed to match anything or when they must be bound to something different. It is using primes after the variable names:
select $album, $album' where is-produced-by ($album : production, $producer: producer) & is-produced-by ($album': production, $producer: producer)
Now we have used two variables which only differ by the number of primes
(') appended. TMQL treats them as two distinct variables, but with the
additional semantics that ― within one and the same binding ― they cannot be
bound to the same value.
There is no limit to the number of primes, so should we ― by a bizarre twist of fate or customer requirements (whatever comes first) ― need three different albums, this can be achieved elegantly:
select $album, $album', $album'' where is-produced-by ($album : production, $producer: producer) & is-produced-by ($album' : production, $producer: producer) & is-produced-by ($album'': production, $producer: producer)
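The extra semantics of primed variables can be stated as a small check on a single binding set. The following Python sketch is an illustration only: the function name respects_primes and the album identifiers are assumptions, and a real processor would presumably enforce the constraint while matching rather than afterwards.

```python
from itertools import product

def respects_primes(binding):
    """True if variables which differ only in their primes are bound
    to pairwise different values within this binding set."""
    seen = {}
    for variable, value in binding.items():
        base = variable.rstrip("'")            # $album'' -> $album
        if value in seen.setdefault(base, set()):
            return False
        seen[base].add(value)
    return True

# Hypothetical albums which share one producer.
albums = ["album-a", "album-b"]
pairs = [
    (first, second)
    for first, second in product(albums, repeat=2)
    if respects_primes({"$album": first, "$album'": second})
]
```

Variables with different base names, such as $album1 and $album2, are not affected by the check and may still be bound to the same value.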
While a query is executed, the TMQL processor will keep track of which variables are bound to which values. This data structure is organized as a stack: Whenever a new variable binding, or more generally, a set of such bindings, has been created, the whole set will be pushed onto this stack. With that stack the nested inner part of a query expression is evaluated and the results are collected. Once this is done, the last binding set is removed (popped) from the stack and possibly a new binding set will take its place to repeat the evaluation of the nested query expression.
That way ― while binding sets vary ― new results will be assembled into a larger result. Once the query is complete, again the last binding set will be removed.
To kick off the whole process, there must be an initial binding set which the calling
application (or the TMQL infrastructure) provides. While the application is free to pass
in any number of variables together with values, the only thing the TMQL processor really needs is a
binding for %% to some map. That map will be interpreted as the
environment map. From then on the TMQL processor will create new binding sets on its own.
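The push, evaluate and pop cycle described here can be sketched in Python as follows. The helper names lookup and evaluate_for, as well as the placeholder values, are assumptions made for this illustration; the standard does not prescribe any such API.

```python
def lookup(stack, name):
    """Find the innermost binding for a variable."""
    for binding_set in reversed(stack):
        if name in binding_set:
            return binding_set[name]
    raise LookupError(f"variable {name} is not bound")

def evaluate_for(variable, values, body, stack):
    """Evaluate the nested expression once per value, collecting all results."""
    results = []
    for value in values:
        stack.append({variable: value})    # push a new binding set
        try:
            results.extend(body(stack))    # evaluate the nested expression
        finally:
            stack.pop()                    # remove the binding set again
    return results

# Initial binding set, as provided by the application or infrastructure.
stack = [{"%%": "environment-map", "%_": "context-map"}]

rows = evaluate_for(
    "$p",
    ["person-1", "person-2"],
    lambda s: [(lookup(s, "$p"), lookup(s, "%_"))],
    stack,
)
```

After the evaluation only the initial binding set remains on the stack.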
@@@ TO BE MERGED @@@
One special variable is %%, the current
environment. It contains everything the processing infrastructure of a TMQL has
to offer: predefined data types, functions, predicates and also other ontological
information. And since we are in Topic Map-land, that whole environment is modelled as a
topic map. Data types are topics, so are functions, predicates and so are external
vocabularies (namespaces).
Since it is a map, we can access the information by querying %%. The
following lists all functions (predefined or otherwise), i.e. their names together with
their description:
select $f ( . / name, . / description ) from %% where $f isa tmql:function
The only thing to be done is to switch the context map onto the environment map and to
query for the function topics. All these must be instances of
tmql:function, one of the types TMQL defines in its own ontology.
To find all namespace URIs we have to find first all topics in the environment map which are ontologies, i.e. something which represents a whole namespace. Once we have such a topic, we extract the subject indicator(s) for it.
select $o, $o >> indicators from %% where $o isa tmql:ontology
TMQL processors will all have to provide the ur-environment, so that is used by default. But it is certainly possible to allow applications to change it:
my $query = new TM::QL ('select $p from ... ');
my $map = new TM ('file:here.ctm');
my $results = $query->eval ('%_' => $map,
'$shoesize' => 42,
'%%' => $my_env);
WHERE clauses and filters of path expressions always contain boolean expressions. Their main purpose is to describe declaratively a particular pattern within the queried map. The processor's task is to find all combinations of variable bindings which make the boolean expression true.
Boolean expressions can be combined with boolean operators. These are all non-short-circuit, i.e. the order of the individual expressions in an AND or an OR does not matter.
In the query
select $p / name where $p isa person & plays-instrument (player: $p, instrument: drums)
the free (and only) variable is $p. A TMQL processor will therefore
existentially quantify this variable, i.e. let it range implicitly
over all items of the map. Those constellations of values which make the boolean
expression in the WHERE clause true, will be passed on to later processing stages in form
of a variable binding set. All other bindings will be discarded.
Of course, letting a variable range over all items in the map is just the specification's formal way of saying that we do not care how an implementation does it; implementations are free to develop clever and fast mechanisms, such as indices, to find quickly what is actually needed.
In the above case, an implementation may recognize the $p isa person
part and will ― instead of actually looping over all topics and associations in the
map ― use an index delivering person topics fast. Maybe it also
maintains an index over plays-instrument associations and will simply
compute an intersection of the two indices.
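Such an index-based strategy might look like the following Python sketch. Both indices and all topic identifiers are hypothetical; whether and how an implementation maintains indices is entirely up to it.

```python
# Hypothetical index from types to their instances.
instances_by_type = {"person": {"topic-1", "topic-2", "topic-3"}}

# Hypothetical index over plays-instrument associations, keyed by instrument.
players_by_instrument = {"drums": {"topic-2", "topic-4"}}

def persons_playing(instrument):
    """Intersect the two indices instead of looping over the whole map."""
    persons = instances_by_type.get("person", set())
    players = players_by_instrument.get(instrument, set())
    return persons & players

drummers = persons_playing("drums")
```

The result is the same as if $p had ranged over all items of the map and each had been tested against the WHERE clause.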
What is also worth noting, is that free variables will therefore only take values from the map, but not values from the primitive data types, such as integer. As a consequence the condition below will never be satisfied:
...
where
$p isa person
& $p / birthdate > $d
While $p is ranging happily over all person topics,
there will be no map item for $d which will make the second condition
true. This should be seen as another safety feature.
The mere fact that a variable appears in a WHERE clause does not mean it is free. In the FLWR query style all variables are declared together with the range they should iterate over:
for $p in // biological-unit where $p isa person & plays-instrument (player: $p, instrument: drums)
Now $p is non-free and the processor will not automatically let it
range over all possible values.
Once a tuple sequence has been produced, a filter can be applied to select only those tuples which satisfy a certain criterion. Accordingly, filters are postfixed to the expression producing the tuple sequence. To find all instruments Paul McCartney plays which are heavier than 50 kg, one might write:
paul-mccartney <- artist -> instrument [ . / weight > 50 ]
While filters can contain any (primitive) boolean condition, they are usually quite
short. To allow for even more conciseness a number of shortcuts have been introduced. To
filter for scope a simple [ @ my-scope ] will do. To filter for certain
types [ ^ my-type ] is enough.
There are also shortcuts when it comes to filtering along the position in the incoming
sequence. To get the first tuple from a list it is sufficient to write
[0]. Slices are also covered, although with a small gotcha for Perl
and Java developers: only the lower bound is inclusive. The filter [3
.. 5] will select those with the indices 3 and
4 but not that with 5.
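Python slices happen to follow the same half-open convention, so they make a convenient analogy. The sequence below is made up; only the index arithmetic matters.

```python
# A made-up tuple sequence with seven entries, indexed from 0.
tuples = ["t0", "t1", "t2", "t3", "t4", "t5", "t6"]

first = tuples[0]          # analogous to the filter [0]
selected = tuples[3:5]     # analogous to [3 .. 5]: indices 3 and 4 only
```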
Primitive boolean expressions might be values, so also (among others) atoms such as
undef, or the boolean values true or
false. Consider the rather synthetic example
select 1 where undef
This will be expanded to
select 1 where exists undef
and that further to
select 1 where some $_ in undef satisfies not null
Now that undef is definitely a value, albeit an undefined one,
there exists a binding for $_. And since the not
null is always TRUE, the whole clause will evaluate to TRUE.
What applies to undef also applies to all constant values, so as a
consequence it also applies to the boolean atoms true and
false:
select 1 where false
That may be somewhat non-intuitive for programmers, who would expect this query not to return anything.
It is the constant null which actually fills the role of indicating
falsehood. It expands to the empty sequence ().
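The exists semantics described above can be condensed into a few lines of Python. This is a sketch of the idea only; the names UNDEF, NULL, as_sequence and holds are assumptions of this illustration and not part of TMQL.

```python
UNDEF = object()   # stand-in for the atom undef
NULL = ()          # null expands to the empty sequence

def as_sequence(value):
    """A single value is treated as a sequence with one entry."""
    return value if isinstance(value, tuple) else (value,)

def holds(value):
    """A condition holds if the sequence contains at least one value,
    no matter what that value is."""
    return len(as_sequence(value)) > 0

results = {
    "undef": holds(UNDEF),
    "true": holds(True),
    "false": holds(False),
    "null": holds(NULL),
}
```

Only null yields a condition which does not hold; even the atom false is a value and therefore satisfies the condition.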
To express a condition which is only supposed to hold for a defined set of variable
bindings, TMQL offers the SOME clause. If our universe of discourse contained several
people, but only one single musician, then
some $p in // person satisfies $p isa musician
would be true.
Interesting is the corner-case where there are no persons in the first
place; here the semantics rules that the whole condition is false.
The SOME clause is actually only syntactic sugar and can always be rewritten as a FLWR expression. In our case this would be
for $p in // person where $p isa musician return $p
That expression returns a non-empty tuple sequence exactly when there is at least one person who is a musician.
The SOME clause is effectively generalizing the exists semantics of value comparison. To make this obvious we observe that the condition
where $p <- artist -> instrument / name == 'Piano'
is equivalent to the more longwinded
where
some $n in $p <- artist -> instrument / name
satisfies $n == 'Piano'
The EVERY clause is syntactic sugar on top of the SOME clause and only serves to spare query authors from twisting their brains. Someone who watches too many Australian Idol TV shows might write:
every $p in // person satisfies $p isa musician
Since TMQL assumes a closed world, this can be equivalently transcribed into
not (some $p in // person
satisfies not ($p isa musician) )
which is exactly what the formal semantics does.
One of the consequences is, though, that if there is not a single person in the map, then the whole condition is true.
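The relationship between the two quantifiers, including both corner cases for an empty range, can be replayed in Python. The functions some and every are named after the TMQL clauses for readability only; the sample data is invented.

```python
def some(values, condition):
    """True if at least one value satisfies the condition."""
    return any(condition(value) for value in values)

def every(values, condition):
    """Defined, as in the transcription above, as: not (some ... not ...)."""
    return not some(values, lambda value: not condition(value))

# Hypothetical persons together with the fact whether they are musicians.
is_musician = {"person-1": True, "person-2": False}

some_musician = some(is_musician, is_musician.get)
all_musicians = every(is_musician, is_musician.get)

# Corner cases: no persons at all.
some_of_nothing = some([], is_musician.get)
every_of_nothing = every([], is_musician.get)
```

With an empty range, some is false while every is true, which matches the semantics stated for the two clauses.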
There are also use cases where it is necessary to give upper and lower limits on how many matched patterns there exist. If we had to look for girlie power bands, i.e. those where there are at least 5 members, then a query
where
$group isa group
& at least 5 $g in $group -> member
satisfies $g isa female
can achieve this. Of course, there is also a way to constrain the upper bound and that is
simply done by using at most N. As with the lower
bound N must be a positive integer greater than 0.
These clauses seem to be superfluous as one might achieve the constraint also by counting the numbers
fn:count ( $group -> member [ . isa female ] ) >= 5
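but there is a subtle but important difference: with count one is effectively
asking to compute all values and then to count these in order to
compare the total with the lower or upper limit. With at least and at most, a processor is free to stop as soon as the limit is reached or exceeded.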
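The practical difference can be made visible with a lazily evaluated sequence in Python. This is a sketch under the assumption that a processor may stop early when evaluating an at least clause; the function names and the synthetic member sequence are made up for the illustration.

```python
from itertools import islice

def at_least(n, values, condition):
    """Stop pulling values as soon as n matches have been found."""
    matches = (value for value in values if condition(value))
    return len(list(islice(matches, n))) == n

def count_at_least(n, values, condition):
    """Count every match first and compare afterwards."""
    return sum(1 for value in values if condition(value)) >= n

inspected = []

def members():
    """A synthetic sequence which records how far it has been consumed."""
    for number in range(1000):
        inspected.append(number)
        yield number

# Every second synthetic member satisfies the condition.
enough = at_least(5, members(), lambda member: member % 2 == 0)
inspected_by_at_least = len(inspected)

inspected.clear()
also_enough = count_at_least(5, members(), lambda member: member % 2 == 0)
inspected_by_count = len(inspected)
```

Both variants agree on the answer, but the first one stops after the fifth match while the second one walks through all members.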
Conditions do not allow optional matching, i.e. to set up sub
conditions which may, or may not be true. SPARQL, the query language for RDF-based data
has to have this as it is only pattern oriented. In TMQL path expressions provide a way to
deal with optional information: one just tries to navigate to the relevant parts and when
the result happens to be empty, one can choose whether a default value such as
undef should be used:
select $p / name, $p <- member -> group / name || undef where $p isa person
In the example above we list all persons in our map and additionally check whether they
are part of a (music) group. If they happen not to be a member of any group, then
undef will take the place of the value there. That avoids the person being
completely discarded if it had no membership.
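The effect of such a default can be imitated in Python, with None playing the role of undef. The function or_default and the membership data are assumptions made for this sketch.

```python
# Hypothetical memberships: person-2 is not a member of any group.
memberships = {"person-1": ["group-a"], "person-2": []}

def or_default(values, default=None):
    """Return the values, or a single default if there are none."""
    values = list(values)
    return values if values else [default]

rows = [
    (person, group)
    for person, groups in memberships.items()
    for group in or_default(groups)
]
```

Without the default the second person would not show up in the rows at all.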
Every so often one needs to constrain topics by their involvement in associations. If we needed to find all musicians who play drums then the following may deliver this:
select $p / name where $p isa person & plays-instrument (artist: $p, instrument: drums)
What we are actually looking for are all associations of type
plays-instrument, where there is one player $p who
plays the role artist and another player drums for
the role instrument.
A TMQL processor, however, will interpret this a bit more abstractly, in that it also allows
matching associations to be of any subtype of
plays-instrument; and also the role types in matching associations may
be subtypes of those specified (artist and
instrument); which implies that an association in a map
conducts (conductor: karajan, orchestra: berlin-philharmonic)
will be successfully matched by the predicate as long as conducts is a
subclass of plays-instrument, conductor subclasses
artist and orchestra subclasses
instrument.
What we are also implicitly saying with the above predicate, is that there must not be any other role in such matching associations. An association with other roles, such as
conducts (conductor: karajan,
orchestra: berlin-philharmonic,
concert : beethovens-ninth)
should be dismissed. If this is not wanted, then it has to be signalled to the TMQL
processor that other roles may well exist when matching with the predicate. This is
achieved by adding the ellipsis ... as last player:
select $p / name where $p isa person & plays-instrument (artist: $p, instrument: drums, ...)
While association predicates look like a special language feature, they are actually path expressions in disguise. The predicate invocation
plays-instrument (artist: $p, instrument: drums, ...)
is interpreted as look for all associations of type
plays-instrument (and its subtypes), which have one role
artist (or any of its subtypes) with the value which coincides with
that of $p; and another role instrument with a
player drums. This can be formulated as a path expression
quite easily:
// plays-instrument [ . -> artist == $p ] [ . -> instrument == drums ]
Since the ellipsis at the end indicates that we do not care about other roles, both forms are equivalent.
Things get slightly more complicated if the ellipsis is missing, i.e. only the two roles and no further ones are allowed. Also this can be translated into a path expression, albeit a less obvious one:
// plays-instrument [ . -> artist == $p ]
[ . -> instrument == drums ]
[ not (. >> roles -- artist << superclasses
-- instrument << superclasses ) ]
The only thing which has to be added is the last filter. It first extracts all roles
from the association under consideration. From this list it then subtracts first the
artist (and all its subclasses) and then the
instrument (and its subclasses). If there is any other role left, then
the association does not pass the test.
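The matching rules for association predicates, with and without the ellipsis, can be summarized in a short Python sketch. It assumes a simple single-inheritance type hierarchy held in a dictionary; the data structures and the names is_subtype and matches are inventions of this illustration, not part of TMQL.

```python
# Assumed type hierarchy: each type has at most one direct supertype.
SUPERTYPE = {
    "conducts": "plays-instrument",
    "conductor": "artist",
    "orchestra": "instrument",
}

def is_subtype(candidate, wanted):
    """True if candidate is wanted itself or one of its (transitive) subtypes."""
    while candidate is not None:
        if candidate == wanted:
            return True
        candidate = SUPERTYPE.get(candidate)
    return False

def matches(association, pattern, allow_other_roles=False):
    """association and pattern are pairs (type, [(role type, player), ...]);
    a pattern player of None stands for "any player"."""
    assoc_type, roles = association
    pattern_type, wanted = pattern
    if not is_subtype(assoc_type, pattern_type):
        return False
    for wanted_role, wanted_player in wanted:
        if not any(is_subtype(role, wanted_role) and wanted_player in (None, player)
                   for role, player in roles):
            return False
    if allow_other_roles:                  # the ellipsis was given
        return True
    # Strict form: every role must be covered by one of the pattern's roles.
    return all(any(is_subtype(role, wanted_role) for wanted_role, _ in wanted)
               for role, _ in roles)

pattern = ("plays-instrument", [("artist", None), ("instrument", None)])
two_roles = ("conducts", [("conductor", "karajan"),
                          ("orchestra", "berlin-philharmonic")])
three_roles = ("conducts", two_roles[1] + [("concert", "beethovens-ninth")])

strict_two = matches(two_roles, pattern)
strict_three = matches(three_roles, pattern)
open_three = matches(three_roles, pattern, allow_other_roles=True)
```

The association with the additional concert role is only accepted when other roles are explicitly allowed.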
A TMQL processor cannot operate in a vacuum. Inside a query expression it has to be possible to use outside information, such as data types, predefined or not, and their related functions; also ontological information about the map being queried must become available, be it just the taxonomy or be it something which also includes rules and constraints in some sort of rule language, or even logic. What it also involves is that it must be possible for a developer to add his own functions, predicates and concepts to a TMQL expression as needed. This is all collected in the processing environment.
Conceptually speaking, everything of the above can be regarded as ontological information and, since we are still in TM-land, as a topic map. To achieve this viewpoint, all predefined concepts have to be thought of as topics, possibly organized into classes and possibly connected via associations.
Using whole URIs to identify topics (actually the subjects they stand for) certainly clutters queries. Using prefixes for namespaces is a comfortable way to shorten queries considerably, so TMQL also uses this mechanism; albeit, not in a syntactic way, like in XML, or RDF, but more appropriately here, in a semantic sense.
It is straightforward to view namespaces as (external) vocabularies. One example of this would be Wikipedia which itself is organized in a topic-oriented way, giving many concepts a distinct URL.
If we plan to use many references to topics done with subject identifiers from the
Wikipedia URL space, then using a prefix for
http://en.wikipedia.org/wiki/ can help to keep queries readable. Such a
prefix can be easily declared in the environment map, just before the query expression in which it
is meant to be visible:
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ CTM?
""" wp isa tmql:ontology ~ http://en.wikipedia.org/wiki/ """
select $p / name where $p / birthdate <= wp:John_Lennon / birthdate
The environment map here contains only a single topic, one with item identifier
wp and subject indicator
http://en.wikipedia.org/wiki/. The fact that this topic is an instance
of a tmql:ontology (more about that later) makes the TMQL processor
do only one thing: it will register that wp can be used as prefix in
QNames in followup query expressions. The namespace bound to this prefix is that Wiki URL.
If now inside a query expression an item reference is a QName with such a prefix, that
will expand to the subject identifier provided for it, in our case
http://en.wikipedia.org/wiki/John_Lennon. If we had a topic in our
queried map which is using this as subject identifier, then we have successfully addressed
our topic.
These prefixes also work inside IRI literals, where QNames are used as shorthand, such as
in "wp:John_Lennon" = to address the Wikipedia page about John Lennon
itself.
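The expansion of a prefixed QName into a full subject identifier is simple string work, as the following Python sketch shows. The function name expand_qname and the dictionary of prefixes are assumptions; in TMQL the prefixes would be taken from the ontology topics in the environment map.

```python
def expand_qname(qname, prefixes):
    """Replace a registered prefix by the namespace URI bound to it."""
    prefix, separator, local_name = qname.partition(":")
    if not separator or prefix not in prefixes:
        raise KeyError(f"no namespace registered for {qname!r}")
    return prefixes[prefix] + local_name

# The prefix declared by the wp topic of the example above.
prefixes = {"wp": "http://en.wikipedia.org/wiki/"}

identifier = expand_qname("wp:John_Lennon", prefixes)
```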
We can drive the above prefix mechanism further by interpreting the namespace indicated by
the URI as a namespace known to the TMQL processor. Obvious examples
here are all the namespaces for those things the TMQL processor has to know about in the
first place, such as the namespace for TMDM concepts (topic, association, ...), one for
all the data types it offers (integers, etc.), including all the functions and operators
in it; and, of course, one namespace for TMQL itself, where it defines things like
tmql:ontology we have already used above.
How a processor knows certain namespaces ― apart from the predefined ones ― and how it can learn about others, is outside TMQL itself. But it should be clear that such a mechanism is an excellent way to extend existing processing environments.
As one example consider the following Python code
import math
import TMQL  # hypothetical TMQL implementation
proc = TMQL.processor()
proc.registerNS ('http://maths.is.great/', math)
proc.eval ('select ....')
In this hypothetical TMQL implementation we would first create a processor object. Before it is used to evaluate a TMQL query, the application registers a library under a particular URI. In the query itself we declare a topic which uses this URI as subject identifier:
""" math isa tmql:ontology ~ http://maths.is.great/ """
The processor will register that math is a new prefix. It also will
look at the connected namespace URI and will then realize that this very URI has been used
before by the application to register a whole Python mathematics library. If now in the
followup query expression a function math:sqrt is referenced, then the
processor will simply invoke this function from that library.
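What might happen inside such a processor can be sketched with a small registry. Everything here is hypothetical: the class NamespaceRegistry and its methods are invented for the illustration and do not belong to any existing TMQL implementation; only Python's standard math module is real.

```python
import math

class NamespaceRegistry:
    """Toy registry linking namespace URIs, prefixes and Python libraries."""

    def __init__(self):
        self._libraries = {}   # namespace URI -> library object
        self._prefixes = {}    # prefix -> namespace URI

    def register_ns(self, uri, library):
        """Done by the application before the query is evaluated."""
        self._libraries[uri] = library

    def declare_prefix(self, prefix, uri):
        """Done when an ontology topic is found in the environment map."""
        self._prefixes[prefix] = uri

    def resolve(self, qname):
        """Map a function reference such as math:sqrt to a callable."""
        prefix, _, name = qname.partition(":")
        library = self._libraries[self._prefixes[prefix]]
        return getattr(library, name)

registry = NamespaceRegistry()
registry.register_ns("http://maths.is.great/", math)
registry.declare_prefix("math", "http://maths.is.great/")

square_root = registry.resolve("math:sqrt")
value = square_root(16.0)
```

The link between the prefix and the library is made solely through the shared namespace URI, as described in the text.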
In an ideal world, the queried map will contain all information necessary to produce meaningful results. But for practical reasons this will not always be the case. Sometimes bits and pieces of the information have to be added explicitly, sometimes a whole vocabulary or even a type system has to be added to the queried map.
So, before a query can be executed this missing knowledge has to be added, i.e. merged into the currently queried map. Such additional domain knowledge can be simply more topics or more associations. It can also include additional functions and also predicates.
In any case, there are two minor options here: Either that knowledge is stored in an external resource, a local file or on a remote server; or that knowledge is directly embedded into the query. Which one to choose, is simply a matter of convenience.
Let us first assume that we only need little additional information which we would like to hard-code into the query, namely, that Tom Waits and Kathleen Brennan have composed an opera, a fact missing from the queried map. Accordingly, the environment map contains a topic which itself contains a topic map in one of its occurrences
tomwaits-info isa tmql:ontology
! name: Tom Waits Info
description: just a few bits and pieces we need for the query
tmql:body: """
composed-by
composer: tom-waits kathleen-brennan
opus : opera-alice
opera-alice isa opera
! name: Alice
# subject indicators for all topics go here
"""
Now that tomwaits-info is known to stand for a topic map we can use the
zoom operation to reach that map and add it to the context map %_:
select $opera from %_ ++ tomwaits-info ~~> where $opera isa opera & composed-by (composer: tom-waits, opus: $opera)
If we were diligent enough to take care that all appropriate topics of the original map are merged with the little information we had in our inlined map, then the opera Alice would be part of the result.
Alternatively, we could also have stored additional information into a separate resource,
say, into a file tomwaits-info.ctm. In that case the URL
file:/where/ever/tomwaits-info.ctm serves as a perfect subject
identifier for the map therein. The environment will now contain:
""" tomwaits-info isa tmql:ontology ~ file:/where/ever/tomwaits-info.ctm """
Whenever we ask the TMQL processor to zoom into this map with
tomwaits-info ~~>, then it will realize that there is no inline
tmql:body, but instead a subject identifier. It will resolve this
resource and will consume it as a CTM stream (which might be the default format). Once this
has been successfully loaded, it can be merged with the queried map.
As a minor variation of the theme we can also consider other formats than CTM, say LTM.
"""
ltm isa tmql:ontology ~ http://www.ontopia.net/download/ltm.html
tomwaits-info isa tmql:ontology
isa ltm:instance ~ file:/where/ever/tomwaits-info.ltm
"""
What has changed compared to before is that the Tom Waits information is now additionally an
LTM instance, in other words a stream according to LTM. That it is
LTM and not something else we tell the TMQL processor by declaring the
ltm namespace on the first line. It is the provided URI which will help
the TMQL processor to recognize the LTM deserializer software. That can be hard-coded into
a platform which is using LTM frequently. But as in the example with the
math library before, the application could be put in charge of this as
well. In that case it might be responsible for binding a deserializer object to the
namespace:
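# Ontopia.TopicMap.LTM and TMQL are illustrative module names only
from Ontopia.TopicMap import LTM
import TMQL
proc = TMQL.processor()
# bind an LTM deserializer object to the namespace URI used in the environment
proc.registerNS ('http://www.ontopia.net/download/ltm.html',
LTM.Deserializer())
proc.eval ('select ....')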
What if an ontology exists, but is not in Topic Map format but, say, in OWL? TMQL processors are free to convert from other notations and paradigms into the Topic Map universe:
"""
owl isa tmql:ontology ~ http://www.w3.org/2002/07/owl#
opera isa tmql:ontology
isa owl:Ontology ~ http://www.ope.r.us/opera.owl
"""
In a sense the topic map representing the opera ontology is virtual.
To bootstrap all knowledge a TMQL processor has about the world, we first have to define TMQL's concepts, such as 'function', 'predicate' and 'ontology'. This is done by the standard:
tmql isa: tmql:ontology ~ http://www.isotopicmaps.org/tmql/1.0
to affiliate the TMQL vocabulary with that subject identifier.
At evaluation time the usual happens: the TMQL processor will analyze the environment
map, will find a topic (tmql) with a certain subject identifier, will
recognize that as a predefined one and will provision the TMQL concepts for further use.
Apart from some minimal TMQL concepts, a TMQL processor has to understand types,
specifically the adopted primitive types, such as integer or
decimal. This information, too, is organized in the initial
environment map, including facet information about these types, such as their
serialization syntax (how a value is written as text, what the canonical
representation is). And types are certainly organized into a taxonomy (type system), so that,
say, integer is modelled as a subclass of decimal.
Functions in TMQL work quite conventionally at first sight: if a function is invoked, then first the parameters are evaluated and then passed into the function. The body of the function then somehow produces a result, whereby TMQL functions are pure as they cannot modify anything on the outside, at least as far as the TMQL processor is concerned. It is still possible to create side-effects by calling functions from external libraries, and these are free to do whatever they are programmed to do.
To allow a function to be invoked, it has to be declared first. For the set of predefined functions this is the task of the processing environment.
Semantically speaking, functions define a relationship between values. With that, functions describe properties and constraints of the application domain. In this sense they are ontological knowledge.
Especially with functions having few parameters, it can be cumbersome to always explicitly name the formal parameters (named parameter affiliation). TMQL functions can also get their parameters using positional parameter affiliation.
To demonstrate this, we rewrite the function nr-albums, replacing all
variables which are meant to be parameters with $0,
$1, etc:
"""
nr-albums isa tmql:function
description: computes number of albums for a certain person
tmql:return: fn:count (// albums [ . <- opus -> author == $0 ])
"""
The function can then be invoked with nr-albums ($p).
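The difference between named and positional parameter affiliation can be illustrated by how a processor might build the variable binding for the function body. The following Python sketch is only an assumption about one possible implementation; bind_parameters is a hypothetical helper, not something the standard defines.

```python
from typing import Any, Dict


def bind_parameters(*positional: Any, **named: Any) -> Dict[str, Any]:
    """Build the variable binding for a function body.

    Positional arguments become $0, $1, ...; named arguments become $name.
    """
    binding = {f"${i}": value for i, value in enumerate(positional)}
    binding.update({f"${name}": value for name, value in named.items()})
    return binding


positional = bind_parameters("tom-waits")      # as in nr-albums ($p)
named = bind_parameters(person="tom-waits")    # named parameter affiliation
print(positional)
print(named)
```

In both cases the function body sees a single variable bound to the same topic; only the variable name differs.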
Functions are not restricted to producing simple values. As the body of the function can be anything which returns content, a function can also return whole tuple sequences. They can also return TM or XML content, as in the following example:
get-xml-albums is-a tmql:function
description: produce an XML fragment with all albums
tmql:return: "
<albums>{
for $a in // album return
<album>{ $a / name }</album>
}</albums>
"
Whenever the function is invoked, it will create an XML fragment carrying an <albums> node; embedded in it are <album> nodes with the album title.
As TMQL offers predefined data types (integer, decimal, etc.) and also defines its own (tuple sequences), it will also provide functions and operators for values of these types. The current thinking is that all relevant functions and operators from XQuery 1.0 and XPath 2.0 are included. Needless to say, this would be cool, but it also puts quite some burden on implementors as this list is massive.
In any case, both the functions and the operators will be preloaded for a TMQL processor:
"""
fn isa tmql:ontology ~ http://www.w3.org/TR/xpath-functions#
op isa tmql:ontology ~ http://www.w3.org/TR/xpath-functions#
"""
just to make sure that the prefixes fn and op can be
used inside a query:
select op:numeric-add ($p / age, 20) where ....
(Of course, that can be written shorter as $p / age + 20, as the infix
operator is mapped onto op:numeric-add. Which operators can be used as
prefixes, infixes or postfixes will be defined in the TMQL standard.)
TMQL also includes a rather abstract extension mechanism to allow TMQL infrastructures to add function libraries. For TMQL a suite of functions (and the related types and objects) is 'just an ontology'.
So if we had a Python library for HTTP support, then a TMQL processor might pick up on the following subject identifier
""" http isa tmql:ontology ~ http://my.name.space/http/ """
because the application linked the library with it at some earlier point. Here is an example of how this might pan out in Python:
import httplib
proc = TMQL.processor()
proc.registerNS ('http://my.name.space/http/', httplib)
When the TMQL processor encounters a function invocation such as http:get
('http://www.google.com') it will first associate the prefix
http with 'some ontology' represented by a topic. It will then find
that topic in the environment and will check (well, at least once) its
subject identifier. Since that matches the URI of one of the registered libraries, it will
try to call get there. The remaining details are all a local matter for
the TMQL infrastructure.
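The lookup chain just described (prefix, ontology topic, subject identifier, registered library, function) can be sketched as follows. To keep the sketch runnable without network access it registers Python's standard math module instead of an HTTP library; the namespace URI http://my.name.space/math/ and the names environment, register_ns and invoke are made up for illustration and are not prescribed by TMQL.

```python
import math
from typing import Any, Dict

# Environment map reduced to its essence: prefix -> subject identifier
# of the tmql:ontology topic. The URI is a made-up example.
environment: Dict[str, str] = {"math": "http://my.name.space/math/"}

# Libraries the application has registered, keyed by namespace URI.
registered: Dict[str, Any] = {}


def register_ns(uri: str, library: Any) -> None:
    """Bind a library object to a namespace URI."""
    registered[uri] = library


def invoke(qname: str, *args: Any) -> Any:
    """Resolve 'prefix:local' and call the function in the bound library."""
    prefix, local = qname.split(":", 1)
    uri = environment[prefix]      # topic found, subject identifier checked
    library = registered[uri]      # library the application linked to that URI
    return getattr(library, local)(*args)


register_ns("http://my.name.space/math/", math)
result = invoke("math:sqrt", 16)
print(result)
```

An unknown prefix or an unregistered URI simply raises KeyError here; how a real processor reports such errors is left open.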
A small variation of the scheme is to have the functions not in a 3rd party library written in, say, Java, Python or better Perl; instead the external function library uses TMQL itself. In that case all functions must be defined as topics within a map, say, using CTM, LTM or AsTMa:
...
group-size isa tmql:function
desc: returns size of a Pop group
tmql:return : { fn:count ($0 <- group -> member) }
nr-girlie-groups isa tmql:function
desc: computes nr of girlie groups
tmql:return : .....
...
Having this stored in file:groupies.atm, that URL can also serve as
subject indicator for the group of functions which can be found there. In the TMQL query
expression we simply refer to it:
""" grp isa tmql:ontology ~ file:/usr/local/tmql/groupies.atm """
and then use the functions as in, say, grp:group-size (wp:U2).
The procedure for the TMQL processor is also similar to above. This time, though, it will
have to follow the subject identifier file:/usr/local/tmql/groupies.atm
to hunt down the necessary information. We leave it here to the processor to figure out
for itself the format the map is stored in. In any case, we expect the processor to parse the
document and realize the functions in there.
The fact that functions are topics allows a last variation. A processor may allow a developer to use a language other than TMQL, say, Python:
"""
ctime isa tmql:function
return @ python: """
from time import ctime
return ctime()
"""
"""
All we needed to do was to set the scope accordingly. Needless to say, this is another excellent extension mechanism, but only if the function itself is short enough to be directly included, which probably rules out COBOL, FORTRAN and Java (grin).
When using the SELECT or the path expression flavour of TMQL, you will always get a sequence of tuples of 'simple things', i.e. atomic values. In the expression
select $p / name, $p / birthdate where $p isa person
the overall result is a table, the first column filled with people's names and the second column with their birthdate(s).
Only if our queried map is formed in such a way that every person has exactly one name and exactly one birthdate will there be a one-to-one relation between persons and the tuples of the result sequence. If one person had two names, then each name would appear with one and the same birthdate; if one person had no birthdate information, a tuple for that person might never be generated.
In most cases this is the intention, in others you may want to use a default value to enforce a value there:
select $p / name, $p / shoesize || undef where $p isa person
and in other cases, you may want to be certain to get exactly one name (regardless of its type):
select $p / name [0], $p / shoesize || undef where $p isa person
This is the place where the developer is exposed to the flexibility of the TM data model. Ultimately, it is up to him (or her) to decide what should go into the final result.
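The way tuples multiply or vanish, and how a default value keeps a row alive, can be reproduced with a small Python model. It is a sketch only: topics are modelled as plain dictionaries of value lists, the data is made up, None stands in for undef, and the function project is a hypothetical name.

```python
from itertools import product
from typing import Dict, List, Optional, Sequence, Tuple

# Made-up sample data: the first person has two names,
# the second has no birthdate at all.
people: Dict[str, Dict[str, List[str]]] = {
    "p1": {"name": ["Ann", "Annie"], "birthdate": ["1970-01-01"]},
    "p2": {"name": ["Bob"], "birthdate": []},
}


def project(person: Dict[str, List[str]],
            columns: Sequence[str],
            default: Optional[Dict[str, Optional[str]]] = None
            ) -> List[Tuple[Optional[str], ...]]:
    """Combine every value of every column into tuples.

    An empty column yields no tuple at all, unless a default
    value has been supplied for that column.
    """
    default = default or {}
    value_lists = []
    for column in columns:
        values: List[Optional[str]] = list(person.get(column, []))
        if not values and column in default:
            values = [default[column]]
        value_lists.append(values)
    return list(product(*value_lists))


plain = [t for p in people.values()
         for t in project(p, ["name", "birthdate"])]
filled = [t for p in people.values()
          for t in project(p, ["name", "birthdate"], {"birthdate": None})]
print(plain)
print(filled)
```

Without a default, Bob does not appear at all; with a default, he appears once with an undefined birthdate, while Ann still contributes one row per name.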
TMQL does not need a dedicated syntax for grouping, because it is implicit in the way tuple sequences are constructed. According to the TMQL semantics, binding sets are tested first, and only if the tests succeed with these binding sets will new tuple sequences be generated.
To illustrate this, let us consider the following:
select $p / name, $p / shoesize where $p isa person
Here first all binding sets will be generated where $p is bound to one
person item. Only then for each such binding set the tuple expression
in the SELECT clause is evaluated. Each such individual binding set creates a tuple
sequence, so that the grouping is done along the binding sets. The overall result is then
the concatenation of all these partial tuple sequences. In the above case the processor
will do this concatenation, but as there is no further requirement to keep the partial
tuple sequences together, processors may deliver the whole sequence in any order.
One way of grouping the partial sequences is using the ORDER BY clause:
select $composer / name, $composer / birthdate where composed (... $composer ...) order by $composer / name
For each binding set the processor will evaluate the value expression in the ORDER BY clause and will expect to see exactly ONE value there (it can be empty, though, and if there are more, just one is picked). Then the respective binding sets are sorted according to these values. In this order then the binding sets are used to evaluate the tuple expression inside the SELECT clause.
One consequence of all this is that a processor will now have the blocks of partial tuple sequences ordered according to the composer's name. Inside one such block (again, one partial tuple sequence may be arbitrarily long) there is no ordering at all.
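The interplay of ORDER BY and grouping can be modelled in Python: the binding sets are sorted first, and only then is the tuple expression evaluated per binding set, so that each binding set contributes one contiguous block. This is a sketch with made-up data; picking the first value as the sort key and letting empty values sort first are assumptions of the sketch, not rules taken from the standard.

```python
from itertools import product
from typing import Dict, List, Tuple

Topic = Dict[str, List[str]]

# Made-up composers; the second one has two names.
composers: List[Topic] = [
    {"name": ["Zed"], "birthdate": ["1950"]},
    {"name": ["Abe", "Abraham"], "birthdate": ["1960"]},
]


def order_key(topic: Topic, prop: str) -> str:
    """Exactly one value is used for ordering; here simply the first."""
    values = topic.get(prop, [])
    return values[0] if values else ""


def select(bindings: List[Topic], columns: List[str],
           order_by: str) -> List[Tuple[str, ...]]:
    """Sort the binding sets, then concatenate one block per binding set."""
    result: List[Tuple[str, ...]] = []
    for topic in sorted(bindings, key=lambda t: order_key(t, order_by)):
        block = list(product(*(topic.get(c, []) for c in columns)))
        result.extend(block)   # no ordering is imposed inside a block
    return result


rows = select(composers, ["name", "birthdate"], "name")
print(rows)
```

The two rows for the second composer stay together as one block, placed before the block of the first composer.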
One small variation demonstrates that the ordering criterion is completely independent from the information returned. This time we sort according to the composer's shoesize:
select $composer / name, $composer / birthdate where composed (... $composer ...) order by $composer / shoesize
Not surprisingly, several ordering criteria can be specified, and in general one can also specify whether the sorting is ascending or descending:
select $composer / name, $composer / birthdate where composed (... $composer ...) order by $composer / shoesize, fn:count ($composer <- composer) desc
Here we first try to sort the binding sets according to the
shoesize. If there is a draw, then we use the number of associations
where the composer appears in the composer role. Just for
the sake of demonstration, we use this number in descending order.
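Sorting by several criteria with mixed directions corresponds to a composite sort key. A minimal Python sketch with made-up data, where the field works stands in for the number of associations in which the composer plays the composer role:

```python
from typing import List, NamedTuple


class Composer(NamedTuple):
    name: str
    shoesize: int
    works: int  # stand-in for fn:count ($composer <- composer)


composers: List[Composer] = [
    Composer("A", 42, 3),
    Composer("B", 40, 1),
    Composer("C", 42, 7),
]

# shoesize ascending; ties broken by number of works, descending
ordered = sorted(composers, key=lambda c: (c.shoesize, -c.works))
names = [c.name for c in ordered]
print(names)
```

Negating the second component is a convenient trick for numeric values; for non-numeric criteria a two-pass stable sort would be needed instead.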
Interestingly, the ordering clause can exist, but be empty:
select $composer / name, $composer / birthdate where composed (... $composer ...) order by
This captures the only remaining case that we want grouping, but without having to commit to any sorting order. Yes, I know, this is extremely elegant.
Independently of the sorting of binding sets, the partial tuple sequences can also be subjected to sorting. This is directly encoded in the SELECT clause:
select $composer / name asc, $composer / shoesize desc ... order by ...
As expected, the default for ordering is ascending; but any sorting
only happens when there is at least one asc or desc
somewhere in the tuple expression. Otherwise, we are back to the case where we do not care
about the order.
In some cases, no grouping whatsoever should happen; instead, the overall result should be sorted according to a given criterion. This is actually a special case of grouping, whereby the group size is limited to 1.
To demonstrate this, let us assume we wanted to get a sorted list of composer names, together with the number of each composer's composed items:
select $name, fn:count ($composer <- composer) where $composer... & $name == $composer >> characteristics name order by $name
Obviously we have made the name explicit by introducing a variable for it. That way the
binding set will always have two variables, $composer and
$name. If the ORDER-BY clause does the sorting, it will sort the whole
result list.
This method can be generalized to more criteria if for each such a new variable is introduced.
The sorting mechanism within path expressions relies only on that for sorting tuples. As is the case with the other flavours, no sorting will occur by default. If I asked for all persons' names and ages in the context map with
// person (. / name, . / age)
then the result will contain a sequence of name/age pairs, all in no particular order.
Assuming for a second that there is exactly one name and one age for each person in the map, then
// person (. / name asc, . / age)
will do exactly what is expected, namely rearrange the sequence so that the name component is ordered. Should a person have several age values (maybe unlikely), then one name may appear any number of times together with these ages, but one name will form a group.
Obviously it also makes a difference at which level the sorting is requested. In
// person / name asc
the first thing which is done is to generate all persons' names, and only then is the sorting done. This contrasts with a more localized sort
// person ( . / name asc)
where for each person the list of names is generated, but only that list is ordered. When these partial lists are concatenated, nothing is said about the overall order.
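The difference between the two sorting levels is easy to see in a small Python model, where each person is reduced to a list of names. The data is made up, and the order in which the per-person lists are concatenated (here simply list order) is an artefact of the sketch, since the text above leaves it open.

```python
from typing import List

# Each inner list holds the names of one person (made-up data).
persons: List[List[str]] = [["Zoe", "Ann"], ["Mia", "Bea"]]

# // person / name asc
# All names are generated first, then the whole sequence is sorted.
global_sorted = sorted(name for names in persons for name in names)

# // person ( . / name asc )
# Each person's names are sorted; the partial lists are then concatenated.
local_sorted = [name for names in persons for name in sorted(names)]

print(global_sorted)
print(local_sorted)
```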
TMQL expressions can be used to generate XML fragments. The idea is to acknowledge the fact of life that there will be various different organizational principles for content around for a while: relational data, hierarchical tree-oriented information, and graph-like information, such as Topic Maps (and, yeah, RDF).
If XML generation were not part of TMQL, there would be two other options to arrive at XML-organized data coming from a TM backend store: One path is to define a fixed XML vocabulary (in its own namespace) into which queried content is converted by the processor; the application will have to postprocess this in all likelihood so that it fits its purpose.
The other is to leave it up to the application to create DOM nodes itself directly according to its needs. This has to be done while iterating over a result list; if the application engineer wants to avoid these lists becoming huge, he will have to organize the XML generation into loops, starting with the top-level and then firing individual TMQL queries against the database in the inner levels. Needless to say, this is VERY expensive as it needs a lot of interaction between application and TMQL processor. For this reason, the whole process has been moved into the TMQL processor, consciously making the language bigger for that part.
It is worth stressing that this is NOT templating. So a TMQL processor will not interpret the specified XML content as text stream into which bits and pieces from the queried topic map have to be embedded. Instead a RETURN clause is fully pre-parsed by the processor. In
return
<albums>{
for $a in // album return
<album>{$a / name [@ en ]}</album>
}
</albums>
it will recognize XML content because of the leading opening angle bracket. It will follow the tags and use specific rules for how content is supposed to be embedded.
Most of the time, any generated content will be converted into its textual form as TEXT nodes. If topic map content is to be embedded, then the processor will use XTM, automatically converting topic map items into that format.
Regardless of the nesting level, the overall result is again a sequence of tuples. This time each tuple contains an XML node, be it a TEXT node carrying whitespace and line breaks or an ELEMENT node.
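The difference from templating can be illustrated with Python's standard xml.etree.ElementTree module: content is attached to element nodes as TEXT nodes rather than pasted into a text stream, so markup characters in the data are escaped on serialization. The album titles are made up, and the sketch says nothing about how a TMQL processor is actually implemented.

```python
import xml.etree.ElementTree as ET

# Made-up stand-in for the result of: for $a in // album return $a / name
album_names = ["Alice", "Night & Day"]

albums = ET.Element("albums")
for name in album_names:
    album = ET.SubElement(albums, "album")
    album.text = name  # becomes a TEXT node; never parsed as markup

xml_text = ET.tostring(albums, encoding="unicode")
print(xml_text)
```

Because the structure is built from nodes, the ampersand in the second title cannot break the well-formedness of the result.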
On the one hand, TMQL generalizes the XML structure in that it allows dynamic content to
be embedded. This is done using a {} pair, mimicking XQuery which does
the same. Still, there is one limitation: generated XML content
MUST start with an XML element and not with a TEXT node:
return
this will not work
<albums>{
for $a in // album return
<album>{$a / name [@ en ]}</album>
}
</albums>
On the other hand, the places where this can happen are quite limited. At this stage it is only allowed within attribute names, attribute values and inside XML elements. So it is possible to dynamically generate element names or attribute names. There is also support for namespaces, but not for processing instructions or CDATA nodes. This is all the application has to control at the outset.
For TM generation, CTM, the Compact Topic Maps syntax (once it is finished), is used. The obvious use case is to transform one map (and the vocabulary used therein) into another map with a possibly different vocabulary. It could look something like this:
for $p in // person
where
has-composed (composer: $p, opus: $_)
return """
{ $p } # this copies the whole topic verbatim
{fn:id ($p)} isa composer
"""
For each composer we have found in the queried map, we simply copy the whole topic. This
is achieved by { $p }. The variable $p is already
bound to a topic item and when such an expression is encountered in a CTM text stream
where a topic is expected, the whole topic information is simply echoed there.
There are a few other similar rules where the place in the CTM syntax controls what should actually be embedded. One of them allows whole name items to be embedded:
for $p in // person
where
has-composed (composer: $p, opus: $_)
return """
* isa composer
{ $p / name } # this copies the name items
"""
The above would leave it up to a TMQL processor to generate an item identifier. But it
would make sure that this new topic is an instance of composer. Apart from that, all name
items of the person bound to $p are computed. And as that is done where
the CTM stream allows name items, all of these would be copied into the new topic. Of
course, this works the same way with occurrence items.
While generating content, query expressions can be nested in various ways, be
unconditional (enclosing it in {}) or
conditional using an if-then-else construct.
Content generation depending on a condition is most obvious when using the FLWR style
for $p in // person
return
if $p / shoesize > 32 then
($p / name, "bigfoot")
else
($p / name, "smallfoot")
but it can be used in other styles:
select $p / name,
if $p / shoesize > 32 then "bigfoot" else "smallfoot"
Without many limitations, dynamically generated content can be used wherever content is expected:
select $group / name,
" members are: " + fn:string-join (
fn:tuple ({
select $person / name
where
is-part-of (member: $person, whole: $group)
}), ",")
where $group isa group
It finds music groups and concatenates the member names.
The function fn:tuple only takes a tuple sequence and creates one,
large tuple from it, so that the result is a list of all atoms in the tuple sequence. The
fn:string-join is derived from Perl's join taking a
list of elements and a second parameter for the separator token to be used. In our case we
used a comma to eventually create one string.
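What the two functions do can be mimicked in a few lines of Python. The functions fn_tuple and fn_string_join below are only stand-ins written for this illustration, with made-up member names; they are not an implementation of the XQuery/XPath function library.

```python
from typing import Iterable, Tuple


def fn_tuple(sequence: Iterable[Tuple]) -> Tuple:
    """Flatten a tuple sequence into one large tuple of its atoms."""
    return tuple(atom for row in sequence for atom in row)


def fn_string_join(items: Iterable, separator: str) -> str:
    """Join the string forms of all items, like Perl's join."""
    return separator.join(str(item) for item in items)


# Made-up result of the nested select: one name per tuple.
members = [("Ann",), ("Bob",), ("Cid",)]

flat = fn_tuple(members)
line = "Band" + " members are: " + fn_string_join(flat, ",")
print(line)
```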
The only consideration is that there are certain rules which the TMQL processor will apply if it has to embed a certain kind of content into another. So, for instance, an XML fragment will be serialized into text form if it is embedded into an expression involving operators, as in the following example:
select $p / name,
"shoesize : " + # this is string concatenation
if $p / shoesize > 30 then
<bigfoot/>
else
<smallfoot/>
where
$p isa person
The query is actually initiated by finding first all instances of the concept
person. For each of them, the expressions in the SELECT clause are
evaluated. The first column there is always a string containing one person name. The
second column is also a string, but one that is concatenated from a constant string and
the result of a nested query expression.
There are 3 operators to construct larger portions of content from smaller ones.
The operator ++ does sequence concatenation, and depending on the
nature of content (tuple sequences, XML content or TM content) it does mean slightly
different things. For tuple sequences it means that the sequences are just
concatenated. If one has been ordered in some way then this ordering is maintained in that
any before-after relations are not destroyed. For XML content the ++
just combines fragments, building a larger fragment. And for TM content
++ is interpreted as merging.
Formally, the TMQL machinery treats everything as tuple sequences. XML nodes inside an XML fragment are organized as nodes in a tuple in a sequence. And topic map content is organized as items within a singleton tuple within a sequence. With that it is easy to write dedicated expressions for particular cases and then combine the results together whenever it is convenient.
The following example returns a complete topic map. First, it introduces some static concepts and then generates the rest from the map:
return """
grammy2007 iko grammy
"""
++
{
for $a in // artist
where
fn:random() > 0.95
return """
gets-award (award: grammy2007, awarded: $a)
"""
}
The operator ++ is then used to merge the individual TM fragments
as generated by the FLWR expression together.
The operator -- can be read as except, so it
subtracts the elements of the second operand from the first. For tuple sequences (and thus
also XML fragments) this is defined via the comparability of tuples. For TM content it
implies that certain items will be suppressed, should they ever exist in the map.
To make sure, for instance, that Jessica Simpson never ever gets a Grammy, we would tweak the above to:
return """
grammy2007 iko grammy
"""
++
{
for $a in // artist
where
fn:random() > 0.95
return """
gets-award (award: grammy2007, awarded: $a)
"""
}
--
"""
gets-award (award: grammy2007, awarded: jessica-simpson)
"""
The last content operator, ==, computes the intersection of the two
operand tuple sequences. It is very convenient when it comes to determining whether there is
an overlap between two sequences.
$p / birthdate == 1940-10-09
On the surface, we test whether the person's birth date has one particular value. But a
person might have several birthdate occurrences. The TMQL processor
would not know otherwise without more ontological background. What is really tested here is
whether there exists at least one occurrence of type
birthdate with that value, or in other words whether there is an overlap of
values between the tuple sequence on the left and that on the
right. == therefore implements the exists semantics of
comparison. This is the justification for writing == and not
=.
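For plain tuple sequences the three content operators can be modelled with Python lists of tuples. This sketch covers only the tuple sequence case (not XML or TM content); the function names and the data are made up, and treating a non-empty intersection as 'true' is the reading of the exists semantics assumed here.

```python
from typing import List, Tuple

Seq = List[Tuple]


def concat(left: Seq, right: Seq) -> Seq:
    """++ : concatenation; existing before-after relations are preserved."""
    return list(left) + list(right)


def except_(left: Seq, right: Seq) -> Seq:
    """-- : everything from the left that does not occur on the right."""
    return [t for t in left if t not in right]


def overlap(left: Seq, right: Seq) -> Seq:
    """== : the tuples occurring on both sides; non-empty means a match."""
    return [t for t in left if t in right]


# A person with two birthdate occurrences (made-up data).
birthdates = [("1940-10-09",), ("1940-10-10",)]

joined = concat(birthdates, [("1941-01-01",)])
remaining = except_(birthdates, [("1940-10-10",)])
matches = bool(overlap(birthdates, [("1940-10-09",)]))
print(joined, remaining, matches)
```

The comparison succeeds although the person has a second, different birthdate, which is exactly the exists semantics described above.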
In a closed world assumption everything which is not explicitly said is known to be wrong. In this sense, TMQL behaves like SQL in that it assumes that the map to be queried holds all available information.
This has a number of consequences, mostly on how the semantics is defined. When, for instance, variables are supposed to range over all possible items, then this range is finite as any queried map is always finite. Another consequence is that the FORALL operator can be mapped to the EXISTS operator:
forall $p in // person satisfies $p isa composer == not some $p in // person satisfies not $p isa composer
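Since the set of persons in a map is finite, this equivalence can be checked mechanically. The following Python sketch does so for a few made-up maps, using the built-ins all and any as stand-ins for FORALL and SOME; the data layout and helper names are assumptions of the sketch.

```python
from typing import Dict, List, Set

Person = Dict[str, Set[str]]


def is_composer(person: Person) -> bool:
    return "composer" in person["types"]


def duality_holds(persons: List[Person]) -> bool:
    """forall ... satisfies P  ==  not some ... satisfies not P"""
    forall_side = all(is_composer(p) for p in persons)
    some_side = not any(not is_composer(p) for p in persons)
    return forall_side == some_side


mixed = [{"types": {"person", "composer"}}, {"types": {"person"}}]
only_composers = [{"types": {"person", "composer"}}]
empty: List[Person] = []

checks = [duality_holds(m) for m in (mixed, only_composers, empty)]
print(checks)
```

Note that for an empty map both sides are trivially true.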
And the quantified quantifiers can also be mapped to each other, as for instance
not at least 5 $p in // person satisfies $p isa composer
==
at most 4 $p in // person satisfies $p isa composer
TMQL does not include general inferencing, that is, the derivation of new knowledge from
existing knowledge, except for one kind which is directly wired into TMDM: taxonometric reasoning. That
simply means that (a) if a concept C is a subtype of
B and that is in turn a subtype of A, then
C is also a subtype of A. And (b) if a
concept B is a subtype of A, then any instance of
B is also an instance of A.
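Both rules amount to a transitive closure over the subtype relation. A minimal Python sketch, using the concept names A, B and C from the text and a made-up topic x; the dictionaries and function names are assumptions of the sketch and say nothing about how TMDM stores this information.

```python
from typing import Dict, Set

# C is a subtype of B, B is a subtype of A; topic x is an instance of C.
subclass_of: Dict[str, Set[str]] = {"C": {"B"}, "B": {"A"}}
instance_of: Dict[str, Set[str]] = {"x": {"C"}}


def supertypes(concept: str) -> Set[str]:
    """Rule (a): all direct and indirect supertypes of a concept."""
    seen: Set[str] = set()
    stack = [concept]
    while stack:
        current = stack.pop()
        for parent in subclass_of.get(current, set()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def is_instance(topic: str, concept: str) -> bool:
    """Rule (b): an instance of a subtype is an instance of the supertype."""
    return any(concept == t or concept in supertypes(t)
               for t in instance_of.get(topic, set()))


print(supertypes("C"))
print(is_instance("x", "A"))
```

The set of already visited concepts also makes the traversal terminate if the subtype relation accidentally contains a cycle.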
Inferencing is enabled when the infrastructure knows more about the problem domain than the instance data (facts) within the map. That additional knowledge would allow a computer to infer more facts without these having to be stored anywhere.
Usually this additional knowledge is given via rules (predicates), functions, or also additional facts, such as topics or taxonometric knowledge (type system). As all of this can be regarded as ontological knowledge it was decided NOT to burden a TMQL processor with this functionality.
Instead, if such inferencing is needed, implementations will have to provide it in a layer which a TMQL processor will use transparently.
Potential implementors may want to risk a glimpse at the current TMQL editor's draft. With its roughly 50 core syntax rules and roughly 25 shorthand notations it is well below the complexity of SPARQL. This does not count the grammar of CTM, which an implementor may have to include. The standard will be around 35 pages (without appendices), which may be half that of SPARQL.
What is not so obvious at first sight is that the different language flavours themselves do not add any computational complexity. Both select and FLWR expressions can be mapped into path expressions, so that most of these syntactic variations are swallowed by a parser anyway. This is also true for the majority of shorthand notations introduced; each of them can be dealt with in a single line of code, at least in Perl.
With all that we request feedback from users and developers, regardless of whether that revolves around usability, applicability to particular application domains or general feasibility of implementation.