TMQL Language Features (2008-04-01)

Robert Barta


  

$Id$

Abstract

This document covers most of the relevant TMQL language features and design rationale. THIS IS WORK IN PROGRESS.


Table of Contents

Introduction
Interaction with Application (Part I)
Interaction with Application (Part II)
Functional Language
Multisurface Syntax
Shortcuts and Core Syntax
Using Constants
Numerics
Strings
Dates
URIs, URLs, IRIs, whatever
Identifying Topics
The TMDM way
Item Identifiers
Naming or other Properties
Combinatorial Identification
Ontological Commitments
Navigation
Axes
Control Topic
Directions
Navigation Shortcuts
Atomification
Reverse Atomification (De-Atomification)
Application-Specific Atomification/Deatomification
Atomification Shorthands
Following Reification
Variables
Variable Binding (Implicit vs. Explicit Quantification)
SELECT and free variables
Special Variables
Controlling Variable Bindings
Query Context
Environment Map
Conditions
Free Variables and Exists Semantics
Filters
Null, Undef, False and True
AND and OR
SOME and EVERY Quantification
Quantified Quantifiers
Optional Matching
Association Predicates (Part I)
Association Predicates (Part II)
Prefix Handling
Prefixes for Namespaces
Prefixes for Predefined Namespaces
Prefixes for External Maps
Ur-Environment
Functions
Side-Effect Free
Parameters
Named vs. Positional Parameters
Predefined Functions
External Libraries
Generating Content
Tuples of Simple Things
Grouping (Binding Set Sorting)
Grouping (Tuple Sorting)
Global Sorting
Sorting Within Path Expressions
Generating XML
Generating Topic Maps
Conditional and Unconditional Content
Content Operators
Architecture
Closed World
Taxonometric Inferencing
General Inferencing
Wrapping Up

Introduction

TMQL is an expression language which allows one to extract content from topic maps, or more generally from any backend store which is organized along the Topic Maps paradigm.

Like any other query language, TMQL has two components: one for detecting certain patterns in the data, and a second for computing the result. For the pattern detection TMQL offers not only pure pattern matching, such as finding all the people who are involved in an is-married-with association, but also navigation axes along which further information can be investigated.

On the output side you can produce not only tabular data ― TMQL calls these tuple sequences ― but also XML content and Topic Maps content. Because of the latter, TMQL and CTM have quite some functional overlap, and care has been taken that the two languages can cooperate nicely.

Interaction with Application (Part I)

Implementations will of course provide an API for applications to invoke queries on a particular map (or a set of maps). For demonstration only, here is an exemplary code fragment showing how a Perl implementation might offer TMQL functionality:

# first get hold of some map
use TM;
my $map     = new TM     (file => 'here.ctm');
# then produce a query object and later evaluate it
use TM::QL;
my $query   = new TM::QL ('select $p from ... $shoesize ...');
my $results = $query->eval ('%_' => $map, '$shoesize' => 42);

A query object would be built by handing in a valid TMQL expression to the constructor. At this stage processors are free to perform in-depth analysis of the expression, say, for optimization; the standard remains silent on that part.

If the query expression contained syntactical or semantic errors which can be detected without executing the expression, then this would be the right moment for implementations to report this fact to the application. What precisely these errors are is written down in the standard, but the standard refrains from specifying a list of error messages, and also from saying how these errors are reported, say, as exceptions or as error codes.

Once a map has been established (here it is read from a file in CTM notation), it can be passed into the query processor. Additional parameters can be used to import values into the query expression. %_ just so happens to be a variable understood by the processor to be bound to the map to be queried. There is actually no obligation to hand in any map, in the same way as there is no upper limit on the number of maps to be handed in. This set of variable bindings is the initial binding context during the query execution.

Apart from that and the expression itself, only an ur-environment is conveyed to the query processor. It contains ontological commitments such as 'what is a function', 'what is a prefix', 'what is a data type' and 'what primitive data types are there'. This ur-environment can be constant, even hardcoded. Implementations are allowed to extend it by, say, offering more data types and/or more functions and predicates.

Interaction with Application (Part II)

On the outgoing side a TMQL expression always creates a tuple sequence which it returns to the calling application. In the case of the above Perl solution that tuple sequence can then be iterated over, whereby each tuple is simply a list of primitive values:

my $results = $query->eval ('%_' => $map, '$shoesize' => 42);
foreach my $tuple (@$results) {
   foreach my $value (@$tuple) {
      print $value;
   }
   print "\n";
}

If the query is used to build an XML fragment, then implementations may choose to return the individual fragments inside a tuple sequence, each fragment in one tuple. Or, alternatively, everything already merged into one single XML fragment. The standard does not mandate this.

The case is similar when a topic map is generated within a TMQL expression. Here, too, the implementation can choose to return content in small pieces or in one chunk, depending on memory usage, speed or architectural decisions.

Functional Language

The aspect that a TMQL processor evaluates an expression to a final result is quite important, from a theoretical as well as from a pragmatic viewpoint. First, it makes the formal definition of the language rather straightforward once there is a proper (and simple) formal model of what a topic map is. While this may appear a purely academic exercise it opens the avenue to use term rewriting techniques to perform significant optimizations just by pre-evaluating subexpressions or even pruning complete subexpressions.

Once there is a precise, formal meaning for each expression, further analysis can reveal how particular expressions can be rewritten into equivalent expressions with lower computational cost. This option will become even more attractive when implementations can leverage knowledge about the application domain, such as that provided by TMCL, OWL or any serious ontology language.

When evaluating a query, implementations may also choose to leverage parallel architectures, such as grids, clusters or even online processor resources from Google or Amazon. This is only possible if the evaluation of one expression depends only on the results of its subexpressions, but not on any other context.
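To see why side-effect freedom enables such rewriting, consider constant folding: any subexpression built only from constants can be replaced by its value before the map is ever touched. The following Python sketch is purely illustrative (the expression representation is invented, not part of TMQL):

```python
# Minimal, hypothetical sketch of constant folding in an expression tree.
# Because evaluation is side-effect free, a subexpression whose operands
# are constants can be replaced by its value at compile time.

def fold(expr):
    """Recursively pre-evaluate constant subexpressions.

    An expression is either a constant (number), a variable name (str),
    or a tuple (op, left, right)."""
    if not isinstance(expr, tuple):
        return expr
    op, left, right = expr
    left, right = fold(left), fold(right)
    if isinstance(left, (int, float)) and isinstance(right, (int, float)):
        if op == '+':
            return left + right
        if op == '>':
            return left > right
    return (op, left, right)

# ('>', ('+', 40, 2), 3.14) is fully constant and collapses to True;
# a variable blocks folding only of the node that contains it.
print(fold(('>', ('+', 40, 2), 3.14)))
print(fold(('+', '$shoesize', ('+', 1, 1))))
```

A real processor would of course work on its own parse tree and would prune whole branches (e.g. a filter that folds to false), but the principle is the same.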

Multisurface Syntax

TMQL has 3 'surface syntaxes' to adapt to different usage scenarios.

The more conventional syntax flavor leans towards SQL, which is familiar to most developers, industrial or academic. Accordingly, the developer has to warp his brain a bit to determine first what he needs delivered, only to write down later how the things he gets are related:

   select $p / name, $p / shoesize
   where
       $p isa person
     & $p / shoesize > 42
	

In contrast to this, the FLWR syntax is more constructive, in that a developer constructs the result bottom-up:

   for $p in // Person
   where
       $p / shoesize > 42
   return
       ( $p / name, $p / shoesize )
	

One advantage is that all variables are clearly defined, and their scope is evident. Also, the flow of thought may better reflect what many (procedural and functional) programmers are used to, namely to iterate over lists, filter for the interesting parts and return a newly constructed list.

One of the real benefits of the FLWR syntax is that generating XML content is much more obvious, especially as it looks almost the same as in XQuery:

return
   <persons>{
      for $p in // Person
      where
         $p / shoesize > 42
      return
         <person id="{$p}">
             <name>{ $p / name }</name>
             <shoesize>{ $p / shoesize }</shoesize>
         </person>
   }</persons>
	

The fact alone that the content inside the RETURN clause starts with < indicates the intention to return XML fragments and not tuple sequences or Topic Map content. Otherwise the semantics is similar: each fragment is completed with the current value of $p into an XML fragment. All such fragments are concatenated and embedded into the <persons> element.

The FLWR syntax can also naturally be used to generate TM content within the RETURN clause. We will look at that later.

The third syntax flavour, path expressions, is convenient when the query is actually very short and the result is quite simple. To get, for instance, the name(s) of all persons with a large footprint we can write

 // Person [ . / shoesize > 42 ] / name
	

Path expressions (PEs) follow a natural left-to-right flow: first all instances of class Person are computed from the queried map, then each is tested whether it has a shoesize larger than 42, and then only the 'surviving' person instances are used to compute their name(s).

In terms of expressivity PEs are equal to SELECT and FLWR expressions. This is underpinned by the fact that all surface syntaxes can be mapped into one and the same primitive path expression language. The readability of PEs, though, quickly suffers once things get a bit more convoluted. And they share with SELECT expressions that only tuples can be produced as results.

Shortcuts and Core Syntax

All numbered productions in the TMQL specification form the core syntax. It covers the complete TMQL functionality across all the surface syntaxes. Writing practical TMQL expressions in this canonical form is highly inconvenient, though. To find all instances of a concept Person one would have to write as path expression

  %_ [ @_[0] >> types  == Person ]
	

%_ stands for the currently queried map, or here for all things (topics and associations) inside this map. Each of these items is then taken individually (symbolized by the first and only column in the current tuple @_) and is used as the starting point of a navigation along the types axis. Accordingly, all classes (direct ones and all their superclasses) are computed and compared with the topic denoted via Person. If there is even a single overlap, then this item will survive the filtering process.

As the first column of @_ is needed quite often, @_[0] can be abridged with a single dot (.). For >> types == Person TMQL also allows one to write ^ Person, so that the expression collapses to

%_ [ ^ Person ]

That can be further reduced to

%_ // Person

and then to

// Person

as %_ is the default map anyway.

Many such shortcuts have been devised, although it is still open how many will make it into the finalized language. In any case, none of these shortcuts changes ANYTHING semantically. They are all defined in terms of syntactic transformations and can be completely resolved at parse time.
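Because the shortcuts are pure syntactic transformations, a parser can resolve them by rewriting before evaluation ever starts. Here is a deliberately naive Python sketch; the two rewrite rules only mirror the expansions shown above, and a real processor would of course work on parse trees rather than raw strings:

```python
import re

# Toy resolver for two of the shortcuts discussed in the text.
# Rule 1: '// Person'  ->  '%_ [ ^ Person ]'
# Rule 2: '^ Person'   ->  '@_[0] >> types == Person'
RULES = [
    (re.compile(r'//\s*(\w[\w:-]*)'), r'%_ [ ^ \1 ]'),
    (re.compile(r'\^\s*(\w[\w:-]*)'), r'@_[0] >> types == \1'),
]

def expand(query):
    """Apply rewrite rules repeatedly until the query text is stable."""
    previous = None
    while query != previous:
        previous = query
        for pattern, replacement in RULES:
            query = pattern.sub(replacement, query)
    return query

print(expand('// Person'))
```

Running this on `// Person` yields the canonical `%_ [ @_[0] >> types == Person ]`, illustrating that shortcut removal is a purely compile-time step.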

Using Constants

TMQL has adopted a handful of native types, notably strings, integers and decimals, to name the most usual suspects. While the complete list will be aligned with CTM, the compact Topic Maps notation, it is expected that IRIs and dates will be included.

Adopting a data type implies that a TMQL processor has to take care of the syntax of the string representation of that type, of how such a string is deserialized into an object of the data type and the other way round; and it has to provide a set of functions for manipulating such objects.
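What such an adoption entails can be pictured as a small registry that ties each data type to a parser, a serializer and some operations. All names below are invented for illustration; they are not part of any TMQL specification:

```python
from datetime import date

# Hypothetical sketch of what 'adopting a data type' means for a
# processor: parsing a literal into a typed object, serializing it
# back, and offering operations on the resulting objects.
DATATYPES = {
    'integer': {'parse': int,   'serialize': str},
    'decimal': {'parse': float, 'serialize': str},
    'date':    {'parse': date.fromisoformat,
                'serialize': date.isoformat},
}

def atom(literal, datatype):
    """Deserialize a string representation into a typed value."""
    return DATATYPES[datatype]['parse'](literal)

def literal(value, datatype):
    """Serialize a typed value back into its string representation."""
    return DATATYPES[datatype]['serialize'](value)

# round trip, plus a comparison operation on the typed objects
birthday = atom('1940-10-09', 'date')
assert literal(birthday, 'date') == '1940-10-09'
assert atom('42', 'integer') > atom('3.14', 'decimal')
```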

Numerics

For the numeric types (integers and decimals) the rules are as one is used to from other programming languages:

	42   # this is the answer to all things
        3.14 # this is as exact as it gets

        # some random operators
        42 + 3.14
        42 / 3.14
        - 3.14
        42 > 3.14
      

There will not be many operations on decimals, not only to keep implementation costs low for TMQL developers, but also because any implementor can offer additional functions via an extension mechanism.

Strings

Strings will be fairly conventional, too. But there will not be many operations on them, except the ubiquitous concatenation and the odd regular expression operator which matches a string against a regexp pattern.

        "The End is Not Near"

        # some random operators
        "The End is " + "Near"
        # I'd wish, but the Java mafia will not allow this
        $string =~ /End.+Near/
	

Dates

Probably the most interesting addition is that of dates as first-class types in the language. This not only allows dates to be written down in one of the 100000 available formats, it will also allow modest date operations, including those to compare dates:

        2005-10-16T10:29Z

        $today > 2005-10-16
	

URIs, URLs, IRIs, whatever

Of course IRIs play an important role in Topic Maps and in addressing things. But they are also supported as a data type, simply by putting them inside angle brackets:

	  <http://the.end.is.near.no/?>
	

There will probably not be too many operators for these.

Identifying Topics

One of the most central parts of Topic Maps is the addressing of topics (and the subjects they stand for). On the one hand there are the two methods which the TMDM mandates: that of subject identification (indirect addressing) and that of subject locators (direct addressing), where the subject itself has an IRI associated with it. But there are also other means, one of them being that the 'internal' item identifier is used; another uses a property-oriented approach, namely that you know something about the topic which makes it 'unique enough' to identify the topic in question.

The TMDM way

Obviously, the most reliable way to address the topic of your choice is to use the subject locator; if that subject actually has one, that is, because that requires the subject to be a document or some other addressable resource. In that case it is also quite likely that this subject locator is used in the queried map. Accordingly, the following

<http://www.johnlennon.com/> =

will uniquely pinpoint the topic for this web site.

For all other subjects, which may 'only' have subject identifiers, we can use something like

<http://en.wikipedia.org/wiki/John_Lennon> ~

The problem with that is that there may be many choices of appropriate subject identifiers. The likelihood that the information in the queried map and that in the query itself remain in sync may be low. This is the place where ad-hoc ontologies may come in handy.

Another problem with using IRIs as subject identifiers is that they clutter query texts, making everything hard to read and write. One usual escape route is to use prefixes. They have to be defined in some environment map (more about that later) and can then be used throughout the query text to form QNames:

<wp:John_Lennon> ~

Hereby we assume that wp is associated with the Wikipedia vocabulary (namespace).

It can be expected that subject identifiers will be the predominant way of identifying topics. For this reason TMQL allows one to write them more shortly, as an item reference:

wp:John_Lennon

That looks almost like the above, but is actually quite different. Using a QName or URI standalone directly identifies the topic. If there is no topic with that subject identifier, then an error will be raised. Using the navigation <wp:John_Lennon> ~ first takes a URI literal and then tries to find all items which use it as subject identifier. That list may be empty, and no error is generated.
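The distinction can be sketched in a few lines of Python. The map contents and function names here are invented for illustration: standalone use of an identifier must resolve to a topic or raise an error, whereas the ~ navigation merely filters and may yield an empty result:

```python
# Invented toy map: topic item identifier -> set of subject identifiers
MAP = {
    'john-lennon': {'http://en.wikipedia.org/wiki/John_Lennon'},
    'yoko-ono':    {'http://en.wikipedia.org/wiki/Yoko_Ono'},
}

def topic_by_identifier(iri):
    """Standalone QName/IRI use: must resolve, otherwise an error."""
    for topic, identifiers in MAP.items():
        if iri in identifiers:
            return topic
    raise LookupError('no topic with subject identifier ' + iri)

def topics_indicated_by(iri):
    """The '<iri> ~' navigation: a filter, possibly empty, never an error."""
    return [t for t, ids in MAP.items() if iri in ids]

assert topics_indicated_by('http://example.org/nobody') == []   # no error
try:
    topic_by_identifier('http://example.org/nobody')
except LookupError:
    print('error raised, as required for direct identification')
```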

Item Identifiers

One cheapskate way is to work with the topic item identifier (sometimes confusingly referred to as 'source locator' in older literature). It is an identifier which the topic got assigned when it was created via an API or when it was loaded from an external resource (such as a CTM file). Should this identifier be known, then it can be used within a TMQL expression, such as in

john-lennon

These identifiers are conveniently short, but they suffer from the fact that some infrastructures have no reliable and robust way to keep them stable over a longer period in time.

A smaller problem with using topic item identifiers is that they can collide with TMQL keywords. While there are not many potential collision points, they do exist, such as in

for $i in $p >> characteristics
where
  ....

Since a topic identifier can follow the characteristics axis, the where keyword will mistakenly be interpreted as a topic identifier. As the following code, though, will not make any sense, this error can be detected at compile time.

In such situations one can always insert * explicitly, allowing TMQL to greedily consume it as the topic identifier (instead of the where keyword).

Naming or other Properties

Apart from these more or less strong forms of identifying a subject, names usually also offer a rather good approach. As names are strings, there are two steps involved: first the string has to be converted into a topic name item, and then this name item has to be affiliated with the topic in question. All this is achieved with

"John Lennon" \ name

Accordingly, any name can be used as well, just to improve our chances to find the appropriate topic:

"John Winston Lennon" \ fullname

All this works as long as fullname is a subclass of name.

This weak form of identification not only works with names. It also works with any property put into occurrences:

1940-10-09 \ birthdate

Of course we also have to expect other people in the results who happen to have the same birthdate as John.

Combinatorial Identification

Depending on your original data you may or may not have reliable identification data. As queries are usually kept separate from the instance data in the topic map, it is advantageous to formulate queries as robustly as possible.

If you have several subject identifiers, you can actually combine them to improve your chances that there is a match with the map you query (shortcut ORing):

<http://guess1.com/> ~ ||
<http://guess2.com/> ~ ||
<http://guess3.com/> ~

If there existed no topic with the first URI as subject identifier, the processor would attempt to resolve the second URI as subject identifier in the map, and so forth.

If, on the other hand, your identification characteristics are so weak that many topics in the map are likely to match one of them, you can use the intersection:

1940-10-09 \ birthdate 
 == 
"Yoko Ono" \ name <- woman -> man

The == operator in TMQL acts as an intersection here, so even if there were several people with the same birthdate, and even if Yoko had remarried, the overall result would most likely contain only john-lennon.
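Under the hood, both combinators can be pictured as simple set operations over candidate topic sets. A toy Python sketch, with all data invented for illustration:

```python
# Toy illustration of shortcut ORing (||) versus intersection (==)
# over candidate topic sets. The lookup data below is invented.
born_1940_10_09 = {'john-lennon', 'john-doe'}     # weak criterion
married_to_yoko = {'john-lennon'}                 # another weak criterion

def shortcut_or(*candidate_sets):
    """Return the first non-empty candidate set (|| semantics):
    later alternatives are only tried if earlier ones fail."""
    for candidates in candidate_sets:
        if candidates:
            return candidates
    return set()

def intersect(a, b):
    """Keep only topics matched by both weak criteria (== semantics)."""
    return a & b

assert shortcut_or(set(), {'john-lennon'}) == {'john-lennon'}
assert intersect(born_1940_10_09, married_to_yoko) == {'john-lennon'}
```

Two weak criteria combined by intersection can thus pinpoint a topic that neither criterion identifies on its own.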

Ontological Commitments

There are a few predefined subjects which TMQL itself assumes to exist. They should have been detailed in TMDM, but are now in the TMDM/TMRM mapping which provides the model semantics for TMDM.

One of the more prominent ones is tm:subject, which stands for everything which is in the map. For that matter, it only includes the topic map items, not any literals. It can be abridged with *.

Navigation

One primary purpose of every query language is to extract content from the backend store. That can be done in two different ways. One is to describe declaratively how the result should look and then match this pattern against the instance data; the bindings generated in this match are then returned to the application. SQL and other pattern matching languages follow this paradigm.

Alternatively, things of interest in the instance data can also be found by choosing a known starting point and then navigating along also known axes to the information to be identified. One representative of this approach is XPath.

As the two approaches complement each other, TMQL adopts both, at least on the user level, where convenience is the primary directive. Both approaches can be used within one query expression. That allows the user to choose the most convenient and therefore the most maintainable way. Effectively both approaches are equivalent and indeed the formal TMQL semantics maps everything into path expressions.

Axes

Obviously the navigation axes provided by TMQL are those insinuated by the TMDM (Topic Maps Data Model). If you have a particular topic, then you can follow one axis leading to all the topic's names; another axis leads to all its occurrences; another delivers all topics which are the type of our original topic; yet another connects to all instances. There are several of these axes, all listed in the specification (section 4.4). Some are only meant to be useful when starting with a topic item; others are meant for associations or occurrences, such as "finding the scope(s) of an association, occurrence or name", "finding all roles of a certain role type in an association", or "finding all players of a certain role type in an association". Other axes are related to addressing (subject identification and location), some to following reification, or to converting topic map items into values and back.

As an example, let us consider the topic john-lennon. If we follow the name axis from there

john-lennon >> characteristics name

then we can expect to get all name items of that topic. If there were a biography occurrence, then

john-lennon >> characteristics biography

would render these occurrence items.

It is worth noting here that TMQL does not make much of a distinction between names and occurrences, and subsumes them both under characteristics. Characteristics here do not include involvements in any associations (as was once historically the case).

To find the type of john-lennon we can follow the types axis:

john-lennon >> types

and expect to see a list containing topic items, such as probably person.

When we have an association in focus, we can ask for its type, or its scope

assoc-2352633773 >> scope

although most of the time their item identifier is elusive.

Control Topic

For some axes it is important to specify a topic to control what is actually asked for, as in the occurrence example above. For other axes the control topic does not matter, as when finding all types of a topic. Such a control topic, if provided, is always interpreted honoring any subclass(es). If our john-lennon topic contained not only a biography but also an early-years occurrence

john-lennon isa person
biography  : http://....
early-years: He was a child when he was young.

and early-years is known to be a subclass of biography

early-years iko biography

then the navigation step

john-lennon >> characteristics biography

will return both occurrences. That feature can be useful to find all occurrences:

john-lennon >> characteristics tm:occurrence

That works because tm:occurrence is predefined by TMQL as a concept from TMDM, and proper implementations have to know that all occurrences in a map are instances of tm:occurrence.

In the same vein, we can ask for all names via

john-lennon >> characteristics tm:name

relying on the fact that tm:name is predefined as well.

All characteristics can also be retrieved:

john-lennon >> characteristics tm:subject

Directions

All axes can be travelled in two directions, forward and backward; it is quite arbitrary, though, which is which. To find all instances of a concept we navigate along the types axis, but this time in reverse:

person << types

To find all subclasses of person we simply reverse the supertypes axis:

person << supertypes

Reversing is also useful to find all involvements of a certain topic in associations. The expression

john-lennon << players victim

would find all associations where john-lennon plays the victim (or any subclass thereof).

Different directions also make sense when navigating between topics on one side, and subject locators and subject identifiers on the other. If you have a topic in your hand, then in the forward direction the locators axis renders all subject locators (or just the one). In the backward direction you start with a URL and get all topics which use that URL as subject locator. The same holds for the subject identifiers.
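Either direction of an axis is just one reading of the same underlying relation. A minimal Python sketch, with association and role data invented here, shows the players axis traversed both ways:

```python
# One relation, traversable in both directions. The association and
# role data below are invented to mirror the examples in the text.
# Each entry: (association id, role type, player topic)
PLAYERS = [
    ('assoc-1', 'victim',    'john-lennon'),
    ('assoc-1', 'aggressor', 'some-aggressor'),
]

def players_forward(association, role):
    """assoc >> players role: from an association to its players."""
    return [p for a, r, p in PLAYERS if a == association and r == role]

def players_backward(topic, role):
    """topic << players role: from a player to its associations."""
    return [a for a, r, p in PLAYERS if p == topic and r == role]

assert players_forward('assoc-1', 'victim') == ['john-lennon']
assert players_backward('john-lennon', 'victim') == ['assoc-1']
```

A real processor would additionally honor role subclasses, as described above; this sketch only shows the directional symmetry.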

Navigation Shortcuts

Many of these navigation movements have shortcuts, so for instance

john-lennon ^

is expanded to the canonical

john-lennon >> types

and therefore returns all types of the topic john-lennon. If we have to extract the homepage occurrence from John, then the shorter

john-lennon / homepage

will expand into

john-lennon >> characteristics homepage >> atomify

Following the player axis into an association and out of it again can be done much more conveniently with

john-lennon <- victim -> aggressor

than with

john-lennon << players victim >> players aggressor

While there are no shortcuts for extracting subject locators and subject identifiers from a topic, some exist for the other direction: to address a topic via one of its subject identifiers, a simple

<http://en.wikipedia.org/wiki/John_Lennon> ~

is sufficient; and if addressing has to happen via a subject locator, then

<http://www.johnlennon.com/> =

can be used to reach the topic which represents this website.

Atomification

An atom in terms of TMQL is a data value with no further internal structure, at least as far as TMQL is concerned. Atoms are all integers, strings, dates, i.e. any value of one of the predefined TMQL data types, or of any type which a particular TMQL installation additionally provides.

The problem with Topic Maps is (a) that according to the TMDM, values can only be string representations (loosely affiliated with a URI indicating the data type), and (b) that values are always part of an occurrence item, which not only contains the value, but also the scope and the occurrence type. In this sense characteristics, i.e. both names and occurrences, are experienced in an ambivalent way: either as the complete item, or only as the value in it.

TMQL resolves this ambiguity first by making each of the involved steps explicit: first there is the extraction of the characteristics item from a topic item; then there is the conversion of this characteristics item into an atom. This last process is named atomification and is also managed by an axis:

john-lennon >> characteristics birthdate 
            >> atomify

The obvious advantage of this explicit chain of navigation steps is that before the atomification process other intermediary steps can take place. One example would be filtering according to the scope

john-lennon >> characteristics birthdate
            [ @ wikipedia ]
            >> atomify

or

john-lennon >> characteristics birthdate 
            [ 0 ] 
            >> atomify

if we were only interested in a single birthdate characteristic.

Ultimately, it is the user who has to decide whether atomification has to be done, or not. Still, TMQL offers a shortcut for the ― much more frequent ― situation that atomification should be done automatically. More about that later.

Reverse Atomification (De-Atomification)

The atomify axis can also be used in reverse. Accordingly, the situation is inverse in that we start off with a literal value and end up with all occurrences (or names) where this literal is used as value.

One example use is to find all characteristics where the string John Lennon is used:

"John Lennon" << atomify

Now, this is not extremely helpful until we continue the chain and compute the topic items to which these characteristics are attached with a certain type:

"John Lennon" << atomify << characteristics tm:name

Before we see how this can be abridged, let us use the same principle when using birthday occurrences:

1940-10-09 << atomify << characteristics birthdate

This time we navigate to all people who are born on Oct 9, 1940; also for this a shorter version exists.

Application-Specific Atomification/Deatomification

Atomification not only applies to characteristics items, but to other items as well. In terms of the TMQL specification, atomifying a particular topic is a 'null-operation', i.e. it leaves the item untouched. Implementations are free to redefine this process and use their own atomification rules.

Consider the query expression

select $p
  where $p isa person

By default, the querying application gets topic items of type person as-is, i.e. as items according to TMDM. If the query were modified to

select $p >> atomify
  where $p isa person

then a TMQL processor would be asked to trigger atomification for all these person topics before they are returned to the application. Still, by default this is a null-operation, so nothing changes by adding the atomification step.

If our application, however, had defined various object classes, such as PERSON, and had configured the TMQL processor to automatically populate object instances of these classes, then it would get PERSON objects without any further ado. Here is how this could look in Perl:

package PERSON;

# here are the methods and constructor

1;

my $q = new TM::QL ('select $p ... ');
$q->register ( serializers => { 'person' => 'PERSON' });
my $results = $q->eval (...); 
	

The first line would create a query object. Before that query is evaluated, the application tells the processor that it wants to associate the class PERSON in the Perl program with the type person in the map. Sometimes these objects are referred to as business objects.

Of course, this is all outside the TMQL specification, as is how de-atomification would work in this case.

Atomification Shorthands

While it may be intellectually satisfying to use axes as conceptual framework to navigate through a Topic Map instance, in everyday situations they are just too cumbersome to write. For this purpose TMQL introduces a number of shorthand notations to alleviate the pain.

To find all biographies of John Lennon it is actually sufficient to write

john-lennon / biography

because TMQL processors will expand this to the canonical form

john-lennon >> characteristics biography >> atomify

Similarly, the shorthand for getting names

john-lennon / name

can be used instead of

john-lennon >> characteristics tm:name >> atomify

There is, unfortunately, a small complication: if a TMQL processor atomified these characteristics immediately, then something like this

john-lennon / name [ @ family ]

will not work: the filter would only 'see' the literals, and these could not be filtered according to the scope, or anything else for that matter.

The regulation is that an 'atomify' navigation movement does not immediately create literal values. It only schedules the literal for atomification, and so postpones the process until the literal is actually needed.

That is the case only in well-defined situations: either the characteristic is passed as a parameter into some other function, be it for arithmetic operations or for comparison, or the characteristic is passed back into the application, or a value is needed because we want to de-atomify it. Here is a slightly artificial example:

john-lennon / birthdate << atomify

First we figure out when John Lennon was born, and then we use the date literal to get all characteristics where it is used as a value. There is also the anti-symmetric shortcut to do de-atomification and to navigate to the involved topics. The expression

john-lennon / birthdate \ birthdate

will return all topics (most likely people) who are born on the same day as John. The \ birthdate expands to << atomify << characteristics birthdate.
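The postponement described above can be mimicked with a thin wrapper object that carries the whole characteristic item around and only surrenders its plain value on demand. Everything below is an invented sketch, not the specified mechanism:

```python
class Characteristic:
    """Invented stand-in for a name/occurrence item: it keeps its
    type and scope so that filters can still inspect them, and only
    'atomifies' to its plain value when a value is actually needed."""

    def __init__(self, ctype, scope, value):
        self.ctype = ctype
        self.scope = scope
        self.value = value

    def atomify(self):
        # deferred until a value is demanded (function call,
        # hand-over to the application, or de-atomification)
        return self.value

# filtering still sees the full item, scope included ...
names = [
    Characteristic('name', 'family',    'Lennon'),
    Characteristic('name', 'wikipedia', 'John Lennon'),
]
family_names = [n for n in names if n.scope == 'family']

# ... and atomification happens only at the very end
assert [n.atomify() for n in family_names] == ['Lennon']
```

If atomification happened eagerly instead, the scope filter above would have nothing left to inspect, which is exactly the problem the scheduling rule avoids.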

The shorthand for the reverse atomification also works for names. This can come in quite handy if one happens to know a name of a certain type and has reason to believe that it is sufficiently unique within the map to identify the topic in question:

"Ringo" \ nickname

Following Reification

According to TMDM you can reify certain things in a map, specifically characteristics, associations and complete maps. In a sense, moving from a topic to the thing it reifies is a bit like zooming.

Let us assume that the queried map contains an association

marriage (man: john-lennon, woman: yoko-ono)

and that this association is reified by a topic two-in-bed which contains more details about this relationship.

If this topic is used as a starting point, then the underlying association can easily be found:

two-in-bed ~~> -> man

~~> is the shortcut for following the reifier axis forward.

If characteristics, i.e. names or occurrences, were reified, this would work exactly the same way. Maps can also be reified, something we will have to revisit later. But as soon as you have a topic which reifies a map, you can zoom into that map and get all its items. The following gets all names of operas in the opera map:

opera-map ~~> // opera / name

Variables

Variables in TMQL are unlike variables in procedural, state-oriented programming languages, in that they are not storage locations whose values are assigned and reassigned during the course of execution. Instead, TMQL follows the tradition of functional programming languages (yes, there is a deeper reason behind this): variables are bound to values at some point and then are married until death tears them apart; death here meaning the end of the variable's scope.

The scope of a variable is a lexical range within a query expression. It is normally quite obvious, especially in the FLWR style where variables are explicitly declared with a FOR clause. In that case their scope extends to the end of the enclosing FLWR expression. Similarly for variables declared with SOME or EVERY clauses: here too their scope extends to the end of that clause.

For path expressions the situation is quite simple: you cannot declare any variables, so scope is not an issue. This is, of course, a double-edged sword. It makes simple queries very simple, but does not scale well with complexity. The only variables of use in path expressions are %_ for the whole map, i.e. all items in it, @_ for the whole incoming tuple, and $0, $1, etc. for the individual components of the incoming tuple. All other variables are treated effectively as constants.

Using the SELECT style, variables are not declared explicitly, but are handled implicitly depending on how a query expression is used; more about that below.

Variable Binding (Implicit vs. Explicit Quantification)

One way variables get their values is that they are specified to range over a sequence of (computed or constant) values. This is the way FOR, SOME and EVERY clauses work:

for $p in // person
...

In this case the variable is explicitly quantified.

Alternatively, a query author can keep the variable unquantified, leaving it to the processor to implicitly let the variable range over all possible values:

select $p
where
   $p isa person

Here $p iterates over all items in the currently queried map and effectively each value is then tested in the WHERE clause against the condition therein. For practical reasons this set of all possible values must be finite; TMQL defines this as the 'map', i.e. all topic, name, occurrence and association items.

In this sense, TMQL operates under a closed world assumption, i.e. it does not regard unknown information as being available outside its universe.

SELECT and free variables

When using the SELECT flavour in TMQL, some care has to be taken where variables are involved. Consider for instance

select $p
where
    $p isa person

Here it is obvious that the variable $p should range over all instances of person and that each of these topics should be returned. More precisely, $p is constrained by the condition in the WHERE clause. If we added a new variable $x in the SELECT clause

select $p, $x
where
  $p isa person

then $x is NOT constrained.

While one could take an orthodox position and interpret this as saying "find all instances of person and pair each of these with ANYTHING within the queried map", this is most likely not what the author had intended. More likely than not, queries such as these are more the result of a typo, an error, a misunderstanding, or a combination of all of the above. Allowing them to be valid may result in very expensive queries.

To avoid such unintentional dramatic consequences, the above expression is ruled to be invalid: a TMQL processor expects all variables appearing in a SELECT clause to be (a) either bound to a value as provided by the context, or (b) to be mentioned explicitly within the WHERE clause.

Should the user really want all things, then he has to say so:

select $thing
where
  $thing isa tm:subject

Special Variables

TMQL knows about a few special variables, although most of the time you might not be interested in them. What is peculiar about them is that some are read-only, so they cannot be redefined with new values, and some are write-only, so that values assigned to them cannot be retrieved.

Anonymous Variable

One such special variable is $_. It can be used inside a WHERE clause when a variable is necessary for a successful match, but where one is not interested in its actual value (read: "do not care"):

select $p / name
where
    is-leader-of (organisation: $_, leader: $p, ...)
  & $p isa person

Here we try to identify all leaders, i.e. persons who have been leading an organisation at some point. We do not overly care here which organisation it is, so we use $_ to signal this to the TMQL processor, and equally important, to the human reader.

Several instances of $_ within the same scope are regarded as independent of each other. So in the query

select $p / name
where
   is-leader-of (organisation: $_, leader: $p, ...)
 & $p isa person
 & is-part-of (whole: mafia, part: $_)

the use of $_ will not have the intended effect, namely that any matched organisation will also be part of the mafia. This also implies that the variable $_ can never be 'read', i.e. used in a SELECT or RETURN clause. The following is invalid

select $_ / name
where
   $_ isa person

Incoming Tuple

Another special variable is @_. It refers to 'the current tuple' as a whole, so the (@_) projection in

john-lennon ( . / name, . / birthdate ) (@_)

is completely redundant.

More interesting are the individual tuple components, which are ordered from left to right to be bound to $0, $1, $2, etc. To flip two columns, you could use

john-lennon ( . / name, . / birthdate ) ($1, $0)

but the variables can certainly be used as starting points for any navigation or computation:

john-lennon / shoesize ( $0 + 10, $0 \ shoesize )

That path expression would first compute John Lennon's shoesize and would then create a pair where the first value is the shoesize increased by 10 and the second component is all topics which have John Lennon's shoesize.
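To see the mechanics of tuple projections outside TMQL, here is a minimal Python sketch (an illustration only, not part of TMQL) which models the components $0, $1, ... as list indices:

```python
# Sketch (not TMQL itself): a tuple projection such as ($1, $0) re-orders
# each incoming tuple; component variables $0, $1, ... correspond to
# Python indices 0, 1, ...

def project(tuples, *indices):
    """Re-order each incoming tuple according to the given component indices."""
    return [tuple(t[i] for i in indices) for t in tuples]

# Incoming tuples, e.g. as produced by ( . / name, . / birthdate ):
incoming = [("John Lennon", "1940-10-09")]

flipped = project(incoming, 1, 0)   # corresponds to the projection ($1, $0)
print(flipped)                      # [('1940-10-09', 'John Lennon')]
```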

Tuple variables which are NOT bound to values lead to an error, so

(1, 2) ( $2 )

will make the processor terminate.

Binding Stack

Like any other query language, TMQL has to deal with two phases: one in which values in the queried map are identified and a second where that content is used to produce new content.

To convey these values from the incoming to the outgoing phase, they are bound to variables. At any particular point in time a query processor will look at a variable binding set, i.e. a set of variables and the values bound to them.

Since variables are scoped, i.e. they are only visible in well-defined parts of the query expression, some variables will get their values from outer expressions, and variables defined only in a nested scope will get their values there. Effectively, a processor maintains a stack of such variable bindings. Whenever a nested query expression introduces a new variable (implicitly or explicitly), such a binding will be put onto that stack. Once the nested subexpression is completely evaluated, that binding will disappear from the stack.
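This stack discipline can be sketched in a few lines of Python; the class and variable names below are made up for illustration:

```python
# Sketch: how a processor might maintain a stack of variable binding sets.
# Entering a nested scope pushes a new binding set; leaving it pops the set.
# Lookup walks the stack from the innermost to the outermost scope.

class BindingStack:
    def __init__(self):
        self.frames = []                 # list of dicts, innermost last

    def push(self, bindings):
        self.frames.append(dict(bindings))

    def pop(self):
        self.frames.pop()

    def lookup(self, var):
        for frame in reversed(self.frames):
            if var in frame:
                return frame[var]
        raise KeyError(var)

bs = BindingStack()
bs.push({'%_': 'context-map'})           # outer scope
bs.push({'$p': 'john-lennon'})           # nested FOR clause introduces $p
assert bs.lookup('$p') == 'john-lennon'
assert bs.lookup('%_') == 'context-map'  # outer bindings remain visible
bs.pop()                                 # nested scope ends; $p disappears
```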

Context Map

Another special variable is %_; it stands for the currently queried map, or, to be more precise, for all the items in the context map. When an application invokes a TMQL processor it may pass the map to be queried into the query process simply by associating it with %_ in the first place. Then it is not necessary to mention the map inside the query expression any further, as in

select $p / name
where
    ....

because the default (also in path expressions and using FLWR) is that %_ is used anyway:

select $p / name
from %_
where
    ....

You can also explicitly name the map which should be queried, say, one which has been affiliated with the variable %mymap at some point:

select $p / name
from %mymap
where
    ....

Doing that will implicitly create a new variable %_ which is bound to the contents of %mymap, so that throughout the rest of the query that is used as the context map.

Another way to change the map to be queried is to follow the reification axis, i.e. to zoom into the map. For this let us assume we had a topic, say, opera which is defined within the map (or in the environment map). If that topic reifies a map, then the following will query that reified map:

select $p / name
from opera ~~>
where
    ....

How the TMQL processor and the underlying Topic Map infrastructure handles map reification is outside the scope of TMQL.

It can even go so far that we do not explicitly need a topic which stands for a complete map. A URL, interpreted as subject identifier, can also serve this purpose:

select $p / name
from <file:/home/user/mymap.xtm> ~ ~~>
where
    ....

When the TMQL processor evaluates the FROM clause, it does the usual thing: it first finds a URL literal, followed by ~. This indicates a topic with that URL as subject identifier. TMQL processors can temporarily assume such a topic in the environment map. More importantly, the reification step will lead from the topic to the map, and so will ask the processor to consume the map and make it the current context map %_.

NOTE: This is an experimental feature.

The map to be queried is not necessarily one which is statically stored in a file or in a database, although that might be the most common case. It is also possible to compute a map before it is queried; a map is just a set of items anyway.

As one example we again query the map bound to the variable %mymap, but this time we add to it an opera ontology which happens to be stored on a remote web server:

select $opera / name
from %mymap ++ <http://far.aw.ay/opera.ctm> ~ ~~>
where
    ....

'Adding' here means of course merging. Whether a TMQL processor will try to download this ontology over and over again, or will cache it locally, we do not care; these are all operational details.

Environment, Context and Effective Map

Whenever a query is evaluated, it is done in the context of a map. In many cases such a map will be passed in as parameter from the application, but it is also possible to pass in several maps and ― for a particular subexpression ― select one of them to make that the context map. If it becomes necessary, one can access this context map via a special variable, %_; but all operations by default are referring to it.

In any case there is also another map which is always present: the environment map. It contains everything the TMQL processor has to know, starting from the concept of a data type and that of a function (or predicate), up to all the data types the implementation offers, together with their operators and functions.

The environment is not necessarily fixed. For every (sub)expression new environmental information can be added. In the most primitive case such local environments will define prefixes together with a namespace URI, so that in that subexpression these prefixes can be used to form QNames in that namespace. Semantically speaking, though, such prefixes are nothing else than a shorthand to address an ontology. That, as with PSI sets, will simply contain a list of subject definitions. Or, more generally, such an ontology might include a whole taxonomy, i.e. not only the subject definitions, but also a type system for them. Or, even more generally, the ontology might contain predicates, constraints or functions to describe the structure of the domain in question.

TMQL neither forbids nor mandates any of this. Its only expectation is that all ontological information comes in the form of a topic map: prefixes for namespaces (vocabularies) are simply topics representing the whole ontology, with the namespace URI as subject indicator. Functions are topics of a certain type carrying the function body with them, and so are predicates (constraints).

Once the environment map is known for a particular subexpression, it will be 'seen together' with the current context map, or ― in Topic Maps speak ― both will be merged for the duration of that expression. If you ever want to access it, you can use the variable %% which is bound to it.

Controlling Variable Bindings

As we have seen above, variables can be bound to values. Used naively, this can lead to incorrect queries. As an example let us find all pairs of albums which share the same producer. In a first attempt we write:

select $album1, $album2
where
   is-produced-by ($album1: production, $producer: producer)
 & is-produced-by ($album2: production, $producer: producer)

If you have worked with declarative languages before, you may immediately spot the problem: for the TMQL processor $album1 and $album2 are completely different variables; they might be bound to the same or to different values for the same producer, the processor does not care.

This does not work for us if we want different albums. The usual escape hatch is to have something like this:

select $album1, $album2
where
   is-produced-by ($album1: production, $producer: producer)
 & is-produced-by ($album2: production, $producer: producer)
 & not $album1 == $album2

Not only is this ugly as hell, in 100% - ε of all cases developers will forget to add it (I know I will). Nor does it actually reflect the developer's intention; and it does not look too elegant if you have to compare three or more such variables.

TMQL has a rather eccentric way to fine-control when variables are allowed to match anything or when they must be bound to something different. It is using primes after the variable names:

select $album, $album'
where
   is-produced-by ($album : production, $producer: producer)
 & is-produced-by ($album': production, $producer: producer)

Now we have used two variables which only differ by the number of primes (') appended. TMQL treats them as two distinct variables, but with the additional semantics that ― within one and the same binding ― they cannot be bound to the same value.

There is no limit to the number of primes, so should we ― by a bizarre twist of fate or customer requirements (whatever comes first) ― need three different albums, this can be achieved elegantly:

select $album, $album', $album''
where
   is-produced-by ($album  : production, $producer: producer)
 & is-produced-by ($album' : production, $producer: producer)
 & is-produced-by ($album'': production, $producer: producer)
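The prime convention amounts to an all-distinct constraint over the bindings of $album, $album' and $album''. A small Python sketch of that constraint (the producer and album names are made up):

```python
# Sketch: primed variables as an all-distinct constraint. Candidate
# bindings in which two primed variables carry the same value are dropped.
from itertools import product

albums_of_producer = ['revolver', 'abbey-road', 'let-it-be']  # made-up data

def distinct_bindings(values, n):
    """All n-tuples of values in which no two components coincide."""
    return [t for t in product(values, repeat=n)
            if len(set(t)) == n]

pairs = distinct_bindings(albums_of_producer, 2)
# corresponds to 'select $album, $album'': bindings such as
# ('revolver', 'revolver') do not survive
assert all(a != b for a, b in pairs)
```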

Query Context

While a query is executed, the TMQL processor will keep track of which variables are bound to which values. This data structure is organized as a stack: whenever a new variable binding, or more generally a set of such bindings, has been created, the whole set will be pushed onto this stack. With that stack the nested inner part of a query expression is evaluated and the results are collected. Once this is done, the last binding set is removed (popped) from the stack and possibly a new binding set will take its place to repeat the evaluation of the nested query expression.

That way ― while binding sets vary ― new results will be assembled into a larger result. Once the query is complete, again the last binding set will be removed.

To kick off the whole process, there must be an initial binding set which the calling application (or the TMQL infrastructure) provides. While the application is free to pass in any number of variables together with values, the only thing the TMQL processor really needs is a binding of %% to some map. That map will be interpreted as the environment map. From then on the TMQL processor will create new binding sets on its own.

Environment Map

@@@ TO BE MERGED @@@ One special variable is %%, the current environment. It contains everything the processing infrastructure of a TMQL processor has to offer: predefined data types, functions, predicates and also other ontological information. And since we are in Topic Map-land, that whole environment is modelled as a topic map. Data types are topics, and so are functions, predicates and external vocabularies (namespaces).

Since it is a map, we can access the information by querying %%. The following lists all functions (predefined or otherwise), i.e. their names together with their description:

select $f ( . / name, . / description )
from %%
where
   $f isa tmql:function

The only thing to be done is to switch the context map to the environment map and to query for the function topics. All of these must be instances of tmql:function, one of the types TMQL defines in its own ontology.

To find all namespace URIs we first have to find all topics in the environment map which are ontologies, i.e. which represent a whole namespace. Once we have such a topic, we extract its subject indicator(s).

select $o, $o >> indicators
from %%
where
   $o isa tmql:ontology

TMQL processors will all have to provide the ur-environment, so that is used by default. But it may certainly be possible to allow applications to change it:

my $query   = new TM::QL ('select $p from ... ');
my $map     = new TM     ('file:here.ctm');
my $results = $query->eval ('%_'        => $map, 
                            '$shoesize' => 42,
                            '%%'        => $my_env);

Conditions

WHERE clauses and filters of path expressions always contain boolean expressions. Their main purpose is to describe declaratively a particular pattern within the queried map. The processor's task is to find all combinations of variable bindings which make the boolean expression true.

Boolean expressions can be combined with boolean operators. These are all non-short-circuit, i.e. the order of the individual expressions in an AND or an OR does not matter.

Free Variables and Exists Semantics

In the query

select $p / name
where
   $p isa person
 & plays-instrument (player: $p, instrument: drums)

the free (and only) variable is $p. A TMQL processor will therefore existentially quantify this variable, i.e. let it range implicitly over all items of the map. Those constellations of values which make the boolean expression in the WHERE clause true will be passed on to later processing stages in the form of a variable binding set. All other bindings will be discarded.

Of course, letting a variable range over all items in the map is just the specification's formal way of saying that we do not care how an implementation does it, so that implementations can develop clever and fast mechanisms, such as indices, to find quickly what is actually needed.

In the above case, an implementation may recognize the $p isa person part and will ― instead of actually looping over all topics and associations in the map ― use an index delivering person topics fast. Maybe it also maintains an index over plays-instrument associations and will simply compute an intersection of the two indices.

What is also worth noting is that free variables will therefore only take values from the map, but not values from primitive data types, such as integer. As a consequence the condition below can never be satisfied:

...
where
    $p isa person
 & $p / birthdate > $d

While $p is ranging happily over all person topics, there will be no map item for $d which will make the second condition true. This should be seen as another safety feature.

The mere fact that a variable appears in a WHERE clause does not mean it is free. In the FLWR query style all variables are declared together with the range they should iterate over:

for $p in // biological-unit
where
   $p isa person
 & plays-instrument (player: $p, instrument: drums)

Now $p is non-free and the processor will not automatically let it range over all possible values.

Filters

Once a tuple sequence has been produced, a filter can be applied to select only those tuples which satisfy a certain criterion. Accordingly, filters are postfixed to the expression producing the tuple sequence. To find all instruments Paul McCartney plays which are heavier than 50 kg, one might write:

paul-mccartney <- artist -> instrument [ . / weight > 50 ]
	  

While filters can contain any (primitive) boolean condition, they are usually quite short. To allow for even more conciseness a number of shortcuts have been introduced. To filter for scope a simple [ @ my-scope ] will do. To filter for certain types [ ^ my-type ] is enough.

There are also shortcuts when it comes to filtering along the position in the incoming sequence. To get the first tuple from a list it is sufficient to write [0]. Slices are also covered, although with a small gotcha for Perl and Java developers: only the lower bound is inclusive. The filter [3 .. 5] will select the tuples with indices 3 and 4, but not the one with index 5.
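This half-open convention happens to coincide with Python's slice semantics, which allows for a compact illustration (the tuple contents below are made up):

```python
# Sketch: the slice filter [3 .. 5] behaves like a half-open interval,
# the same convention as Python's seq[3:5]: lower bound inclusive,
# upper bound exclusive.

tuples = ['t0', 't1', 't2', 't3', 't4', 't5']

selected = tuples[3:5]       # indices 3 and 4, but not 5
assert selected == ['t3', 't4']

first = tuples[0:1]          # the filter [0] picks the first tuple
assert first == ['t0']
```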

Null, Undef, False and True

Primitive boolean expressions might be values, among them atoms such as undef, or the boolean values true or false. Consider the ― rather synthetic ― example

select 1
where
   undef

This will be expanded to

select 1
where
   exists undef

and that further to

select 1
where
   some $_ in undef satisfies not null

Now that undef is definitely a value ― albeit an undefined one ― there exists a binding for $_. And since not null is always TRUE, the whole clause will evaluate to TRUE.

What applies to undef also applies to all constant values, so as a consequence it also applies to the boolean atoms true and false:

select 1
where
   false

That may be somewhat non-intuitive for programmers, who would expect this not to return anything.

It is the constant null which actually fills the role of indicating falsehood. It expands to the empty sequence ().
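The underlying exists semantics can be modelled in a few lines; the following Python sketch (not TMQL; UNDEF is a made-up stand-in) shows why undef and false still count as true, while null does not:

```python
# Sketch: a condition holds exactly when its value sequence is non-empty.
# undef is still a (one-element) sequence, while null is the empty
# sequence ().

UNDEF = 'undef'                   # stand-in for the undef atom

def exists(seq):
    """A sequence counts as true exactly when it contains something."""
    return len(seq) > 0

assert exists([UNDEF]) is True    # 'where undef' succeeds
assert exists([False]) is True    # and so does 'where false'!
assert exists([]) is False        # only null, the empty sequence, fails
```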

AND and OR

@@@@

SOME and EVERY Quantification

To express a condition which is only supposed to hold for a defined set of variable bindings, TMQL offers the SOME clause. If our universe of discourse contained several people, but only one single musician, then

some $p in // person
  satisfies $p isa musician

would be true.

An interesting corner case is where there are no persons in the first place; here the semantics rule that the whole condition is false.

The SOME clause is actually only syntactic sugar and can always be rewritten as a FLWR expression. In our case this would be

for $p in // person
where
  $p isa musician
return $p

That expression returns a non-empty tuple sequence exactly when there is at least one person who is a musician.

The SOME clause is effectively generalizing the exists semantics of value comparison. To make this obvious we observe that the condition

where
   $p <- artist -> instrument / name == 'Piano'

is equivalent with the more longwinded

where
   some $n in $p <- artist -> instrument / name 
     satisfies $n == 'Piano'

The EVERY clause is syntactic sugar on top of the SOME clause and exists only to spare query authors from twisting their brains. If someone watches too many Australian Idol TV shows, then they might write:

every $p in // person
   satisfies $p isa musician

Since TMQL assumes a closed world, this can be equivalently transcribed into

not (some $p in // person
     satisfies not ($p isa musician) )

which is exactly what the formal semantics does.

One of the consequences is, though, that if there is not a single person in the map, then the whole condition is true.
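This EVERY/SOME duality, including the vacuous-truth corner case, can be checked mechanically. A small Python sketch over a made-up closed world:

```python
# Sketch: the EVERY -> not SOME ... not rewriting corresponds to the
# classical duality all(p(x)) == not any(not p(x)).

persons = ['john', 'paul', 'ringo']
musicians = {'john', 'paul', 'ringo'}     # made-up closed world

def is_musician(p):
    return p in musicians

every = all(is_musician(p) for p in persons)
rewritten = not any(not is_musician(p) for p in persons)
assert every == rewritten

# Corner case: with no persons at all, EVERY is (vacuously) true.
assert all(is_musician(p) for p in []) is True
```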

Quantified Quantifiers

There are also use cases where it is necessary to give upper and lower limits on how many matched patterns exist. If we had to look for girlie power bands, i.e. groups with at least 5 female members, then a query

where
   $group isa group
&  at least 5 $g in $group -> member
      satisfies $g isa female

can achieve this. Of course, there is also a way to constrain the upper bound, and that is simply done by using at most N. As with the lower bound, N must be a positive integer greater than 0.

These clauses may seem superfluous, as one might achieve the same constraint by counting:

fn:count ( $group -> member [ . isa female ] ) >= 5

but there is a subtle, but important difference: with count one is effectively asking to compute all values and then count them, to compare the count with the lower or upper limit. A quantified clause, in contrast, allows the processor to stop as soon as the bound is decided.
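The difference can be sketched with a lazy sequence in Python; the member generator and helper below are made up for illustration:

```python
# Sketch: why 'at least 5' can be cheaper than fn:count(...) >= 5.
# The quantifier may stop as soon as the bound is reached, whereas
# counting forces the whole sequence to be computed.
from itertools import islice

def members(group_size):
    """Lazily produce (hypothetical) matching members of a band."""
    for i in range(group_size):
        yield f'member-{i}'

def at_least(n, seq):
    """True as soon as n elements have been seen; never reads further."""
    return len(list(islice(seq, n))) == n

assert at_least(5, members(1_000_000)) is True   # stops after 5 items
assert at_least(5, members(3)) is False
```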

Optional Matching

Conditions do not allow optional matching, i.e. setting up subconditions which may, or may not, be true. SPARQL, the query language for RDF-based data, has to have this as it is purely pattern oriented. In TMQL, path expressions provide a way to deal with optional information: one just tries to navigate to the relevant parts and, should the result be empty, one can choose whether a default value such as undef should be used:

select $p / name, $p <- member -> group / name || undef
where
   $p isa person

In the example above we list all persons in our map and additionally check whether they are part of a (music) group. If a person happens to be no group member, then undef will take up that value's place. That avoids the person being completely discarded for lack of a membership.

Association Predicates (Part I)

Every so often one needs to constrain topics by their involvement in associations. If we needed to find all musicians who play drums then the following may deliver this:

select $p / name
where
   $p isa person
 & plays-instrument (artist: $p, instrument: drums)

What we are actually looking for are all associations of type plays-instrument, where there is one player $p who plays the role artist and another player drums for the role instrument.

A TMQL processor, however, will interpret this a bit more abstractly, in that it also allows matching associations to be of any subtype of plays-instrument; the role types in matching associations may also be subtypes of those specified (artist and instrument). This implies that an association in a map

conducts (conductor: karajan, orchestra: berlin-philharmonic)

will be successfully matched by the predicate as long as conducts is a subclass of plays-instrument, conductor subclasses artist and orchestra subclasses instrument.

What we are also implicitly saying with the above predicate, is that there must not be any other role in such matching associations. An association with other roles, such as

conducts (conductor: karajan, 
          orchestra: berlin-philharmonic, 
          concert  : beethovens-ninth)

should be dismissed. If this is not desired, then it has to be signalled to the TMQL processor that other roles may well exist when matching with the predicate. This is achieved by adding the ellipsis ... as the last player:

select $p / name
where
   $p isa person
 & plays-instrument (artist: $p, instrument: drums, ...)

Association Predicates (Part II)

While association predicates look like a special language feature, they are actually path expressions in disguise. The predicate invocation

plays-instrument (artist: $p, instrument: drums, ...)

is interpreted as: look for all associations of type plays-instrument (or its subtypes) which have one role artist (or any of its subtypes) whose player coincides with the value of $p, and another role instrument with the player drums. This can be formulated as a path expression quite easily:

// plays-instrument [ . -> artist == $p ] [ . -> instrument == drums ]

Since the ellipsis at the end indicates that we do not care about other roles, both forms are equivalent.

Things get slightly more complicated if the ellipsis is missing, i.e. only the two roles and no further ones are allowed. This too can be translated into a path expression, albeit a less obvious one:

// plays-instrument [ . -> artist     == $p ]
                    [ . -> instrument == drums ]
                    [ not (. >> roles -- artist     << superclasses
                                      -- instrument << superclasses ) ]

The only thing which has to be changed is the last filter. It first extracts all roles from the association under consideration. From this list it then deducts first the artist (and all its subclasses) and then the instrument (and its subclasses). If there is any other role left, then the association does not pass the test.
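The role deduction in that filter is essentially a set computation. A minimal Python sketch, in which the subclasses helper and the role hierarchy are made up stand-ins for the << superclasses navigation:

```python
# Sketch: the last filter as set arithmetic over role types. The
# hierarchy below is invented; subclasses() stands in for the
# << superclasses navigation of the queried map.

def subclasses(t):
    hierarchy = {'artist':     {'artist', 'conductor'},
                 'instrument': {'instrument', 'orchestra'}}
    return hierarchy[t]

def only_expected_roles(roles):
    """True if no role remains after deducting artist and instrument
    (including all their subclasses)."""
    leftover = set(roles) - subclasses('artist') - subclasses('instrument')
    return not leftover

assert only_expected_roles({'conductor', 'orchestra'}) is True
assert only_expected_roles({'conductor', 'orchestra', 'concert'}) is False
```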

Prefix Handling

A TMQL processor cannot operate in a vacuum. Inside a query expression it has to be possible to use outside information, such as data types, predefined or not, and their related functions; also ontological information about the map being queried must become available, be it just the taxonomy or be it something which also includes rules and constraints in some sort of rule language, or even logic. What it also involves is that it must be possible for a developer to add his own functions, predicates and concepts to a TMQL expression as needed. This is all collected in the processing environment.

Conceptually speaking, all of the above can be regarded as ontological information and ― since we are still in TM-land ― represented as a topic map. To adopt this viewpoint, all predefined concepts have to be thought of as topics, possibly organized into classes and possibly connected via associations.

Prefixes for Namespaces

Using whole URIs to identify topics (actually the subjects they stand for) certainly clutters queries. Using prefixes for namespaces is a comfortable way to shorten queries considerably, so TMQL also uses this mechanism; albeit, not in a syntactic way, like in XML, or RDF, but more appropriately here, in a semantic sense.

It is straightforward to view namespaces as (external) vocabularies. One example of this would be Wikipedia which itself is organized in a topic-oriented way, giving many concepts a distinct URL.

If we plan to make many references to topics via subject identifiers from the Wikipedia URL space, then using a prefix for http://en.wikipedia.org/wiki/ can help to keep queries readable. Such a prefix can easily be declared in the environment map, just before the query expression it is meant to be visible in:

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ CTM?
wp isa tmql:ontology ~ http://en.wikipedia.org/wiki/
"""
select $p / name
where
   $p / birthdate <= wp:John_Lennon / birthdate

The environment map here contains only a single topic, one with item identifier wp and subject indicator http://en.wikipedia.org/wiki/. The fact that this topic is an instance of a tmql:ontology (more about that later) makes the TMQL processor do only one thing: it will register that wp can be used as prefix in QNames in followup query expressions. The namespace bound to this prefix is that Wiki URL.

If now inside a query expression an item reference is a QName with such a prefix, that will expand to the subject identifier provided for it, in our case http://en.wikipedia.org/wiki/John_Lennon. If we had a topic in our queried map which is using this as subject identifier, then we have successfully addressed our topic.

These prefixes also work inside IRI literals, where QNames are used as shorthand, such as in "wp:John_Lennon" to address the Wikipedia page about John Lennon itself.
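The expansion step itself is mechanical; a minimal Python sketch (not a prescribed API, just the idea) using the wp prefix from above:

```python
# Sketch: QName expansion against registered prefixes. The wp prefix is
# the one declared via the tmql:ontology topic in the environment map.

prefixes = {'wp': 'http://en.wikipedia.org/wiki/'}

def expand(qname):
    """Expand prefix:local into the full subject identifier."""
    prefix, local = qname.split(':', 1)
    return prefixes[prefix] + local

assert expand('wp:John_Lennon') == 'http://en.wikipedia.org/wiki/John_Lennon'
```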

Prefixes for Predefined Namespaces

We can drive the above prefix mechanism further by interpreting the namespace indicated by the URI as a namespace known to the TMQL processor. Obvious examples here are all the namespaces for those things the TMQL processor has to know about in the first place, such as the namespace for TMDM concepts (topic, association, ...), one for all the data types it offers (integers, etc.), including all the functions and operators in it; and, of course, one namespace for TMQL itself, where it defines things like tmql:ontology we have already used above.

How a processor knows certain namespaces ― apart from the predefined ones ― and how it can learn about others, is outside TMQL itself. But it should be clear that such a mechanism is an excellent way to extend existing processing environments.

As one example consider the following Python code

import math
import TMQL                # hypothetical TMQL binding

proc = TMQL.processor()
proc.registerNS ('http://maths.is.great/', math)

proc.eval ('select ....')

In this hypothetical TMQL implementation we would first create a processor object. Before it is used to evaluate a TMQL query, the application registers a library under a particular URI. In the query itself we declare a topic which uses this URI as subject identifier:

"""
math isa tmql:ontology ~ http://maths.is.great/
"""

The processor will register that math is a new prefix. It will also look at the connected namespace URI and realize that this very URI has been used before by the application to register a whole Python mathematics library. If a function math:sqrt is now referenced in a follow-up query expression, the processor will simply invoke this function from that library.
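
The dispatch step itself might be sketched like this (hypothetical tables; in the example above registerNS would have populated the library registry):

```python
import math

# Hypothetical tables: registerNS maps namespace URIs to Python modules,
# the environment map binds prefixes to namespace URIs.
libraries = {'http://maths.is.great/': math}
prefixes  = {'math': 'http://maths.is.great/'}

def invoke(qname, *args):
    """Resolve a prefixed function name against a registered library and call it."""
    prefix, name = qname.split(':', 1)
    library = libraries[prefixes[prefix]]
    return getattr(library, name)(*args)

print(invoke('math:sqrt', 16.0))   # 4.0
```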

Prefixes for External Maps

In an ideal world, the queried map will contain all information necessary to produce meaningful results. But for practical reasons this will not always be the case. Sometimes bits and pieces of information have to be added explicitly; sometimes a whole vocabulary or even a type system has to be added to the queried map.

So, before a query can be executed, this missing knowledge has to be added, i.e. merged into the currently queried map. Such additional domain knowledge can simply be more topics or more associations. It can also include additional functions and predicates.

In any case, there are two options here: either that knowledge is stored in an external resource, a local file or a remote server; or that knowledge is directly embedded into the query. Which one to choose is simply a matter of convenience.

Let us first assume that we only need a little additional information which we would like to hard-code into the query, namely that Tom Waits and Kathleen Brennan have composed an opera, a fact missing from the queried map. Accordingly, the environment map contains a topic which itself carries a topic map in one of its occurrences:

tomwaits-info isa tmql:ontology
! name: Tom Waits Info
description: just a few bits and pieces we need for the query
tmql:body: """

  composed-by
     composer: tom-waits kathleen-brennan
     opus    : opera-alice

  opera-alice isa opera
  ! name: Alice

  # subject indicators for all topics go here

"""

Now that tomwaits-info is known to stand for a topic map we can use the zoom operation to reach that map and add it to the context map %_:

select $opera
from %_ ++ tomwaits-info ~~>
where
   $opera isa opera
 & composed-by (composer: tom-waits, opus: $opera)

If we were diligent enough to make sure that all appropriate topics of the original map are merged with the little information in our inlined map, then the opera Alice would be part of the result.
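
The merging itself can be illustrated with a toy model in Python, where a map is just a dict of topics keyed by identifier (a real processor of course merges full TMDM items):

```python
# Toy merge of an inline environment map into the queried map;
# topics with the same identifier are merged, others are added.
def merge(queried, extra):
    merged = {t: dict(props) for t, props in queried.items()}
    for topic, props in extra.items():
        merged.setdefault(topic, {}).update(props)
    return merged

queried = {'opera-alice': {'isa': 'opera'}}
extra   = {'opera-alice': {'name': 'Alice'},
           'tom-waits':   {'isa': 'composer'}}

merged = merge(queried, extra)
print(merged['opera-alice'])   # {'isa': 'opera', 'name': 'Alice'}
```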

Alternatively, we could also have stored the additional information in a separate resource, say, in a file tomwaits-info.ctm. In that case the URL file:/where/ever/tomwaits-info.ctm serves as a perfect subject identifier for the map therein. The environment map will now contain:

"""
tomwaits-info isa tmql:ontology ~ file:/where/ever/tomwaits-info.ctm
""

Whenever we ask the TMQL processor to zoom into this map with tomwaits-info ~~>, it will realize that there is no inline tmql:body, but instead a subject identifier. It will resolve this resource and consume it as a CTM stream (which might be the default format). Once it has been successfully loaded, it can be merged with the queried map.

As a minor variation of the theme we can also consider formats other than CTM, say, LTM.

"""
ltm           isa tmql:ontology  ~ http://www.ontopia.net/download/ltm.html

tomwaits-info isa tmql:ontology
              isa ltm:instance   ~ file:/where/ever/tomwaits-info.ltm
""

What has changed from before is that the Tom Waits information is now additionally an LTM instance, in other words a stream according to LTM. That it is LTM and not something else we tell the TMQL processor by declaring the ltm namespace on the first line. It is the provided URI which will help the TMQL processor to find the LTM deserializer software. That can be hard-coded into a platform which uses LTM frequently. But as in the example with the math library before, the application could be put in charge of this as well. In that case it might be responsible for binding a deserializer object to the namespace:

from Ontopia.TopicMap import LTM   # hypothetical LTM binding
import TMQL

proc = TMQL.processor()
proc.registerNS ('http://www.ontopia.net/download/ltm.html',
                 LTM.Deserializer())

proc.eval ('select ....')

What if an ontology exists, but is not in Topic Map format but, say, in OWL? TMQL processors are free to convert from other notations and paradigms into the Topic Map universe:

"""
owl   isa tmql:ontology ~ http://www.w3.org/2002/07/owl#

opera isa tmql:ontology
      isa owl:Ontology  ~ http://www.ope.r.us/opera.owl
"""

In a sense the topic map representing the opera ontology is virtual.

Ur-Environment

To bootstrap all knowledge a TMQL processor has about the world, we first have to define TMQL's own concepts, such as 'function', 'predicate' and 'ontology'. This is done by the standard: @@@@@@@@@@@@@@@@ CTM

tmql isa tmql:ontology ~ http://www.isotopicmaps.org/tmql/1.0

to affiliate the TMQL vocabulary with that subject identifier.

At evaluation time the usual happens: the TMQL processor will analyze the environment map, will find a topic (tmql) with a certain subject identifier, will recognize that as a predefined one and will provision the TMQL concepts for further use.

Apart from some minimal TMQL concepts, a TMQL processor has to understand types, specifically the adopted primitive types, such as integer or decimal. This information, too, is organized in the initial environment map, including facet information about these types, such as their serialization syntax (how a value is written as text, what the canonical representation is). And types are certainly organized into a taxonomy (type system), so that, say, integer is modelled as a subclass of decimal.

Functions

Functions in TMQL work quite conventionally at first sight: if a function is invoked, the parameters are evaluated first and then passed into the function. The body of the function then somehow produces a result. TMQL functions are pure in that they cannot modify anything on the outside, at least as far as the TMQL processor is concerned. It is still possible to create side-effects by calling functions from external libraries, and these are free to do whatever they are programmed to do.

To allow a function to be invoked, it has to be declared first. For the set of predefined functions this is the task of the processing environment. @@@@@@@@@@ NO OWN FUNCTIONS...@@@@

Semantically speaking, functions define a relationship between values. With that, functions describe properties and constraints of the application domain. In this sense they are ontological knowledge.

Side-Effect Free

@@@

Parameters

f (@l) handed in as list of simple things, f (%_) list of tuples

Named vs. Positional Parameters

Especially with functions having few parameters, it can be cumbersome to always explicitly name the formal parameters (named parameter affiliation). TMQL functions can also get their parameters using positional parameter affiliation.

To demonstrate this, we rewrite the function nr-albums, replacing all variables which are meant to be parameters with $0, $1, etc:

"""
nr-albums isa tmql:function
description: computes number of albums for a certain person
tmql:return: fn:count (// albums [ . <- opus -> author == $0 ])

"""

The function can then be invoked as nr-albums ($p).

Functions are not restricted to producing simple values. As the body of a function can be anything which returns content, a function can also return whole tuple sequences. It can also return TM or XML content, as in the following example:

get-xml-albums isa tmql:function
description: produce an XML fragment with all albums
tmql:return: "
  <albums>{
  for $a in // album return
      <album>{ $a / name }</album>
  }</albums>
"

Whenever the function is invoked, it will create an XML fragment carrying an <albums> node; embedded in it are <album> nodes with the album title.

Predefined Functions

As TMQL offers predefined data types (integer, decimal, etc.) and also defines its own (tuple sequences), it will also provide functions and operators for values of these types. The current thinking is that all relevant functions and operators from XQuery 1.0 and XPath 2.0 are included. Needless to say, this would be cool, but it also puts quite some burden on implementors, as this list is massive.

In any case, both the functions and the operators will be preloaded for a TMQL processor:

@@@@@@@@@@@@@@ NEW SYNTAX @@@@@@@@@@@@

"""
fn isa tmql:ontology ~ http://www.w3.org/TR/xpath-functions#

op isa tmql:ontology ~ http://www.w3.org/TR/xpath-functions#
"""

just to make sure that the prefixes fn and op can be used inside a query:

select op:numeric-add ($p / age, 20)
where
  ....

(Of course, this can be written more shortly as $p / age + 20, as the infix operator + is mapped onto op:numeric-add.) Which operators can be used as prefixes, infixes or postfixes will be defined in the TMQL standard.

External Libraries

TMQL also includes a rather abstract extension mechanism to allow TMQL infrastructures to add function libraries. For TMQL a suite of functions (and the related types and objects) is 'just an ontology'.

So if we had a Python library for HTTP support, then a TMQL processor might pick up on the following subject identifier

"""
http isa tmql:ontology ~ http://my.name.space/http/
"""

because the application linked the library with it at some earlier point. Here is an example of how this might pan out in Python:

import httplib
import TMQL                # hypothetical TMQL binding

proc = TMQL.processor()
proc.registerNS ('http://my.name.space/http/', httplib)

When the TMQL processor encounters a function invocation such as http:get ('http://www.google.com') it will first associate the prefix http with 'some ontology' represented by a topic. It will then find that topic in the environment and will check ― well, at least once ― its subject identifier. Since that matches the URI of one of the registered libraries, it will try to call get there. The remaining details are all a local matter for the TMQL infrastructure.

A small variation of the scheme is to have the functions not in a 3rd-party library, such as in Java, Python or better Perl; instead the external function library uses TMQL itself. In that case all functions must be defined as topics within a map, say, using CTM, LTM or AsTMa: @@@@@@@@@@@@@@@@

...
group-size isa tmql:function
desc: returns size of a Pop group
tmql:return : { fn:count ($0 <- group -> member) }

nr-girlie-groups  isa tmql:function
desc: computes nr of girlie groups
tmql:return : .....

...

Having this stored in file:groupies.atm, that URL can also serve as a subject identifier for the group of functions which can be found there. In the TMQL query expression we simply refer to it:

"""
grp isa tmql:ontology ~ file:/usr/local/tmql/groupies.atm
"""

and then use the functions as in, say, grp:group-size (wp:U2).

The procedure for the TMQL processor is similar to the above. This time, though, it will have to follow the subject identifier file:/usr/local/tmql/groupies.atm to hunt down the necessary information. We leave it to the processor to figure out by itself the format the map is stored in. In any case, we expect the processor to parse the document and realize the functions in there.

The fact that functions are topics allows a last variation. A processor may allow a developer to use a language other than TMQL, say, Python:

"""
ctime isa tmql:function
return @ python: """
    from datetime import ctime
    return ctime()
"""

"""

All we needed to do was to set the scope accordingly. Needless to say, this is another excellent extension mechanism, but only if the function itself is short enough to be directly included. Which probably rules out COBOL, FORTRAN and Java (grin).

Generating Content

Tuples of Simple Things

When using the SELECT or the path expression flavour of TMQL, you will always get a sequence of tuples of 'simple things', i.e. atomic values. In the expression

select $p / name, $p / birthdate
where
   $p isa person

the overall result is a table, the first column filled with people's names and the second column with their birthdate(s).

Only if our queried map is formed in such a way that every person has exactly one name and exactly one birthdate will there be a one-to-one relation with the result tuple sequence. If one person had two names, then each name would appear with one and the same birthdate; if one person had no birthdate information, a tuple for that person might never be generated.
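
This combinatorial behaviour can be illustrated with toy data in Python (the real evaluation of course works over TMDM items, not dicts):

```python
# A person with two names yields two tuples; a person without a
# birthdate yields none, since the inner loop runs zero times.
people = [
    {'name': ['John Lennon', 'J. W. Lennon'], 'birthdate': ['1940-10-09']},
    {'name': ['Yoko Ono'],                    'birthdate': []},
]
result = [(n, b) for p in people
                 for n in p['name']
                 for b in p['birthdate']]
print(result)
# [('John Lennon', '1940-10-09'), ('J. W. Lennon', '1940-10-09')]
```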

In most cases this is the intention; in others you may want to use a default value to enforce a value there:

select $p / name, $p / shoesize || undef
where
   $p isa person

and in other cases you may want to be certain to get exactly one name (regardless of its type):

select $p / name [0], $p / shoesize || undef
where
   $p isa person

This is the place where the developer is exposed to the flexibility of the TM data model. Ultimately, it is up to him (or her) to decide what should go into the final result.

Grouping (Binding Set Sorting)

TMQL does not need a dedicated syntax for grouping, because it is implicit in the way tuple sequences are constructed. According to the TMQL semantics, binding sets are tested first, and only if the tests succeed with these binding sets will new tuple sequences be generated.

To illustrate this, let us consider the following:

select $p / name, $p / shoesize
where
   $p isa person

Here first all binding sets are generated where $p is bound to one person item. Only then, for each such binding set, is the tuple expression in the SELECT clause evaluated. Each individual binding set creates a tuple sequence, so the grouping happens along the binding sets. The overall result is then the concatenation of all these partial tuple sequences. In the above case the processor will do this concatenation, but as there is no further requirement to keep the partial tuple sequences together, processors may deliver the whole sequence in any order.
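
The evaluation order can be mimicked in Python with toy data:

```python
# First enumerate all binding sets (WHERE), then evaluate the tuple
# expression per binding set (SELECT); the overall result is the
# concatenation of the partial tuple sequences.
people = {'lennon': {'name': ['John Lennon'], 'shoesize': [10, 11]},
          'ono':    {'name': ['Yoko Ono'],    'shoesize': [7]}}

binding_sets = [{'$p': p} for p in people]
result = []
for binding in binding_sets:
    item = people[binding['$p']]
    partial = [(n, s) for n in item['name'] for s in item['shoesize']]
    result.extend(partial)        # grouping happens along the binding sets
print(result)
# [('John Lennon', 10), ('John Lennon', 11), ('Yoko Ono', 7)]
```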

One way of grouping the partial sequences is using the ORDER BY clause:

select $composer / name, $composer / birthdate
where
  composed (... $composer ...)
order by
  $composer / name

For each binding set the processor will evaluate the value expression in the ORDER BY clause and will expect to see exactly ONE value there (it can be empty, though, and if there are more, just one is picked). Then the respective binding sets are sorted according to these values. In this order then the binding sets are used to evaluate the tuple expression inside the SELECT clause.

One consequence of all this is that a processor will now have the blocks of partial tuple sequences ordered according to the composer's name. Inside one such block (again, one partial tuple sequence may be arbitrarily long) there is no ordering at all.
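
In Python terms, the binding sets are sorted by the ORDER BY value expression before the SELECT tuple expression is evaluated (toy data, hypothetical names):

```python
# Sort the binding sets by the composer's name, then evaluate the
# SELECT expression in that order.
composers = {'ono':   {'name': 'Yoko Ono',  'birthdate': '1933-02-18'},
             'waits': {'name': 'Tom Waits', 'birthdate': '1949-12-07'}}

ordered = sorted(composers, key=lambda c: composers[c]['name'])
result = [(composers[c]['name'], composers[c]['birthdate']) for c in ordered]
print([name for name, _ in result])   # ['Tom Waits', 'Yoko Ono']
```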

One small variation demonstrates that the ordering criterion is completely independent of the information returned. This time we sort according to the composer's shoesize:

select $composer / name, $composer / birthdate
where
  composed (... $composer ...)
order by
  $composer / shoesize

It is also not surprising that several ordering criteria can be specified; and in general one can also specify whether the sorting is ascending or descending:

select $composer / name, $composer / birthdate
where
  composed (... $composer ...)
order by
  $composer / shoesize, fn:count ($composer <- composer) desc

Here we first try to sort the binding sets according to the shoesize. If there is a draw, we use the number of associations where the composer appears in the composer role. This number we use, just for the sake of demonstration, in descending fashion.
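
In Python this corresponds to a compound sort key, with the descending component negated (toy data):

```python
# Sort by shoesize ascending; break draws by composition count descending.
composers = [
    {'name': 'A', 'shoesize': 42, 'compositions': 3},
    {'name': 'B', 'shoesize': 42, 'compositions': 7},
    {'name': 'C', 'shoesize': 40, 'compositions': 1},
]
ordered = sorted(composers,
                 key=lambda c: (c['shoesize'], -c['compositions']))
print([c['name'] for c in ordered])   # ['C', 'B', 'A']
```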

Interestingly, the ordering clause can exist, but be empty:

select $composer / name, $composer / birthdate
where
  composed (... $composer ...)
order by

This captures the only remaining case that we want grouping, but without having to commit to any sorting order. Yes, I know, this is extremely elegant.

Grouping (Tuple Sorting)

Independently of the sorting of binding sets, the partial tuple sequences can also be subjected to sorting. This is directly encoded in the SELECT clause:

select $composer / name asc, $composer / shoesize desc
  ...
order by
  ...

As expected, the default for ordering is ascending; but any sorting only happens when there is at least one asc or desc somewhere in the tuple expression. Otherwise, we are back to the case where we do not care about the order.

Global Sorting

In some cases, no grouping whatsoever should happen; instead, the overall result should be sorted according to a given criterion. This is actually a special case of grouping, whereby the group size is limited to 1.

To demonstrate this, let us assume we wanted a sorted list of composer names, together with the number of items each composer has composed:

select $name, fn:count ($composer <- composer)
where
   $composer...
 & $name == $composer >> characteristics name
order by
   $name

Obviously we have made the name explicit by introducing a variable for it. That way the binding set will always have two variables, $composer and $name. When the ORDER BY clause does the sorting, it will sort the whole result list.

This method can be generalized to more criteria if for each such criterion a new variable is introduced.

Sorting Within Path Expressions

The sorting mechanism within path expressions relies only on that for sorting tuples. As with the other flavours, by default no sorting will occur. If I asked for all persons' names and ages against the context map

// person (. / name, . / age)

then the result will contain a sequence of name/age pairs, in no particular order.

Assuming for a second that there is exactly one name and one age for each person in the map, then

// person (. / name asc, . / age)

will do exactly what is expected, namely rearrange the sequence so that the name component is ordered. Should a person have several age values (maybe unlikely), then one name may appear any number of times together with these ages, but one name will form a group.

Obviously it also makes a difference at which level the sorting is requested. In

// person / name asc

the first thing done is to generate all persons' names, and only then the sorting. This contrasts with a more localized sort

// person ( . / name asc) 

where for each person the list of names is generated, and only that list is ordered. When these partial lists are concatenated, nothing is said about the overall order.

Generating XML

TMQL expressions can be used to generate XML fragments. The idea is to acknowledge the fact of life that there will be various different organizational principles for content around for a while: relational data, hierarchical tree-oriented information, and graph-like information, such as Topic Maps (and, yeah, RDF).

If XML generation were not part of TMQL, there would be two other options to arrive at XML-organized data coming from a TM backend store. One path is to define a fixed XML vocabulary (in its own namespace) into which queried content is converted by the processor; the application will in all likelihood have to postprocess this so that it fits its purpose.

The other is to leave it up to the application to create DOM nodes directly according to its needs. This has to be done while iterating over a result list; if the application engineer wants to avoid that these lists are huge, he will have to organize the XML generation into loops, starting with the top level and then firing individual TMQL queries against the database in the inner levels. Needless to say, this is VERY expensive, as it needs a lot of interaction between application and TMQL processor. For this reason, the whole process has been moved into the TMQL processor, consciously making the language bigger for that part.

It is worth stressing that this is NOT templating. So a TMQL processor will not interpret the specified XML content as text stream into which bits and pieces from the queried topic map have to be embedded. Instead a RETURN clause is fully pre-parsed by the processor. In

return
   <albums>{
     for $a in // album return
         <album>{$a / name [@ en ]}</album>
   }
   </albums>

it will recognize XML content because of the leading opening angle bracket. It will follow the tags and use specific rules for how content is supposed to be embedded.

Most of the time, any generated content will be converted into its textual form as TEXT nodes. If topic map content is to be embedded, then the processor will use XTM, automatically converting topic map items into that format.

Regardless of the nesting level, the overall result is again a sequence of tuples. This time each tuple contains an XML node, be it a TEXT node to carry whitespace and line breaks, or an ELEMENT node.

On the one hand, TMQL generalizes the XML structure in that it allows dynamic content to be embedded. This is done using a {} pair, mimicking XQuery, which does the same. Still, there is one limitation: generated XML content MUST start with an XML element and not with a TEXT node:

return
   this will not work
   <albums>{
     for $a in // album return
         <album>{$a / name [@ en ]}</album>
   }
   </albums>

On the other hand, the places where this can happen are quite limited. At this stage it is only allowed within attribute names, attribute values and inside XML elements. So it is possible to dynamically generate element names or attribute names. There is also support for namespaces, but not for processing instructions or CDATA nodes. All this the application has to control at the outset.

Generating Topic Maps

For TM generation CTM, the Compact Topic Maps syntax (once it is finished), is used. The obvious use case is to transform one map (and the vocabulary used therein) into another map with a possibly different vocabulary. It could look something like this:

for $p in // person
where
   has-composed (composer: $p, opus: $_)
return """
       { $p }   # this copies the whole topic verbatim

       {fn:id ($p)} isa composer
"""

For each composer we have found in the queried map, we simply copy the whole topic. This is achieved with { $p }. The variable $p is already bound to a topic item, and when such an expression is encountered in a CTM text stream where a topic is expected, the whole topic information is simply echoed there.

There are a few other similar rules where the place in the CTM syntax controls what actually should be embedded. One of them allows embedding whole name items:

for $p in // person
where
   has-composed (composer: $p, opus: $_)
return """
       * isa composer
       { $p / name }   # this copies the name items
"""

The above would leave it up to a TMQL processor to generate an item identifier. But it would make sure that this new topic is an instance of composer. Apart from that, all name items of the person bound to $p are computed. And as that is done where the CTM stream allows name items, all of these would be copied into the new topic. Of course, this works the same way with occurrence items.

Conditional and Unconditional Content

While generating content, query expressions can be nested in various ways, either unconditionally (by enclosing them in {}) or conditionally using an if-then-else construct.

Content generation depending on a condition is most obvious when using the FLWR style

for $p in // person
return
   if $p / shoesize > 32 then
      ($p / name, "bigfoot")
   else
      ($p / name, "smallfoot")

but it can be used in other styles:

select $p / name,
       if $p / shoesize > 32 then "bigfoot" else "smallfoot"

Without many limitations, dynamically generated content can be used wherever content is expected:

select $group / name,
       " members are: " + fn:string-join (
                               fn:tuple ({
                                   select $person / name
                                   where
                                       is-part-of  (member: $person, whole: $group)
                                   }), ",")
where $group isa group

It finds music groups and concatenates the member names.

The function fn:tuple simply takes a tuple sequence and creates one large tuple from it, so that the result is a list of all atoms in the tuple sequence. The function fn:string-join is derived from Perl's join, taking a list of elements and a second parameter for the separator token to be used. In our case we used a comma to eventually create one string.
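
The two functions behave much like flattening and joining in Python:

```python
# fn:tuple flattens a tuple sequence into one large tuple;
# fn:string-join concatenates its elements with a separator.
members = [('Bono',), ('The Edge',), ('Adam Clayton',)]
flat = [atom for tup in members for atom in tup]       # fn:tuple
line = 'members are: ' + ', '.join(flat)               # fn:string-join
print(line)
# members are: Bono, The Edge, Adam Clayton
```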

The only consideration is that there are certain rules which the TMQL processor will apply if it has to embed one kind of content into another. So, for instance, an XML fragment will be serialized into text form if it is embedded into an expression involving operators, as in the following example:

select $p / name,
       "shoesize : " +     # this is string concatenation
       if $p / shoesize > 30
          <bigfoot>
       else
          <smallfoot>
where
   $p isa person
	  

The query is actually initiated by first finding all instances of the concept person. For each of them, the expressions in the SELECT clause are evaluated. The first column there is always a string containing one person name. The second column is also a string, but one that is concatenated from a constant string and the result of a nested query expression.

Content Operators

There are 3 operators to construct larger portions of content from smaller ones.

The operator ++ does sequence concatenation, and depending on the nature of the content (tuple sequences, XML content or TM content) it means slightly different things. For tuple sequences it means that the sequences are just concatenated. If one has been ordered in some way, then this ordering is maintained in that any before-after relations are not destroyed. For XML content ++ just combines fragments, building a larger fragment. And for TM content ++ is interpreted as merging.

Formally, the TMQL machinery treats everything as tuple sequences. XML nodes inside an XML fragment are organized as nodes in a tuple in a sequence. And topic map content is organized as items within a singleton tuple within a sequence. With that it is easy to write dedicated expressions for particular cases and then combine the results together whenever it is convenient.

The following example returns a complete topic map. First, it introduces some static concepts and then generates the rest from the map:

return """

    grammy2007 iko grammy

    """
    ++
   {
    for $a in // artist
    where
       fn:random() > 0.95
    return """

      gets-award (award: grammy2007, awarded: $a)

    """
    }

The operator ++ is then used to merge the individual TM fragments as generated by the FLWR expression together.

The operator -- can be read as except: it subtracts the elements of the second operand from the first. For tuple sequences (and thus also XML fragments) this is defined via the comparability of tuples. For TM content it implies that certain items will be suppressed, should they ever exist in the map.

To make sure, for instance, that Jessica Simpson never ever gets a Grammy, we would tweak the above to:

return """

    grammy2007 iko grammy

    """
    ++
   {
    for $a in // artist
    where
       fn:random() > 0.95
    return """

      gets-award (award: grammy2007, awarded: $a)

    """
    }
    --
    """
     gets-award (award: grammy2007, awarded: jessica-simpson)

    """

The last content operator, ==, computes the intersection of the two operand tuple sequences. It is very convenient when it comes to determining whether there is an overlap between two sequences.

$p / birthdate == 1940-10-09

On the surface, we test whether the person's birth date has one particular value. But a person might have several birthdate occurrences; the TMQL processor would not know that without more ontological background. What is really tested here is whether there exists one occurrence of type birthdate with that value, or in other words whether there is an overlap between the values of the tuple sequence on the left and that on the right. == therefore implements exists semantics of comparison. This is the justification for writing == and not =.
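
The exists semantics can be mimicked with a simple overlap test in Python:

```python
# == succeeds if the value sequences of both sides share at least one value.
def exists_eq(left, right):
    return bool(set(left) & set(right))   # non-empty intersection

birthdates = ['1940-10-09', '1975-01-01']  # two birthdate occurrences
print(exists_eq(birthdates, ['1940-10-09']))   # True
print(exists_eq(birthdates, ['1999-12-31']))   # False
```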

Architecture

@@@@@@@

Closed World

Under a closed world assumption everything which is not explicitly stated is taken to be false. In this sense, TMQL behaves like SQL in that it assumes that the map to be queried holds all available information.

This has a number of consequences, mostly on how the semantics is defined. When, for instance, variables are supposed to range over all possible items, then this range is finite, as any queried map is always finite. Another consequence is that the FORALL operator can be mapped to the EXISTS operator:

forall $p in // person satisfies $p isa composer
==
not some $p in // person satisfies not $p isa composer
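
Over a finite map this is just De Morgan's law, as a quick Python check illustrates:

```python
# Closed world: FORALL is rewritten as 'not EXISTS not' over a finite domain.
persons = [{'isa': 'composer'}, {'isa': 'composer'}, {'isa': 'painter'}]

forall    = all(p['isa'] == 'composer' for p in persons)
rewritten = not any(p['isa'] != 'composer' for p in persons)
print(forall, rewritten)   # both agree
```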
	

The quantified quantifiers can also be mapped to each other, as for instance

not at least 5 $p in // person satisfies $p isa composer
==
    at most  4 $p in // person satisfies $p isa composer

Taxonometric Inferencing

TMQL does not include general inferencing, i.e. the derivation of new knowledge from existing knowledge, except for one form which is directly wired into TMDM: taxonometric reasoning. That simply means (a) that if a concept C is a subtype of B and B is in turn a subtype of A, then C is also a subtype of A; and (b) that if a concept B is a subtype of A, then any instance of B is also an instance of A.
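
Both rules amount to a transitive closure over the subtype relation, as in this toy sketch:

```python
# Rule (a): subtyping is transitive; rule (b): an instance of a subtype
# is an instance of all its supertypes.
subtype_of  = {'C': 'B', 'B': 'A'}   # direct subtype edges
instance_of = {'c1': 'C'}

def supertypes(t):
    """All (transitive) supertypes of t."""
    result = []
    while t in subtype_of:
        t = subtype_of[t]
        result.append(t)
    return result

print(supertypes('C'))                           # ['B', 'A']  (rule a)
print('A' in supertypes(instance_of['c1']))      # True        (rule b)
```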

General Inferencing

Inferencing is enabled when the infrastructure knows more about the problem domain than the instance data (facts) within the map. That additional knowledge would allow a computer to derive more facts without these having to be stored anywhere.

Usually this additional knowledge is given via rules (predicates), functions, or additional facts, such as topics or taxonometric knowledge (a type system). As all of this can be regarded as ontological knowledge, it was decided NOT to burden a TMQL processor with this functionality.

Instead, if such inferencing is needed, implementations will have to provide it in a layer which a TMQL processor will use transparently.

Wrapping Up

Potential implementors may want to risk a glimpse at the current TMQL editor draft. With its roughly 50 core syntax rules and roughly 25 shorthand notations it is well below the complexity of SPARQL. This is not counting the grammar of CTM, which an implementor may have to include. The standard will be around 35 pages (without appendices), which may be half that of SPARQL.

What is not so obvious at first sight is that the different language flavours themselves do not add any computational complexity. Both SELECT and FLWR expressions can be mapped into path expressions, so that most of these syntactical variations are swallowed by a parser anyway. This is also true for the majority of shorthand notations introduced; each of them can be dealt with in a single line of code, at least in Perl.

With all that, we request feedback from users and developers, regardless of whether it revolves around usability, applicability to particular application domains or general feasibility of implementation.