Topic Maps are Maps!

A while back I ranted that topic maps are normally not visualized as (quasi-)geographical maps. I argued that the map metaphor is so natural to most of us that lifting it into a semantic space is worth a try.

http://kill.devc.at/system/files/test.jpg

My assumption is that semantic distance should be derived not only from the topics in the map but, much more importantly, from the documents which are directly or indirectly attached to those topics.

Method

Accordingly, the starting point is a corpus, as implemented in TM::Corpus. Distilling phrases out of texts is a dirty business, especially if you want to do it in a language like German.

Such a text corpus can then be analyzed for features, i.e. properties of the documents in the corpus. Typical features are word occurrences, e.g. the word microsoft appears 100 times in the first 30% of a document. But you can also add information about topic affiliation as a feature. From these features one can compute a vector space, with each vector representing one document.
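
To make this concrete, here is a minimal bag-of-words sketch of turning documents into feature vectors. The actual TM::Corpus pipeline is Perl and far richer (it also tracks where in a document a word occurs); the names below are purely illustrative.

```python
# Illustrative only: simple term-count feature vectors, ignoring word
# position. TM::Corpus does considerably more than this.
from collections import Counter

def feature_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

docs = ["microsoft ships software",
        "software patents hurt software"]
vocab = ["microsoft", "software", "patents"]

vectors = [feature_vector(d, vocab) for d in docs]
# vectors[0] == [1, 1, 0], vectors[1] == [0, 2, 1]
```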

Typically such vectors have 100 or more dimensions. That is obviously difficult to visualize, but clustering methods have existed for years to reduce that many dimensions to only a few, say, to cluster the feature vectors over a 2-dimensional plane.
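
I won't name the exact algorithm here, but a self-organizing map is one classic method for exactly this kind of reduction, so here is a toy sketch of the idea. All names and parameters are mine, not from TM::Corpus.

```python
# A toy self-organizing map (SOM): one classic way to project
# high-dimensional feature vectors onto a 2-D grid of cells.
# Illustrative sketch only, not the production algorithm.
import math
import random

def train_som(vectors, width=4, height=4, epochs=50, lr=0.5, radius=1.5):
    dim = len(vectors[0])
    rng = random.Random(42)                      # fixed seed: reproducible runs
    grid = [[[rng.random() for _ in range(dim)]  # one weight vector per cell
             for _ in range(width)] for _ in range(height)]
    for _ in range(epochs):
        for v in vectors:
            # find the best-matching unit (closest cell) for this vector
            by, bx = min(((y, x) for y in range(height) for x in range(width)),
                         key=lambda p: sum((grid[p[0]][p[1]][i] - v[i]) ** 2
                                           for i in range(dim)))
            # pull the winner and its grid neighbours toward the input
            for y in range(height):
                for x in range(width):
                    d2 = (y - by) ** 2 + (x - bx) ** 2
                    h = math.exp(-d2 / (2 * radius ** 2))
                    cell = grid[y][x]
                    for i in range(dim):
                        cell[i] += lr * h * (v[i] - cell[i])
    return grid

def locate(grid, v):
    """Return the (row, column) of the cell whose weights are closest to v."""
    dim = len(v)
    return min(((y, x) for y in range(len(grid)) for x in range(len(grid[0]))),
               key=lambda p: sum((grid[p[0]][p[1]][i] - v[i]) ** 2
                                 for i in range(dim)))
```

After training, dissimilar documents end up in different cells of the plane, similar ones nearby.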

Once you are there, you only need to create a 3D elevation model of your semantic space, add some sample documents (the white dots in the picture) or label the landscape. The latter I have switched off to cloak the corpus.
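
One simple way to get an elevation model out of the clustering is to count how many documents land in each grid cell and smooth the counts. This is my own assumption for illustration, not necessarily how the pictured landscape was computed.

```python
# Assumed sketch: elevation = smoothed per-cell document counts.
def elevation(hits, width, height):
    """hits: list of (row, column) cells that documents were mapped to."""
    counts = [[0.0] * width for _ in range(height)]
    for y, x in hits:
        counts[y][x] += 1.0
    # 3x3 box smoothing with wrap-around neighbours, so the edges of the
    # plane stay consistent
    smooth = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    total += counts[(y + dy) % height][(x + dx) % width]
            smooth[y][x] = total / 9.0
    return smooth
```

Cells with many documents become hills; empty regions stay a flat plain.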

Interpretation

Into the picture I have positioned those documents which were most influential in creating the landscape. Obviously they build the small hills around where they sit.

The relative heights of the hills represent the strength of the document(s) for that hill. Documents in the large plain in the middle are just that: plain.

The relative positioning of the hills themselves I interpret as the proximity of the terms they represent. If the landscape is recomputed, the hills may end up in different absolute positions, but their relative positions seem to stay the same. I suspect an affine group behind that.

If you look closely you will notice that the 2D plane is actually a torus: hills reaching over the right border continue on the left and also the top and bottom border are joined. This has produced the best convergence results and seemed to be more meaningful to me.
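
The wrap-around neighbourhood can be expressed as a distance function: coordinates are compared modulo the grid size, so a hill reaching over the right border really is a direct neighbour of the left border. A sketch, with coordinates as (x, y):

```python
# Distance on a wrap-around (toroidal) plane: take the shorter way around
# in each direction before combining. Sketch only.
def torus_distance(a, b, width, height):
    dx = abs(a[0] - b[0])
    dy = abs(a[1] - b[1])
    dx = min(dx, width - dx)    # the "short way" around horizontally
    dy = min(dy, height - dy)   # ... and vertically
    return (dx ** 2 + dy ** 2) ** 0.5

# On a 10x10 torus, cells (0, 5) and (9, 5) are direct neighbours:
# torus_distance((0, 5), (9, 5), 10, 10) == 1.0
```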

Any crevices are just echoes from the underlying randomization process. They do not mean anything, but incidentally give the map a more natural look.

Next Steps

I still have several problems which keep me from putting this on the road:

  • speed (or lack thereof) of the cluster algorithm: Here we are looking at several options, ranging from using a faster vector processing machine to mapping the algorithm onto MapReduce.
  • token detection: On the one hand I would like to throw away unusable words (such as "and" and "or"). On the other, some words could be transformed into an equivalent noun. Unfortunately, the German Wordnet comes with an ... interesting licence.
  • web services: I have already a working WS infrastructure for corpora, but it does not blend nicely with my new Topic Maps programming language.

But even if all this is solved, I am not sure whether critical pieces of the software will go into the public domain anytime soon.


Work supported by the Austrian Research Centers.


Thrilling stuff

I simply love data visualization, and as such I of course want more, especially given our TM focus. I've made lots of more graph-like structures, but would love to have time to play around with 3D / geographic / proximity spaces for subject-centric computing.

Your torus clusters sound interesting. Do you have any other examples you've made?

Alexander (not verified) | Wed, 06/04/2008 - 09:01

Re: Thrilling stuff

Your torus clusters sound interesting. Do you have any other examples you've made?

The machine is producing more as we speak, uhm write. I'm still in the process of learning how to use the algorithm and which of the 10000 knobs to tweak and which are better left alone. I'll post my findings when I have wrapped my mind around it, but just as an appetizer look at

http://kill.devc.at/system/files/buchteln-small.jpg

Here I have used far too few topics (documents) for the chosen high resolution. Each object occupies its own space and forms a small hill around itself.

These remind me of Buchteln, hence this will henceforth be named the buchtel effect. The cook in you will rejoice.

And I invented a term today, so I have fulfilled my quota and the day full of meetings was not completely wasted.

rho | Wed, 06/04/2008 - 21:40