Topic Maps are Maps!
A while back I ranted that topic maps are normally not visualized as (quasi-)geographical maps. I argued that the map metaphor is so natural to most of us that lifting it into a semantic space is worth a try.
My assumption is that semantic distance should be derived not only from the topics in the map but, much more importantly, from the documents which are directly or indirectly attached to the topics.
Accordingly, the starting point is a corpus, as implemented in TM::Corpus. Distilling phrases out of texts is a dirty business, especially if you want to do this in a language like German.
Such a text corpus can then be analyzed for features, i.e. properties of the documents in the corpus. Typical features are word occurrences, such as the word microsoft appears 100 times in the first 30% of the document. But you can also add information about topic affiliation as a feature. From that one can compute a vector space, each vector representing one document.
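As an illustration, such feature extraction could look roughly like this. This is a minimal Python sketch; the function names, the vocabulary handling, and the hard-coded topic list are all made up for the example, and none of it is the TM::Corpus API:

```python
from collections import Counter

def feature_vector(text, vocabulary, topics=()):
    """Count vocabulary words in the text, then append binary
    topic-affiliation features. Purely illustrative."""
    counts = Counter(text.lower().split())
    vector = [counts[word] for word in vocabulary]
    # Topic affiliation encoded as 0/1 flags appended to the word counts.
    vector += [1 if t in topics else 0 for t in ("hardware", "software")]
    return vector

docs = [
    ("microsoft released windows", ("software",)),
    ("intel released a chip", ("hardware",)),
]
vocab = sorted({w for text, _ in docs for w in text.split()})
vectors = [feature_vector(text, vocab, topics) for text, topics in docs]
```

Real features would of course be richer (positions within the document, weights, stemming), but the principle is the same: each document becomes one vector.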
Typically such vectors have 100 dimensions or more. This is obviously difficult to visualize, but clustering methods have existed for years that reduce that many dimensions to only a few, say, by clustering feature vectors over a 2-dimensional plane.
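One classic method of this kind is a self-organizing map (Kohonen map), which arranges high-dimensional vectors on a 2-D grid so that similar vectors end up in nearby cells. A minimal numpy sketch, with made-up grid sizes and random data standing in for real feature vectors (this is not the actual implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
grid_w, grid_h, dim = 10, 10, 100           # 10x10 map, 100-D feature vectors
weights = rng.random((grid_w, grid_h, dim))  # one prototype vector per cell
data = rng.random((50, dim))                 # 50 stand-in "documents"

epochs = 20
for epoch in range(epochs):
    lr = 0.5 * (1 - epoch / epochs)              # decaying learning rate
    radius = max(1.0, 5 * (1 - epoch / epochs))  # shrinking neighbourhood
    for v in data:
        # Best-matching unit: the grid cell whose prototype is closest.
        d = np.linalg.norm(weights - v, axis=2)
        bx, by = np.unravel_index(d.argmin(), d.shape)
        # Pull the BMU and its grid neighbours towards the input vector.
        xs, ys = np.meshgrid(np.arange(grid_w), np.arange(grid_h), indexing="ij")
        dist2 = (xs - bx) ** 2 + (ys - by) ** 2
        influence = np.exp(-dist2 / (2 * radius ** 2))
        weights += lr * influence[..., None] * (v - weights)
```

After training, each document is mapped to its best-matching cell, and the density of documents per cell gives the elevation for the landscape.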
Once you are there, you only need to create a 3D elevation model of your semantic space, add some sample documents (the white dots in the picture), or label the landscape. The latter I have switched off to cloak the corpus.
Into the picture I have positioned those documents which are most influential in creating the landscape. Obviously they build the small hills around where they sit.
The relative heights of the hills represent the strength of the document(s) for that hill. Documents in the large plain in the middle are just that: plain.
The relative positioning of the hills themselves I interpret as the proximity of the terms which the hills represent. If the landscape is recomputed, the hills may end up in different absolute positions, but their relative positions always seem to be the same. I suspect an affine group behind that.
If you look closely you will notice that the 2D plane is actually a torus: hills reaching over the right border continue on the left, and the top and bottom borders are joined as well. This has produced the best convergence results and seemed more meaningful to me.
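On a toroidal grid, the neighbourhood distance simply wraps around both borders, so the shorter way around always counts. A small helper of my own sketching (not from the actual implementation) makes the idea concrete:

```python
def torus_dist2(ax, ay, bx, by, width, height):
    """Squared grid distance on a torus: take the shorter way
    around in each direction before combining."""
    dx = min(abs(ax - bx), width - abs(ax - bx))
    dy = min(abs(ay - by), height - abs(ay - by))
    return dx * dx + dy * dy

# Cells (0, 0) and (9, 0) on a 10x10 torus are direct neighbours:
# torus_dist2(0, 0, 9, 0, 10, 10) → 1
```

Using this distance in place of the plain Euclidean grid distance removes the artificial borders, which is why hills can straddle the edges of the picture.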
Any crevices are just echoes from the underlying randomization process. They do not mean anything, but incidentally they give the map a more natural look.
I still have several problems which prevent me from putting this on the road:
- speed (or lack thereof) of the cluster algorithm: Here we are looking at several options, ranging from using a faster vector processing machine to mapping the algorithm onto MapReduce.
- token detection: On the one hand I would like to throw away unusable words (such as and and or). On the other, some words could be transformed into an equivalent noun. Unfortunately, the German Wordnet comes with an .... interesting licence.
- web services: I already have a working WS infrastructure for corpora, but it does not blend nicely with my new Topic Maps programming language.
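The stop-word part of the token detection problem can at least be sketched easily. The stop-word list below is purely illustrative; a real one would be much larger and language-specific, and it does nothing about the harder noun-normalization step:

```python
# Illustrative stop-word list only; German needs its own, much longer one.
STOPWORDS = {"and", "or", "the", "a", "und", "oder", "der", "die", "das"}

def tokens(text):
    """Lowercase, split on whitespace, and drop stop words.
    No stemming or noun normalization -- that is the hard part."""
    return [w for w in text.lower().split() if w not in STOPWORDS]
```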
But even if all this is solved, I am not sure whether critical pieces of the software will go into the public domain anytime soon.
Work supported by the Austrian Research Centers.