A Map of a Topic Map of MapReduce
Here is a visualisation of such a topic map:
Ok, ok, one map thing after the other:
MapReduce is a programming model for massively parallel computations, introduced by Google. It is aimed at those who have very large amounts of flat data: log files, large bodies of text, raw sensor values, and the like.
MR is one of the cloud computing technologies, with Hadoop as its most popular implementation. But Microsoft is on its heels, with DryadLINQ as one (very interesting and much cleaner) framework.
MapReduce Topic Map
For about two years I have been casually observing this technology, and to organize my knowledge (and bookmarks) I maintain a MapReduce topic map. We Topic Map people do that.
Maps of Topic Maps
Last year (time flies!) I set out with the working hypothesis that Topic Maps are maps. That project was side-lined for a while by other work, but recently I have invested considerable effort in improving the quality of these maps.
The snapshot above (larger version in the attachments below) shows one visualisation of such a MapReduce topic map.
My approach is to first look at the content inside the map, i.e. all names and data occurrences. For URL occurrences I harvest the documents they refer to. All of this is captured in a Topic Map Corpus. What follows is classical feature extraction and the usual machine learning machinery (SVM, SOM, LSA).
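As a minimal sketch of that pipeline (with a made-up stand-in for the real Topic Map Corpus, and LSA done via a plain SVD rather than whatever the actual toolchain uses):

```python
# Sketch of the feature-extraction step: a document-term count matrix,
# then LSA via SVD. The corpus below is a hypothetical stand-in for the
# real Topic Map Corpus (names, data occurrences, harvested documents).
import numpy as np

corpus = [
    "mapreduce programming model for massively parallel computation",
    "hadoop is a popular mapreduce implementation",
    "dryadlinq is a mapreduce framework from microsoft",
    "topic maps organize knowledge as topics and associations",
]

# Classical feature extraction: count each vocabulary term per document.
vocab = sorted({word for doc in corpus for word in doc.split()})
counts = np.array([[doc.split().count(w) for w in vocab] for doc in corpus],
                  dtype=float)

# LSA: project the documents onto the strongest latent dimensions
# (the real map uses over 80 dimensions; 2 keeps the toy readable).
k = 2
U, S, Vt = np.linalg.svd(counts, full_matrices=False)
doc_vectors = U[:, :k] * S[:k]

print(doc_vectors.shape)  # (4, 2): one latent vector per document
```

The latent vectors are what the downstream SOM training and distance computations then operate on.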
What I tried to achieve is not only that the terms appearing in this landscape are organized by their semantic distance: if a term is used in many contexts, that should also translate into a larger contiguous area, visualized with a larger font.
Obviously, much in this map corpus is about the web, although not the web as we know it, just that fraction which is relevant for mapreduce.
The mapreduce area is just a slope on the web mountain. It extends to the south and wraps around, continuing at the north edge. In the same way the distributed area extends over the east edge and continues in the west. That torus topology seemed most natural to me.
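The wrap-around can be captured in a few lines (grid size and coordinates here are purely illustrative):

```python
# Torus topology: distances wrap around both map edges, so an area
# leaving the south edge continues at the north edge, and likewise
# east/west.
def torus_distance(a, b, width, height):
    """Euclidean distance between grid points a and b with wrap-around."""
    dx = abs(a[0] - b[0])
    dy = abs(a[1] - b[1])
    dx = min(dx, width - dx)   # wrap east/west
    dy = min(dy, height - dy)  # wrap north/south
    return (dx * dx + dy * dy) ** 0.5

# Two points near opposite edges are close on a torus:
print(torus_distance((0, 5), (99, 5), width=100, height=100))  # 1.0
```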
What should strike you as odd is that there is effectively one mountain and one plain. That is because I chose to derive the altitude not from the intensity of a term, but from its absolute semantic distance to the most intense one (i.e. web).
That was probably not a clever move, so I will change it back to the term intensity.
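The difference between the two altitude schemes is easy to see on made-up numbers (all intensities and distances below are illustrative, not taken from the actual map):

```python
# Illustrative comparison of the two altitude schemes discussed above.
import numpy as np

terms = ["web", "mapreduce", "hadoop", "distributed"]
intensity = np.array([0.9, 0.5, 0.4, 0.3])
# semantic distance of each term to the most intense one ("web")
dist_to_peak = np.array([0.0, 0.6, 0.8, 0.7])

# Scheme in the snapshot: altitude falls off with distance from the
# single peak, so there is exactly one mountain and one plain.
altitude_v1 = 1.0 - dist_to_peak / dist_to_peak.max()

# Scheme to switch back to: each term's own intensity, which allows
# several independent hills.
altitude_v2 = intensity

print(altitude_v1)  # [1.    0.25  0.    0.125]
print(altitude_v2)
```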
Update: This is how it looks now (see attachments):
The locations have changed as I have reseeded the map. There is nothing like a center in a torus, right?
Note that mapreduce also occurs twice: once in the context of distributed web, and once within the large plain, in a hadoop context. At first that looked like a flaw to me, as I expected all occurrences of a term to aggregate into one area.
But even after prolonged learning such a singular aggregation does not take place. So I stick with a context theory for the time being: in a high-dimensional space (I use over 80 dimensions) it is simply not always possible to flatten everything.
What supports this theory is, for instance, the combination of computing and cloud in the top-right corner. Since these terms coincide so frequently in MapReduce-related documents, it is not surprising to find them in close proximity.
That theory makes me happy. And it is easier than trying to look for a programming bug.
If one increases the number of visible areas in the map, it becomes a bit busier, but also shows more detail:
Most of the detail is added in the green plain, where many terms are competing for space. I know that for sure, as most of the learning takes place there.
At the same time I have increased the number of contour lines, to let you speculate about the terrain.
What I have not done yet is actually exploit the topological information within the topic map structure. I first wanted to see how far standard machine learning gets, so that I can then argue how much a semantic network adds in precision.
So how can a Topic Map help?
Well, every document (internal or external) is affiliated with a topic, either as a reference, as internal data, or as a subject locator or identifier. This implies that I can boost every document in the direction of the topic it belongs to. A tutorial about MapReduce, for instance, would receive a boost in the direction of mapreduce, as if the document itself contained that very term prominently.
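In term-vector terms, such a boost could look like this (the vocabulary, the document vector, and the boost weight are all made up for illustration):

```python
# Sketch of the boost: a document affiliated with the topic "mapreduce"
# is nudged in that term's direction, as if the term occurred
# prominently in the document itself.
import numpy as np

vocab = ["web", "mapreduce", "hadoop", "tutorial"]

def boost_towards_topic(doc_vec, topic_term, weight=0.5):
    """Return a copy of doc_vec shifted towards topic_term's dimension."""
    boosted = doc_vec.copy()
    boosted[vocab.index(topic_term)] += weight
    return boosted

doc = np.array([0.1, 0.2, 0.0, 0.7])  # a MapReduce tutorial
print(boost_towards_topic(doc, "mapreduce"))  # [0.1 0.7 0.  0.7]
```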
But documents are not only affiliated directly with one topic; via associations they are also connected to neighboring topics, albeit much more weakly (via Gauss-dampened distances).
How strongly topics are connected to each other depends on the Topic Map topology. That has to take link distances into account, weighted by the frequencies of the individual association types.
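One way to sketch the Gauss-dampened neighbourhood: walk the association graph breadth-first to get link distances, then dampen each topic's influence with a Gaussian. The little graph and the sigma are illustrative, and the per-association-type frequency weighting is left out here.

```python
# Sketch: a document's boost spreads from its own topic to associated
# topics, falling off with link distance via a Gaussian.
import math
from collections import deque

# Hypothetical association graph around "mapreduce".
associations = {
    "mapreduce": ["hadoop", "distributed"],
    "hadoop": ["mapreduce"],
    "distributed": ["mapreduce", "cloud"],
    "cloud": ["distributed"],
}

def link_distances(start):
    """Breadth-first link distance from start to every reachable topic."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        topic = queue.popleft()
        for neighbour in associations.get(topic, []):
            if neighbour not in dist:
                dist[neighbour] = dist[topic] + 1
                queue.append(neighbour)
    return dist

def gauss_weights(start, sigma=1.0):
    """Dampen each topic's influence by exp(-d^2 / (2 sigma^2))."""
    return {t: math.exp(-d * d / (2 * sigma * sigma))
            for t, d in link_distances(start).items()}

print(gauss_weights("mapreduce"))
# mapreduce: 1.0, hadoop/distributed at 1 link ~0.61, cloud at 2 links ~0.14
```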
Luckily we have a rainy weekend ahead of us.