HD Maps: And Where Are The F*&*() Topics?
(continued from Part III)
Now that I know where the documents are located in the landscape, I have experimented with ways to estimate where the topic map topics are supposed to be. My hypothesis is that if I can determine the distance of each document to every topic, I can triangulate the topics.
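To make the hypothesis concrete, here is a minimal sketch of locating one topic from its distances to documents whose map positions are known. It assumes the distances are plain Euclidean distances in map coordinates, which the post does not state; how a document-to-topic distance is actually derived is left open. The function name `locate_topic` and the sample coordinates are made up for illustration. The circle equations are linearized by subtracting the first one from the others, and the resulting system is solved by least squares.

```python
import math

def locate_topic(doc_positions, distances):
    """Estimate a 2D position from distances to known points.

    Assumption: distances are Euclidean in map coordinates.
    Uses linearized least squares (subtract the first circle equation).
    """
    if len(doc_positions) != len(distances) or len(doc_positions) < 3:
        raise ValueError("need at least three documents with matching distances")
    (x0, y0), d0 = doc_positions[0], distances[0]
    # Accumulate the 2x2 normal equations (A^T A) p = A^T c for p = (x, y).
    saa = sab = sbb = sac = sbc = 0.0
    for (xi, yi), di in zip(doc_positions[1:], distances[1:]):
        a = 2.0 * (xi - x0)
        b = 2.0 * (yi - y0)
        c = (xi**2 - x0**2) + (yi**2 - y0**2) - (di**2 - d0**2)
        saa += a * a
        sab += a * b
        sbb += b * b
        sac += a * c
        sbc += b * c
    det = saa * sbb - sab * sab
    if abs(det) < 1e-12:
        raise ValueError("documents are collinear; position is not determined")
    # Cramer's rule on the normal equations.
    x = (sac * sbb - sab * sbc) / det
    y = (saa * sbc - sab * sac) / det
    return x, y

# Hypothetical document positions and a hidden topic position used to fake distances.
docs = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0)]
hidden = (3.0, 4.0)
dists = [math.dist(p, hidden) for p in docs]
estimate = locate_topic(docs, dists)
```

With noisy distances the estimate will only be approximate, and a topic with very few documents nearby gives the solver little to work with.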
Below (larger version in the attachments) is a new rendering of the MapReduce theme:
It shows the themes derived from the semantic corpus (documents + semantic network). Compare this with the positions of topics:
(admittedly the rendering is not pleasing: text overlaps, texts are truncated, the color sucks, ...)
In many cases the derived theme and the topics at that location correspond nicely. Look for instance at "Holumbus", or "Cloudera", or "HDFS", or "Dryad".
But there are others, such as "Pig", which are quite off, i.e. where the topic is dragged into a different area. I already have a theory about why this is so.
Others are completely off, e.g. "DryadLINQ" or "Dumbo". For these I believe I have far too little content to arrive at plausible results.
What is certainly interesting is that there are large areas where no topic is to be seen.
Once I was here, I thought I should simply take all associations in the topic map and connect the topics on the map:
As you can see, most of them are binary relations, the association type symbolized by a black dot.
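The drawing step can be sketched roughly as follows. This is an assumption about the geometry, not the actual rendering code: each association gets a dot at the centroid of its member topics (the midpoint, for a binary relation) and one line segment from every member to that dot. The data layout, the association type names and the coordinates are hypothetical.

```python
def association_geometry(topic_positions, associations):
    """Compute a dot and connecting segments for each association.

    topic_positions: dict mapping topic name -> (x, y) on the map
    associations: list of (association_type, [topic names]) pairs
    """
    shapes = []
    for assoc_type, players in associations:
        pts = [topic_positions[p] for p in players if p in topic_positions]
        if len(pts) < 2:
            continue  # fewer than two placed topics: nothing to connect
        # The dot sits at the centroid; for two players this is the midpoint.
        dot = (sum(x for x, _ in pts) / len(pts),
               sum(y for _, y in pts) / len(pts))
        segments = [(pt, dot) for pt in pts]
        shapes.append({"type": assoc_type, "dot": dot, "segments": segments})
    return shapes

# Hypothetical positions and association types, for illustration only.
positions = {"MapReduce": (0.0, 0.0), "HDFS": (4.0, 2.0), "Pig": (-2.0, 6.0)}
assocs = [("uses", ["MapReduce", "HDFS"]), ("builds-on", ["Pig", "MapReduce"])]
shapes = association_geometry(positions, assocs)
```

The same routine handles associations with more than two members, which then show up as small stars around their dot.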
What is quite surprising is the highly radial structure (I did not expect this). The central topic "MapReduce", which dominates the mountain range in the center, is connected to many peripheral topics. Only in isolated cases are there connections between peripheral topics.
In retrospect, I should not be surprised, since this is exactly how I built the topic map: starting with the central theme, I simply collected and connected individual software packages.
Still, the proximity of peripheral topics on the map makes me wonder whether I should have specified more connections.
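One way to turn that hunch into a to-do list is to look for topics that sit close together on the map but share no association. The sketch below assumes Euclidean distance in map coordinates and a hand-picked threshold; the function name, the threshold and the sample data are all hypothetical.

```python
import math
from itertools import combinations

def nearby_unconnected(topic_positions, associations, max_distance):
    """List topic pairs within max_distance that share no association.

    Returns (distance, topic_a, topic_b) tuples, closest pairs first.
    """
    connected = set()
    for _assoc_type, players in associations:
        # Any two members of the same association count as connected.
        for a, b in combinations(sorted(players), 2):
            connected.add((a, b))
    candidates = []
    for a, b in combinations(sorted(topic_positions), 2):
        if (a, b) in connected:
            continue
        d = math.dist(topic_positions[a], topic_positions[b])
        if d <= max_distance:
            candidates.append((d, a, b))
    return sorted(candidates)

# Hypothetical radial map: everything hangs off "MapReduce".
positions = {"MapReduce": (0.0, 0.0), "Pig": (5.0, 0.0),
             "Dumbo": (5.0, 1.0), "HDFS": (-6.0, 0.0)}
assocs = [("related", ["MapReduce", "Pig"]),
          ("related", ["MapReduce", "Dumbo"]),
          ("related", ["MapReduce", "HDFS"])]
suggestions = nearby_unconnected(positions, assocs, max_distance=2.0)
```

Each suggested pair is only a hint that a connection might be missing; whether the two topics are really related still has to be judged by hand.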
What I take home from all this is that the visualisation gives me quite a few clues as to where the content is too thin, both in terms of collected documents and in terms of connected topics.
But before I start another research session with Google I will invest some time in rendering topics (and assocs) inside the existing Seadragon interface.