HD Maps: And Where Are The F*&*() Topics?

(continued from Part III)

Now that I know where the documents are located in the landscape, I have experimented with ways to estimate where the topic map topics are supposed to be. My hypothesis is that if I can determine the distance of each document to every topic, I can triangulate the topics.

Topic Positions

Below (larger version in the attachments) is a new rendering of the MapReduce theme:

http://kill.devc.at/system/files/x4-small.jpg

It shows the themes derived from the semantic corpus (documents + semantic network). Compare this with the positions of topics:

http://kill.devc.at/system/files/x5-small.jpg

(admittedly the rendering is not pleasing: text overlaps, texts are truncated, the color sucks, ...)

In many cases the derived theme and the topics at that location correspond nicely. Look for instance at "Holumbus", or "Cloudera", or "HDFS", or "Dryad".

But there are others, such as "Pig", which is quite off, i.e. where the topic is dragged into a different area. I already have a theory why this is so.

Others are completely off, e.g. "DryadLINQ", or "Dumbo". For these I believe I have far too little content to arrive at plausible results.

What is certainly interesting is that there are large areas where no topic is to be seen.

Association Positions

Once I was here, I thought I should simply take all associations in the topic map and connect the topics on the map:

http://kill.devc.at/system/files/x6-small.jpg

As you can see, most of them are binary relations, with the association type symbolized by a black dot.

What is quite surprising is the highly radial nature (I did not expect this). The central topic "MapReduce", which dominates the mountain range in the center, is connected to many peripheral topics. Only in isolated cases are there connections between peripheral topics.

In retrospect, I should not be surprised: I have built the topic map exactly so. Starting with the central theme, I simply collected and connected individual software packages.

Still, the proximity of peripheral topics on the map makes me wonder whether I should have specified more connections.
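A minimal sketch of how such a check could look, assuming the topic positions are available as 2D coordinates and the associations are reduced to binary pairs. The function name, the coordinates and the distance threshold are made up for illustration; this is not the code behind the renderings above.

```python
import math
from itertools import combinations


def missing_link_candidates(positions, associations, max_distance):
    """List topic pairs that lie close together on the map but share no association.

    positions:    dict mapping topic name -> (x, y) map coordinates (hypothetical input)
    associations: iterable of (topic, topic) pairs, binary associations only
    max_distance: arbitrary cut-off in map units, to be tuned by eye
    """
    linked = {frozenset(pair) for pair in associations}
    candidates = []
    for a, b in combinations(sorted(positions), 2):
        if frozenset((a, b)) in linked:
            continue  # already connected in the topic map
        distance = math.dist(positions[a], positions[b])
        if distance <= max_distance:
            candidates.append((a, b, distance))
    # Closest unconnected pairs first
    return sorted(candidates, key=lambda c: c[2])


# Invented coordinates, only to show the call
positions = {"MapReduce": (0.0, 0.0), "Pig": (5.0, 5.0), "Dumbo": (5.0, 6.0)}
associations = [("MapReduce", "Pig"), ("MapReduce", "Dumbo")]
for a, b, d in missing_link_candidates(positions, associations, max_distance=2.0):
    print(a, b, round(d, 2))
```

Pairs reported this way are only hints: two topics may sit close together simply because their documents share vocabulary.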

Quality Assurance

What I take home from all this is that the visualisation gives me quite a few clues as to where the content is too thin, both in terms of collected documents and in terms of connected topics.

But before I start another research session with Google I will invest some time in rendering topics (and assocs) inside the existing Seadragon interface.

Attachments (name, size):
x6-small.jpg, 37.86 KB
x6.png, 1.93 MB
x5-small.jpg, 34.02 KB
x5.png, 1.22 MB
x4-small.jpg, 33.78 KB
x4.png, 1.73 MB

Triangulate?

Can you say a bit more about "triangulate?"

If you know the distance between a document and a topic, what is there to "triangulate" about the topic's position?

Or do you mean to discover topics/subjects based on relationships between documents, such that at some intersection of relationships there should be a topic?

I think that sounds hard, because intersections could be arbitrary artifacts as well as meaningful ones.

You might want to take a peek at the Search User Interface book that I am working through at my blog. Research on iconic representation of documents isn't favorable. But perhaps you will discover a new approach.

Patrick

Patrick Durusau (not verified) | Sun, 05/09/2010 - 21:19

Re: Triangulate?

Can you say a bit more about "triangulate?" If you know the distance between a document and a topic, what is there to "triangulate" about the topic's position?

I know the positions of the documents, but not those for the topics.

But if I start with the document positions, work out which topics each document belongs to directly (and indirectly), and also how it belongs (subject identifier, locator or occurrence, and then each type and its entropy within the map), then, well, then I can compute a resultant vector for each topic. And that vector I can locate in the landscape at a certain position.

If you now think "mucho, mucho computatione", then you are spot on!
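To make the idea concrete, here is a minimal sketch that treats the resultant vector as a weighted average of document positions. It assumes 2D document coordinates and one already-computed weight per document for the topic in question; the real computation may well run in a higher-dimensional space before projection, and the names below are invented for illustration.

```python
from typing import Dict, Tuple

Point = Tuple[float, float]


def topic_position(doc_positions: Dict[str, Point], weights: Dict[str, float]) -> Point:
    """Weighted centroid of document positions for a single topic.

    doc_positions: document id -> (x, y) on the map (assumed to be known)
    weights:       document id -> strength with which the document belongs
                   to the topic (assumed non-negative, missing means 0)
    """
    total = sum(weights.get(doc, 0.0) for doc in doc_positions)
    if total <= 0:
        raise ValueError("no document contributes to this topic")
    x = sum(weights.get(doc, 0.0) * pos[0] for doc, pos in doc_positions.items()) / total
    y = sum(weights.get(doc, 0.0) * pos[1] for doc, pos in doc_positions.items()) / total
    return (x, y)


# Two documents; the second belongs three times as strongly to the topic
docs = {"doc-a": (0.0, 0.0), "doc-b": (4.0, 0.0)}
print(topic_position(docs, {"doc-a": 1.0, "doc-b": 3.0}))  # (3.0, 0.0)
```

A topic with very few contributing documents ends up wherever those few documents happen to lie, which makes its position fragile.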

rho | Mon, 05/10/2010 - 11:24

Computing Resultant Vector

OK, so each document that directly or indirectly "belongs" to a topic contributes to the resultant vector for that topic? That sounds intense.

My mistake was thinking you were calculating a document vector. You are computing a topic vector that is composed of document vectors. And then locating that topic vector on the map. Yes?

A fair warning to the unwary reader, I assume by "entropy within the map," you mean: http://en.wikipedia.org/wiki/Topological_entropy.

I still need to work on a kitchen mixer type explanation for the TMRM. See: MapReduce Explained. Then maybe I can turn to your latest adventures. ;-)

Patrick Durusau (not verified) | Thu, 05/13/2010 - 01:32

Re: Computing Resultant Vector

OK, so each document that directly or indirectly "belongs" to a topic contributes to the resultant vector for that topic? That sounds intense.

It is. Especially since every document contributes to every topic, albeit with different strengths.

My mistake was thinking you were calculating a document vector. You are computing a topic vector that is composed of document vectors. And then locating that topic vector on the map. Yes?

Exactly.

A fair warning to the unwary reader, I assume by "entropy within the map," you mean: http://en.wikipedia.org/wiki/Topological_entropy .

I have to answer the question:

  • How much information does this association or that occurrence contribute?

So I am only computing the entropy value of association types and occurrence types, and use that as the weight.

When I have a bit more time, I will try something more sophisticated.
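One plausible reading of such a weight, sketched here as an assumption rather than as the actual implementation: count how often each association or occurrence type appears in the map and use its self-information, -log2(p), so that rare types weigh more than ubiquitous ones. The type names in the example are invented.

```python
import math
from collections import Counter
from typing import Dict, Iterable


def type_weights(type_labels: Iterable[str]) -> Dict[str, float]:
    """Self-information -log2(p) per type, with p the type's relative frequency.

    type_labels: one label per association or occurrence found in the map
    """
    counts = Counter(type_labels)
    total = sum(counts.values())
    return {t: -math.log2(c / total) for t, c in counts.items()}


def shannon_entropy(type_labels: Iterable[str]) -> float:
    """Shannon entropy (in bits) of the type distribution as a whole."""
    counts = Counter(type_labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Invented example: three common associations and one rare one
labels = ["is-related-to", "is-related-to", "is-related-to", "implements"]
print(type_weights(labels)["implements"])  # 2.0
```

Under this reading, a type used on almost every association contributes almost nothing, which matches the intuition that it says little about where a topic belongs.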

rho | Thu, 05/13/2010 - 10:37