High-Definition Semantic Maps (Part I)

This is my first stab at a realistic data set (see the attachments for the original resolution):

http://kill.devc.at/system/files/mr-wo-docs-small.jpg

It shows the landscape around the theme MapReduce, a cloud computing technology about which semantic web people may or may not have heard. In either case, the landscape tries to paint an intuitive picture of the involved topics:

  • MapReduce, the computing principle;
  • Hadoop, the Java implementation of MapReduce, and also satellite technologies such as HadoopDB, Pig, Dumbo or Mahout;
  • Cloud computing as offered by Amazon (EC2), Yahoo or Cloudera and the like;
  • How Google (which promoted the paradigm first) fits into the picture (sic!);
  • And the many other software packages which implement the MapReduce processing method.

If your favourite software package is missing, then either (a) I have simply missed recording it in my semantic network, or (b) there was not enough text information for the machinery to push the topic onto the surface. Or (c) the map resolution chosen for this demo does not allow that degree of detail.

What This Map Shows

As with natural landscapes, the areal extent of a certain topic is a direct measure of how relevant that topic is in relation to the whole.

The mapreduce range in the south-east (SE) corner occupies around 1/8th of the overall real estate, implying that the topic has about that importance within the considered semantic network. Actually, this mountain range predominantly covers "MapReduce, the computation paradigm". "MapReduce as used and seen by Google" is a different aspect, located at the north edge. The aspect "MapReduce software" is separate as well (around the north-east).

The color coding symbolizes topic intensity. Revisiting the mapreduce range in the SE corner, there is obviously a lot of content covering predominantly that very topic (and nothing else). Content here means:

information within the semantic network and the documents that it mentions.

The alert reader will notice that the landscape wraps around the edges: follow it south, and it continues in the north; follow it east, and it connects in the west (torus topology).
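For the curious, a minimal sketch of what that wrap-around means computationally, assuming the rendered map is held as a simple 2-D grid of height values (the grid and the helper name are my own illustration, not the actual implementation):

    # Torus-style lookup on a 2-D height grid: both axes wrap around.
    def height_at(grid, x, y):
        rows, cols = len(grid), len(grid[0])
        return grid[y % rows][x % cols]

    grid = [[0, 1, 2],
            [3, 4, 5]]
    # Walking off the eastern edge re-enters in the west ...
    assert height_at(grid, 3, 0) == height_at(grid, 0, 0)
    # ... and walking off the southern edge re-enters in the north.
    assert height_at(grid, 0, 2) == height_at(grid, 0, 0)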

A Virtual Semantic Tour

Let's look closer at the map:

The SE mountain range labelled mapreduce covers mostly conceptual material: many blog entries introducing the basic idea behind the paradigm and cloud computing per se. Towards the south the relationship between cloud computing and conventional databases becomes more emphasized.

Following the ledge northwards will lead to the Hadoop shoulder. Documents there will be Hadoop's own documentation, tutorials and experience reports. Further northwards these experiences will be more about operational issues and cloud computing in general, be it with Amazon's Elastic Compute Cloud (EC2) or with Cloudera (not much about Yahoo here, yet). A few documents also cover performance issues and are therefore visible there.

The massive mapreduce block in the north-west is Google's position. Unsurprisingly, one will find the discussion regarding related patents there. Separate from Google's sphere of influence, there is Microsoft's Dryad out in the plain. Obviously Dryad's importance pales compared to the rest, both in terms of real estate and document intensity.

Around the equator you will find many more hills concerned with rather small software packages, some satellites to Hadoop (such as HadoopDB, HDFS or Hive), some more separate such as Pig or Gisting. I have not collected much information about most of these, hence their low intensities in the map.

And The Documents?

What is possible is to position documents where they fit best into the landscape (again, better to look at the larger versions below):

http://kill.devc.at/system/files/mr-wi-docs-small.jpg

Each little circle represents one document, where the circle size corresponds to the impact that document has on the landscape.
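As a rough idea of how such an overlay could be produced (the coordinates and impact values below are invented for illustration; this is not the actual rendering code), scaling each circle's area with the document's impact would look roughly like this:

    # Hedged sketch: document overlay with circle area proportional to impact.
    import matplotlib.pyplot as plt

    docs = [               # (x, y, impact) -- invented sample values
        (0.20, 0.75, 4.0),
        (0.35, 0.60, 1.5),
        (0.80, 0.15, 9.0),
    ]
    xs, ys = [d[0] for d in docs], [d[1] for d in docs]
    sizes = [30.0 * d[2] for d in docs]   # scatter() sizes are areas (pt^2)

    plt.scatter(xs, ys, s=sizes, facecolors='none', edgecolors='black')
    plt.xlim(0, 1); plt.ylim(0, 1)
    plt.savefig('doc-overlay.png')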

Displaying document details is still unfinished business.

Attachments:

  • mr-wo-docs-small.jpg (23.5 KB)
  • mr-wi-docs-small.jpg (25.85 KB)
  • mr-wi-docs.png (1.97 MB)
  • mr-wo-docs.png (1.88 MB)

Cool!

I can't stop thinking how cool these things are, but a few questions:

What do you do when there are many, many more documents? Are there some break-off points for when these maps go from sane and useful into chaos and pain? Can they be zoomable? (Obviously with a different technology back-end.)

Also, the colors to me look a bit funny where you've got greeny hills and brown plains (which is fine if you live in the Middle East). Any other colors and options you've played with?

Alex (not verified) | Mon, 03/01/2010 - 00:54

Re: Questions

What do you do when there are many, many more documents?

Buy a bigger machine :-)

But in terms of the map, it simply means that the surface granularity will be higher. And that the map object itself can be bigger. And it is trivial to ...

Can they be zoomable?

... implement a zooming mechanism. Is on my TODO list :-)

Are there some break-off points for when these maps go from sane and useful into chaos and pain?

Actually it is the other way round: the more you zoom in, the more boring the maps become. They must become so, given the way I compute them.

That is actually the perfect place to render more document details. We have some outrageous ideas here regarding that....

Also, the colors to me look a bit funny where you've got greeny hills and brown plains ...

Ok, coloring is a "cultural problem". I already realized that.

Actually I was approaching this from my Central European viewpoint: dark brown = high. But "coloring" is an open issue.

Keep asking. It really helps me here.

rho | Mon, 03/01/2010 - 13:07

More questions

Ok, that's cool. So let's dig in a bit deeper. I myself am currently in the throes of making a generic Topic Maps viewer (only sexier than what is currently out there :) that would be useful for most maps. In doing so you stumble upon the whole size-of-map vs. size-of-view conundrum: how do you cluster and package similar but importantly distinct topics and associations in a reasonable fashion? A small map with, say, 2000 TAO's (Topics, Associations, Occurrences for the newbies :) will necessarily be viewed differently than a 20M TAO's one.

The follow-up to that one is what kind of maps are reasonable to expect? This is a mix between the type of customer and the ontology in question, the type of collected data (granularity, parity, clustering, thresholds of interest, etc.), time (temporality), and so on. I am myself trying to think of better ways to figure out what people would like to see in these maps, and I realise that a *lot* of your work must be doing these things, too. Are these parameters tweakable? (I do understand we might be getting into the "why my application is better than yours" territory here :)

One thing that keeps haunting me is that the mixing parameters change all the time, both in keeping with the contents of the maps, but also the context, my moods, my needs, and so on. Maybe one could develop a smallish ontology that deals with some of these things? (So, I change parameters in my ontology, and all my ontology-aware maps change accordingly.)

Oh, and have you thought about doing this kind of map for email? I think that would be pretty kick-ass!

Alex (not verified) | Tue, 03/02/2010 - 02:35

Re: ad clustering

.... how do you cluster and package similar but importantly distinct topics and associations in a reasonable fashion?

The short answer is taxonomy, implied or explicit. The way I compute the vector space tries to cope with the dimensions as best as possible.

This, of course, shifts the burden to selecting the proper dimensions of the vector space. This is where additional analysis kicks in.
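To make that a bit more concrete, here is a toy sketch of how a shallow taxonomy could supply the dimensions of such a vector space (the taxonomy, the topics and the counts are invented; the real pipeline does considerably more):

    # Sketch only: a shallow taxonomy defines the axes, each topic becomes
    # a vector of how strongly it touches each axis.
    taxonomy = ["paradigm", "implementation", "vendor", "operations"]

    topics = {
        "mapreduce": {"paradigm": 5, "implementation": 1},
        "hadoop":    {"implementation": 4, "operations": 2},
        "ec2":       {"vendor": 3, "operations": 3},
    }

    def to_vector(counts, dims):
        return [float(counts.get(d, 0)) for d in dims]

    vectors = {name: to_vector(c, taxonomy) for name, c in topics.items()}
    # vectors["hadoop"] -> [0.0, 4.0, 0.0, 2.0]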

rho | Tue, 03/02/2010 - 11:08

Taxonomy

Yes, "as best as possible", but what the heck *is* that? :)

Do you deal with taxonomies that are huge or small? (I guess here I'm asking if you're using some pre-existing upper ontology) And if huge, do you only visualize parts of it as TAO's hit the proximity, or do artifacts from the taxonomy that aren't really part of your data sneak into it?

(And I'll try to follow your new crazy regime of a separate answer to each question ... :)

Alex (not verified) | Tue, 03/02/2010 - 11:21

Re: Taxonomy

Yes, "as best as possible", but what the heck *is* that? :)

This is difficult to answer without mathematics. And the water is further muddied by shortcuts I take to keep the processing time low.

Do you deal with taxonomies that are huge or small? (I guess here I'm asking if you're using some pre-existing upper ontology)

Very small taxonomies, and these are shallow and completely ad hoc. Not that you cannot use something more educated, but to my surprise the aggregation is already strong with shallow type systems.

And if huge, do you only visualize parts of it as TAO's hit the proximity, or do artifacts from the taxonomy that aren't really part of your data sneak into it?

From the way I compute it, the whole ancestry of an instance/type would have an impact. So if you had a deep ontology, every supertype would have an influence.

Currently I use entropies and a Gaussian window, so that top-level types have only little influence. Which makes sense, because I do not want to end up with a landscape labelled "Thing" all over the place :-)
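A toy illustration of that damping idea, with a Gaussian window over the depth of an ancestor above the instance (the type chain, the sigma and the exact formula are my own assumptions, not the code behind the maps):

    # Sketch: nearby supertypes keep influence, top-level types barely count.
    import math

    def ancestry_weights(ancestors, sigma=1.0):
        """ancestors: ordered from the direct type up to the root."""
        return [math.exp(-(depth ** 2) / (2.0 * sigma ** 2))
                for depth, _ in enumerate(ancestors)]

    chain = ["hadoop-tool", "software", "artifact", "thing"]
    for name, w in zip(chain, ancestry_weights(chain)):
        print(f"{name:12s} {w:.3f}")
    # hadoop-tool  1.000
    # software     0.607
    # artifact     0.135
    # thing        0.011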

rho | Wed, 03/03/2010 - 09:48

Re: ad size scalability

A small map with, say, 2000 TAO's (Topics, Associations, Occurrences for the newbies :) will necessarily be viewed differently than a 20M TAO's one.

But a mega map (such as Wikipedia) will also have an overall structure. So from an interface-interaction view, the map will look all the same. It's just the planet as seen from space. But you can zoom in. Conceptually easy, although a bitch to do properly programming-wise and data-management-wise.

rho | Tue, 03/02/2010 - 11:09

TAO's

But can you even do a smooth zoom with these computationally intensive maps? (I guess you could vectorize them and pop them through some SVG / Flash thing, or pre-render a few million bitmaps of the thing ... hey, maybe even just use the Google Maps API and feed in the bitmaps that way)

As to the structure, yes there will be similarities. I guess the question here would be whether the map, zoomed out, makes sense at whatever scale you're at. Zooming in on a topic is fine as you focus, but when your map is reasonably unfocused, will it be helpful?

Alex (not verified) | Tue, 03/02/2010 - 11:25

Re: TAOs

But can you even do a smooth zoom with these computationally intensive maps?

No, not yet. But I have a plan. :-)

As to the structure, yes there will be similarities. I guess the question here would be whether the map, zoomed out, makes sense at whatever scale you're at. Zooming in on a topic is fine as you focus, but when your map is reasonably unfocused, will it be helpful?

Here I take the position that the map parallels what the topic map is about. If that topic map contains all sorts of things, so will the visualized map. If the topic map is about MapReduce, then so will the map.

Total author control.

rho | Wed, 03/03/2010 - 10:08

Re: ad ontological seeding

The follow-up to that one is what kind of maps are reasonable to expect? This is a mix between the type of customer and the ontology in question, the type of collected data (granularity, parity, clustering, thresholds of interest, etc.), time (temporality), and so on.

You already gave yourself the answer. The idea is that all this is controlled by the (amount of) ontology you put into that thing. That most certainly depends on the use case: For a company intranet you will probably invest more than for an "open internet" thing like the mapreduce example.

I am myself trying to think of better ways to figure out what people would like to see in these maps, and I realise that a *lot* of your work must be doing these things, too.

You will notice that the Topic Map aspect of the map is not visible to the user. On the contrary, I actively push away from the paradigm, as I do not believe in the subject-centricity of things. I believe in subjects in context, like the mapreduce example.

rho | Tue, 03/02/2010 - 11:10

Subject-centric

Oh, I agree, screw Topic Maps; those things are terribly out of fashion, no one uses them anyways, and who the hell cares about the technology that may or may not run things (except geeks; geeks care)?

But I also think there's room for *both* subject-centricity and subject contextual clustering. Some topics are interesting all by themselves, while others are only interesting in context. Heck, this is what I struggle with half the time, and I suspect so does everybody else doing TM technology who cares about being generally useful. (And if you aren't generally useful, go play with the other RDBMS kids instead ...)

Alex (not verified) | Tue, 03/02/2010 - 11:31

Re: Subject-centric

Oh, I agree, screw Topic Maps ... and who the hell cares about the technology that may or may not run things ...

Yup, this is internal.

But I also think there's room for both subject-centricity and subject contextual clustering. Some topics are interesting all by themselves, while others are only interesting in context. Heck, this is what I struggle with half the time, ....

I agree. But I am too lazy (and too pretty) to model this context sensitivity by hand in my topic maps. I would rather seed the process with my topic map and let the machine do the walking.

rho | Wed, 03/03/2010 - 10:14

Re: ad parametrized customization

Are these parameters tweakable?

Unfortunately, very much so! :-)

This is actually one of the bigger problems for which I have only incomplete solutions: What is the proper dimensionality? What are the conceptual features with which to populate the vector space? What should the convergence behaviour be (slow/smooth vs. fast/edgy)? And so on.

At the moment my configuration space has around 30 axes, some of which have an internal structure as well.

rho | Tue, 03/02/2010 - 11:11

Re: ad working contexts

One thing that keeps haunting me is that the mixing parameters change all the time, both in keeping with the contents of the maps, but also the context, my moods, my needs, and so on.

Here my working hypothesis is that (a) working contexts MUST emerge from the map itself, and (b) that we humans overlap heavily in these contexts.

Look at the MapReduce thing. At one point you are in the context

"What is that MapReduce thing all about?"

Later you wonder:

"What can I do with it?"

Then, even later:

"How can I solve problems with Hadoop?"

Then, when fed up with Java, you may ask

"Are there any non-java alternatives?"

and so on.

If your working contexts do not emerge from the document corpus, then there is something wrong with the corpus. Or with you :-))

rho | Tue, 03/02/2010 - 11:12

Context-driven

"working contexts MUST emerge from the map itself"

Yes, they must, otherwise you wouldn't be generally useful, I think, and for most TM tools out there this is sadly lacking. There is hardly any serious and exciting analysis of the maps and the contexts within them.

I'm currently playing with gaining clustered contexts through counting hops between proxies that seem important (where "important" is a bit of black magic involving counts, type-structures and occurrences shared by clusters) and a slightly fuzzy upper ontology. Man, I wish I had more time, not to mention better math skillz. How do I distribute a cluster of vector points over a circle and calculate their "kinetic" distance and animate them such?

Have you done any analysis over the Wikipedia corpus yet? I started on the abridged XML export (1.2Gb or so) but got impatient with my SAX implementation and never got any further. I reckon a lot of researchers must be doing this, no?

Alex (not verified) | Tue, 03/02/2010 - 11:38

Re: Context-driven

I'm currently playing with gaining clustered contexts through counting hops between proxies that seem important ....

This is exactly what I do here when it comes to analysing the topic map topology. I do not count the hops, though, but use a metric based on the entropy of the association type.
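As a rough sketch of what such an entropy-based weight could look like (the association data below is invented; the actual metric is not spelled out here): score an association type by the entropy of how its instances distribute, and use that instead of a flat hop count.

    # Invented example: entropy of an association type over its player-type pairs.
    import math
    from collections import Counter

    def type_entropy(player_type_pairs):
        counts = Counter(player_type_pairs)
        total = sum(counts.values())
        return -sum((n / total) * math.log2(n / total) for n in counts.values())

    # Hypothetical instances of an "is-implemented-by" association:
    pairs = [("paradigm", "software")] * 8 + [("paradigm", "vendor")] * 2
    print(round(type_entropy(pairs), 3))   # 0.722 -- low entropy, "predictable" type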

How do I distribute a cluster of vector points over a circle and calculate their "kinetic" distance and animate them such?

Kinetics would involve movement. I shy away from having too many things move in an interface. It confuses me. But when it comes to spring models (very popular), there are implementations in pretty much every programming language.
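For completeness, one relaxation step of such a spring model might look like this (graph, positions and constants are all invented; any of the usual graph libraries would do this for you):

    # Toy spring-embedder step: pull each connected pair toward a rest length.
    import math

    def spring_step(pos, edges, rest=1.0, k=0.1):
        forces = {n: [0.0, 0.0] for n in pos}
        for a, b in edges:
            (ax, ay), (bx, by) = pos[a], pos[b]
            dx, dy = bx - ax, by - ay
            dist = math.hypot(dx, dy) or 1e-9
            f = k * (dist - rest)                 # Hooke-style spring force
            forces[a][0] += f * dx / dist; forces[a][1] += f * dy / dist
            forces[b][0] -= f * dx / dist; forces[b][1] -= f * dy / dist
        return {n: (pos[n][0] + fx, pos[n][1] + fy)
                for n, (fx, fy) in forces.items()}

    pos = {"mapreduce": (0.0, 0.0), "hadoop": (3.0, 0.0), "pig": (3.0, 2.0)}
    edges = [("mapreduce", "hadoop"), ("hadoop", "pig")]
    pos = spring_step(pos, edges)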

Have you done any analysis over the Wikipedia corpus yet?

Not yet, but I have concrete plans to attack that in the near future.

I started on the abridged XML export (1.2Gb or so) but got impatient with my SAX implementation and never got any further. I reckon a lot of researchers must be doing this, no?

Most probably. But I do not go down the XML Wikipedia path. I would rather download a DBpedia excerpt in CSV!

rho | Wed, 03/03/2010 - 10:22

Re: misc

Oh, and have you thought about doing this kind of map for email? I think that would be pretty kick-ass!

Maybe, but I still need a bit of ontological backboning. Whether the IMAP folder structure is enough, I have never tested.

Meta: Boy, that really helped... Thanks a ton!

rho | Tue, 03/02/2010 - 11:12