TM::IP RESTful in Peace
I had mentioned earlier that I have now reorganized my new TM server (based on Catalyst/mod_perl/Apache) along the REST paradigm. In my case this means that not only TM data, but also documents attached to it, vector spaces, and so forth are exposed RESTfullish.
At first this appeared to be more RESTfoolish, as it was quite difficult to squeeze everything into a GET/PUT/POST corset. It was also much more work than I had planned to invest, mostly because not only the original resources but also all machine learning processes have to be exposed, be it only via their configuration parameters. And they have plenty of those.
But I seem to have reaped the benefits much earlier than anticipated. Read on.
My web application lets users manage their maps, affiliated documents and all sorts of analytics over the semantic space the maps cover.
Some of these analyses are instantaneous, others take a minute, others may take hours. That is definitely not a time a user is willing to spend in front of a rotating spinner inside the browser window. So these processes have to be backgrounded and, if possible, routed to an idle server somewhere in my huge server farm (I now have 2!).
Bring Data to the Worker
The nice thing about the RESTfulness of my resources is that they are available wherever a worker requests them (ship the data to the computation). This is feasible because the data itself is quite small.
And when a worker is finished, it simply sends the results back to the 'central' web server, using either PUT or POST, as appropriate.
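In REST terms the choice is mechanical: PUT when the worker already knows the URI of the result resource (an idempotent update), POST to a collection when the server has to mint a new URI. A tiny sketch of that decision (the helper and the URL layout are invented for illustration, not my actual code):

```perl
use strict;
use warnings;

# Hypothetical helper: decide how a worker ships a result back.
# PUT  when the result resource already has a known URI,
# POST to a collection when the server must assign one.
sub result_request {
    my ($base, $result) = @_;
    if (defined $result->{uri}) {
        return (PUT => "$base$result->{uri}");
    }
    return (POST => "$base/$result->{collection}");
}

my ($method, $url) =
    result_request('http://example.org', { uri => '/maps/42/corpus' });
print "$method $url\n";   # PUT http://example.org/maps/42/corpus
```

The actual transfer is then just one HTTP request with that method and URL.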
One complication I have is that the update of one resource A, say a map, triggers an update of resource B, say the map corpus. And once that is done, all dependent vector spaces and eventually the mapscape (a landscape of the map) have to be recomputed as well.
This all resembles what you would model with Makefiles or ANT. But here some of these computations can be done in parallel, while others must run sequentially.
The way I solved it is to
- capture the dependency patterns in rules over URLs, and
- whenever a resource is updated, to use its URI to match against these rules and expand a Petri net.
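To give a flavour of such rules (the URL layout and the rule set here are invented for illustration): each rule pairs a pattern over the updated URL with the URIs that depend on it.

```perl
use strict;
use warnings;
use 5.010;   # named captures, say

# Hypothetical rule set: a regex over the updated URL on the left,
# a closure producing the dependent URIs on the right.  Note that
# the rules look at nothing but the URL itself.
my @rules = (
    [ qr{^/maps/(?<map>\w+)$},
      sub { ("/maps/$+{map}/corpus") } ],
    [ qr{^/maps/(?<map>\w+)/corpus$},
      sub { ("/maps/$+{map}/vectors", "/maps/$+{map}/mapscape") } ],
);

# Given an updated URI, collect everything that depends on it.
sub dependents {
    my ($uri) = @_;
    my @out;
    for my $r (@rules) {
        my ($re, $expand) = @$r;
        push @out, $expand->() if $uri =~ $re;
    }
    return @out;
}

say for dependents('/maps/42');          # /maps/42/corpus
say for dependents('/maps/42/corpus');   # /maps/42/vectors, /maps/42/mapscape
```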
The dependency patterns make heavy use of the regular expression features newly available in Perl 5.10. Not for the faint-hearted. The nice and simple thing about these rules is that they only need to look at the URL. No further context exists. Or can exist.
Whenever a worker makes an update or the user changes a configuration, the dependency rules are consulted, the Petri net is updated (be it only the time stamp of the last modification), and then a new set of parallel computations is launched. Quite beautiful.
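A toy version of one such sweep, just to illustrate the idea (the net structure and time stamps below are made up): places are resource URIs with last-modification stamps, a transition is fireable when every input is newer than its output, and all fireable transitions of one sweep may run in parallel.

```perl
use strict;
use warnings;
use 5.010;

# Toy marking: resource URIs with last-modification stamps.
my %mtime = (
    '/maps/42'          => 100,   # the map was just updated
    '/maps/42/corpus'   => 90,
    '/maps/42/vectors'  => 95,
    '/maps/42/mapscape' => 95,
);

# Toy transitions: recompute 'out' from 'in'.
my @transitions = (
    { in => ['/maps/42'],        out => '/maps/42/corpus'   },
    { in => ['/maps/42/corpus'], out => '/maps/42/vectors'  },
    { in => ['/maps/42/corpus'], out => '/maps/42/mapscape' },
);

# A transition is fireable when every input is newer than its output;
# all fireable transitions of one sweep form one parallel batch.
sub fireable {
    my (%stamp) = @_;
    grep {
        my $t   = $_;
        my $out = $stamp{ $t->{out} } // 0;
        !grep { ($stamp{$_} // 0) <= $out } @{ $t->{in} };
    } @transitions;
}

say $_->{out} for fireable(%mtime);   # only /maps/42/corpus is stale
```

Once the corpus has been recomputed and its stamp bumped, the next sweep finds the vectors and the mapscape fireable, and those two can run in parallel.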
And all the control logic resides in a Catalyst end controller of around 20 lines.
The Real World
While this MUST theoretically work, things tend to break: workers dawdle or fail, their responses disappear, job servers crash, networks split, mothers-in-law come for a visit, etc.
Finding a job queuing and distribution system for Perl took me a whole week. There are many, but each of them has its problems. After quite some testing I ended up using an unlikely suspect: Gearman. It offers:
- no worker management (you have to keep the workers alive yourself),
- no persistent job queue (if the job server dies, all submitted jobs are lost),
- no way to kill already running jobs, and
- no way to query the status of jobs.
But OTOH it is extremely light-weight. :-)
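Since worker management is left to you, the first gap is easy to plug with a small respawn loop. A pure-Perl sketch (in reality the child would exec the actual Gearman worker script rather than call a passed-in sub):

```perl
use strict;
use warnings;

# Minimal respawn loop (a sketch, not my production code): keep
# $want workers alive, restarting any that die, until a restart
# budget is exhausted.  Every reaped child consumes one unit of
# the budget.
sub supervise {
    my ($want, $budget, $run_worker) = @_;
    my %kids;
    my $restarts = 0;
    while ($restarts < $budget) {
        while (keys(%kids) < $want) {
            defined(my $pid = fork) or die "fork: $!";
            if ($pid == 0) { $run_worker->(); exit 0 }
            $kids{$pid} = 1;
        }
        my $gone = waitpid(-1, 0);     # block until a worker dies
        last if $gone <= 0;
        delete $kids{$gone};
        $restarts++;
    }
    kill 'TERM', $_ for keys %kids;    # shut down the survivors
    waitpid($_, 0) for keys %kids;
    return $restarts;
}

print supervise(2, 3, sub { }), " restarts\n";   # 3 restarts
```

The persistent-queue gap is harder; for that, the Petri net itself is the fallback, since a lost job simply leaves its output place stale and is rescheduled on the next sweep.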
Amazingly, the REST-dependency algorithm now survives every imaginable combination of failures. At least that is what my torture test suite tells me.