So, this morning my alarm clock was inexplicably set to the Central time zone and consequently I arose two hours before my regularly scheduled time. Unable or unwilling to go back to bed after my morning “get ready for work” rituals, I settled down to read a few articles from my RSS feeds. The first one I read was a rather odd post, Distributed OSGi – Tilting at Windmills by Roger Voss. I laughed a bit at the opening paragraph but didn’t think much more about the post, as it had nothing to do with what the OSGi RFC 119, “Distributed OSGi”, is actually about. Fortunately or unfortunately, several people started peppering me with tweets, IMs and emails asking whether I had read the post in question and what my thoughts about it were. So, here they are.
Basically, I’m not really going to defend the OSGi EEG RFC 119, “Distributed OSGi”, as my own interest in the matter rested largely with the acquisition of the so-called “registry hooks”, which allow infrastructure developers such as myself to hook into queries of the OSGi service registry and do cool things like manifest services on demand. Once this capability was present and decoupled from RFC 119, I felt I had all the tools I needed to do any damn thing I wanted to, regardless of what RFC 119 was doing. (For background on how I fell in love with the idea of the registry hooks, see my posts on remote OSGi, which predated RFC 119, here and here, and a post from during the formulation of the RFP which led to RFC 119 here.)
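To make “manifest services on demand” a bit more concrete, here is a minimal sketch of the idea in Python. It is a toy registry, not the OSGi API: `ToyRegistry`, `add_find_hook`, `on_demand` and the `remote.` naming convention are all hypothetical names invented for illustration, and calling the hook before the lookup is an assumption of this sketch rather than a statement about how the real hooks behave.

```python
class ToyRegistry:
    """A toy service registry (hypothetical, not the OSGi service registry)."""

    def __init__(self):
        self._services = {}
        self._find_hooks = []

    def register(self, name, service):
        self._services[name] = service

    def has(self, name):
        return name in self._services

    def add_find_hook(self, hook):
        # A hook is any callable taking (registry, name).
        self._find_hooks.append(hook)

    def find(self, name):
        # Assumption of this sketch: every hook sees the query before lookup,
        # so a hook gets the chance to register something just in time.
        for hook in self._find_hooks:
            hook(self, name)
        return self._services.get(name)


created = []  # records which services the hook had to manifest


def on_demand(registry, name):
    # Hypothetical policy: names starting with "remote." are proxied lazily.
    if name.startswith("remote.") and not registry.has(name):
        created.append(name)
        registry.register(name, lambda: f"proxy for {name}")


registry = ToyRegistry()
registry.add_find_hook(on_demand)
greeter = registry.find("remote.greeter")  # created by the hook on first query
```

The point of the sketch is only that the client asks the registry an ordinary question and never learns that the answer was conjured up in response to the asking.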
(Part I – Preliminaries, Part II – The Long Kiss Goodnight)
<sigh> Sorry. I tend to talk more about the surrounding atmosphere than the thing itself. In this post I hope to remain focussed and discuss the actual meat of the architecture and the skeleton upon which it is based. Apologies for not providing the color and fluff of life that actually surrounded the process.
Point 1: Humans are good at making declarative statements and woefully incompetent at micromanagement.
This is one of those points that one shouldn’t have to make, but the simple fact that bears repeating is that humans are really good at figuring out what should be done, but really shitty at actually – you know – doing what they think should be done. In keeping with the spirit of the theme of this post, I leave it to the reader to reflect upon the profound reality that it is far easier to see the goal than the path to it. The profound insight I have (not singular, as I’m sure you’re aware) is that a management system which caters to the micro-manager depends on micro-managers who are actually competent at orchestrating a complex series of transformations under chaotic conditions, and such people are few and far between – so rare as to be non-existent to several orders of approximation (or so expensive, which amounts to the same thing).
The takeaway is simply that any system which depends on a human to do the actual work is simply not going to work – by definition. Seeing as how this is one of my premises, it’s not something I can really argue. It’s a premise derived from years of observing not just other humans but also myself. Again, I’m not making the claim that there are no humans who are absolutely brilliant at micro-managing large scale distributed systems – let’s be crystal clear on that point. No. My point is simply that these people are incredibly rare and you – the actual person paying the price – will have to pay through the nose if you find such a person. And, quite frankly, the chances of you actually finding such a person are so minuscule as to be almost unmeasurable. Most likely, what you’ll do is find someone who claims to be such a person, or someone whom someone you trust claims to be such a person. And the odds are overwhelming that you are just a complete maroon and have been hoodwinked into paying a lot of money for a cheap imitation of such a being. Get used to it. It’s just a simple fact of reality.
When you’re trying to build a massively multiplayer online gaming platform (MMOG), probably the most important part of the system is scalability. After all, if it doesn’t scale, it’s simply a multiplayer online gaming platform – without the “massive”. While it almost seems embarrassing to point this out, it’s extremely interesting to note that there has recently been a lot of discussion about the scalability of online systems – in particular, the Web 2.0 applications. I won’t point to these discussions, but suffice it to say that I find it terribly amusing to hear the various forms of the argument that you can worry about scalability later – i.e. that it’s not something that has to be designed in from the start. (Arguments of the form “don’t worry about scalability because no one is going to use your application anyway” are perfectly fine, however.) As the history of MMOG has shown, the application’s architecture has a huge impact on the ability to scale. As many gaming platforms have discovered, scalability isn’t something you can simply “add on” after you “get things right”. Anyone who thinks that this doesn’t apply to other network application architectures amuses me to no end, given that if they actually produce something of value, it will fall over when it hits the natural scalability limit of their crappy architecture.
In any event, there are a couple of basic problems with MMOG that limit scalability. The first has to do with what is known as “Area Of Interest”. The idea here is familiar enough to anyone who has done any distributed communication: the gaming platform doesn’t want to find itself in an N² connection topology. In MMOG, the entities (gamer avatars, NPCs, etc) have to communicate with other entities in the game. If you can’t find a way to limit the communication to the entities in the area of interest – i.e. the other entities that the entity in question actually needs to communicate with – then you have a huge scalability issue due to sending messages to entities that simply don’t care about the communication because it can’t possibly affect them. This not only wastes bandwidth and precious OS network resources but causes a host of other issues having to do with the time ordering of distributed events and filtering out events that aren’t relevant. It’s a mess.
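One common way to get out of the all-pairs trap is to bucket entities into a spatial grid and only consider the neighbouring cells when deciding who hears a message. What follows is a minimal sketch of that idea, not a description of any particular platform: the cell size, the entity names and the helper functions `cell_of`, `build_grid` and `recipients` are all assumptions made up for illustration.

```python
from collections import defaultdict
from itertools import product

CELL = 10.0  # hypothetical cell size; must be >= the interest radius


def cell_of(pos):
    # Map a 2D position to the integer coordinates of its grid cell.
    return (int(pos[0] // CELL), int(pos[1] // CELL))


def build_grid(entities):
    # entities: dict of name -> (x, y). Returns cell -> list of names.
    grid = defaultdict(list)
    for name, pos in entities.items():
        grid[cell_of(pos)].append(name)
    return grid


def recipients(name, entities, grid, radius=CELL):
    # Only the 3x3 block of cells around the sender is examined, so the
    # cost depends on local density rather than on the total population.
    x, y = entities[name]
    cx, cy = cell_of((x, y))
    found = []
    for dx, dy in product((-1, 0, 1), repeat=2):
        for other in grid.get((cx + dx, cy + dy), ()):
            if other == name:
                continue
            ox, oy = entities[other]
            if (ox - x) ** 2 + (oy - y) ** 2 <= radius ** 2:
                found.append(other)
    return sorted(found)


entities = {
    "alice": (1.0, 1.0),
    "bob": (4.0, 5.0),
    "npc_guard": (12.0, 3.0),
    "carol": (95.0, 95.0),
}
grid = build_grid(entities)
```

Here `recipients("bob", entities, grid)` considers only the entities in the cells around bob, and carol, off in the far corner of the map, never enters the picture. A real platform also has to keep the grid current as entities move, which this sketch ignores entirely.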
(Part I – Preliminaries
Part III – The First Cut is the Deepest)
From my perspective, one of the major pitfalls in any project which starts out to produce a management infrastructure is that the project almost immediately starts focussing on the API layer rather than on defining the large scale system behavior. In many ways, this is completely understandable, given that the API has the most immediate impact on the first users of the system – i.e. those hapless fools that form the brigade of developers who have to integrate their systems into your management infrastructure. In most large organizations, APIs become the mechanism that groups use to mediate their interaction – not just at the Java level, but in a visceral sense that governs the actual political interaction between the groups. This is somewhat because APIs are something concrete and form a nucleus around which people can argue. But mostly it’s because most people are rather ignorant of how systems actually interact; the one thing they do know is that there are APIs, and consequently these concrete, universally understood handles become the battleground upon which system integration takes place. Or, put another way, APIs are the lowest common denominator that even managers can understand, and consequently they are the only focus of pretty much every large scale project.
But the problem with this focus is that an API doesn’t define a system, rather it’s the other way around. The way I think about it is that the APIs of a system are like the inner core of a sphere. Defining the surface of the system – i.e. what the system “looks” like – will provide enormous leverage on the internals of that system. And this leverage will simply force things into place – meaning that the reason the API exists is because it is literally the inevitable result of the forces that keep the system together.
But I digress.
Update: Part II – the long kiss goodnight, Part III – The First Cut is the Deepest
C’mon in! The water’s wonderful!
This is the first in a series of posts documenting the research I’ve been doing into a different way of thinking about system management infrastructure. For quite some time, I’ve been obsessed with the idea of how to simply and effectively manage large scale systems. Throughout this obsession, I’ve travelled down various roads and found myself in several box canyons along the way. I’ve tried out a lot of different strategies and have finally settled into something which provides the kind of framework I’ve been looking for which I haven’t found replicated anywhere.
Note that I’m certainly not making the claim that it is “Teh Best” management infrastructure. Rather, what I’m claiming is that it’s the most interesting management architecture to me. As anyone who knows me can testify, I have rather peculiar tastes and I am a strange bird at times. So fair warning, eh?
In any event, what I plan to do is to provide a fairly deep dive into the architecture that I’ve come up with. In the standard tradition of all literature scientific and technical, it will be presented in precisely the opposite order in which I actually came up with things – i.e. from the top down, in a semi-coherent form that makes sense. Lord knows that actual discoveries and explorations are more a matter of luck in which you discover something and then spend an inordinate amount of time tracking down why the heck you managed to stumble upon it and where it fits in the larger picture of things that you’re trying to map out. I’ve always found this cognitive dissonance amusing, myself, and hope you won’t mind too much when I veer off into seemingly irrelevant paths rather than sticking to the point at hand.
If you’re one of those people who can’t wait until the end of the story to find out what’s going on, by all means download the PDF of my talk on the subject at last year’s Spring Experience entitled Digging the Trenches on the Ninth Level. If you’re not familiar with Dante’s Divine Comedy, then you won’t get the joke. But suffice it to say that I’m a big believer in the principle that every time you solve a problem, you discover ten more problems that you didn’t know you had.
A perfect example of this sometimes perverse law is something as simple as email. Email solved a lot of problems that a modern economy and social population have, but in doing so it created a lot more. Without email, we would never have been subjected to the sublime beauty of penile extension spam, nor would your grandmother be subjected to the horror of ID phishing which, you discover, has snagged her bank account and drained all her life’s savings, leaving you with a predicament that makes you wonder what all this progress was supposed to do in the first place.
Likewise, I firmly believe that in solving the problems I believe have been addressed by OSGi, Spring DM and management architectures like mine, we’ve inadvertently unleashed new levels of horror that will ensure future generations will curse our names as they suffer from the fallout, live through the unspeakable abominations unleashed by these “solutions” and witness them unfold in ways that we couldn’t possibly imagine.
So, with that cheery panorama as the backdrop, I’ll end this introductory post and start working on the next post, which provides a high level overview and the ten dollar tour of the sewers that I’ve been digging for your benefit on the ninth level of hell.
Remember. I dig because I care. After all, you do want that frozen crap to be routed somewhere and dealt with, don’t you?
My talk at last year’s Spring Experience on the next generation of application server architecture is available here.
The talk is about OSGi and how the next generation of application server platforms will simply do away with the cumbersome and rather dated component models that we all know and hate in favor of the vastly superior OSGi platform. Or that’s the theory at least. Only time will tell if I’m correct or just another mad hatter sniffing too much mercury outgassing from the various toys littering his office.
In addition, I also lay out the management architecture I’ve been experimenting with for the past year. Obviously, it uses OSGi as its base, but OSGi – by itself – isn’t sufficient to provide the kind of management infrastructure you need to manage large numbers of processes. I call this management architecture – for lack of a better name – Event Driven Autonomic Management. I’ll be kicking off a series of posts going into far more detail on this architecture as a means of documenting the research I’ve been doing.
Think of it as therapy: talking about it on this blog is posting, so to speak, to the wind about concepts and issues that no one else seems to find terribly interesting or useful. You can tell I’m a great hit at parties, can’t you?
I’ve been having a great time working on my MMOG framework, 3rd-Space. Since December, I’ve been focussing on the actual event driven simulation framework, as this framework is key to the system’s performance and scalability. The work is based on the simulation framework described by Rimon Barr in his thesis, An Efficient, Unifying Approach to Simulation Using Virtual Machines – something he calls “JiST” (Java in Simulation Time). It’s a very cool event driven simulation framework that uses Java, itself, as the simulation scripting language. He turns the Java classes into a simulation by transforming the Java classes using a byte code rewriting framework. The result is a very easy to use, completely type safe simulation framework that completely blows the doors off the closest competing C++ event simulation frameworks in terms of raw performance and scalability. The performance is so good because the framework uses the Java virtual machine as the engine for running the simulation.
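For anyone who hasn’t played with event driven simulation, the core of the idea fits in a few lines: a clock, a priority queue of timestamped events, and a loop that jumps the clock from one event to the next. The sketch below is in Python and is emphatically not JiST’s API or its byte code rewriting approach; the `Simulation` class, its `schedule` and `run` methods and the `ping` example are hypothetical names made up purely to show the underlying mechanism.

```python
import heapq
from itertools import count


class Simulation:
    """A bare-bones discrete event loop (illustrative only)."""

    def __init__(self):
        self.now = 0
        self._queue = []
        self._seq = count()  # tie-breaker: same-time events run in FIFO order

    def schedule(self, delay, action, *args):
        # Events are ordered by simulation time, then by insertion order.
        heapq.heappush(
            self._queue, (self.now + delay, next(self._seq), action, args)
        )

    def run(self, until=float("inf")):
        # Simulation time jumps straight to the next event; no wall-clock
        # waiting is involved.
        while self._queue and self._queue[0][0] <= until:
            self.now, _, action, args = heapq.heappop(self._queue)
            action(*args)


log = []
sim = Simulation()


def ping(n):
    log.append((sim.now, "ping", n))
    if n < 3:
        sim.schedule(5, ping, n + 1)  # an event scheduling a future event


sim.schedule(0, ping, 1)
sim.schedule(7, lambda: log.append((sim.now, "tick")))
sim.run()
```

In this toy version the scheduling is explicit. As described above, the attraction of the JiST approach is that the rewriting step lets ordinary-looking Java classes serve as the simulation script, so you don’t have to write the plumbing by hand.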
So, as many of you may already know (and if you don’t, then you’re way out of date, dude), Oracle acquired the Tangosol company and their Coherence product some time ago. Over the last half year, I’ve been working on some very cool stuff related to the whole data grid aspect of Coherence which I must say is pretty much the most bitchin’ stuff I’ve done in quite a while.
Well, that was until I started my hobby project.
As you may know, the OSGi Enterprise Expert Group is currently in the process of defining the interaction with “external” systems and formalizing the idea of what it means to have a “distributed” registry in OSGi. Having been through what seems like the same process now for the third time (EJB, SCA and now this effort), the overriding issue seems to be that distributed systems “are not like local systems”. One would have thought that this nugget of wisdom would be deeply lodged in the DNA of anyone who’s done anything with distributed systems for more than – say – a bit of example code found on The Server Side, but sometimes we do have to bring what should be obvious facts to the fore and illuminate them with Klieg lights so that the current, next and last generation who hasn’t figured this out yet will.
Amongst the things that are often brought up in these discussions are things like asynchronous messaging, marker interfaces, and – of course – the holiest of holies: METADATA. Now, naturally, I do agree with all of this. Asynchronous message patterns are darn important in not just distributed computing, but in “local” computing as well – so important to me are these patterns that I actually wrote a framework called Anubis for doing this and it’s something that literally provides a great deal of the foundation of the work that I do on a daily basis. Ditto for METADATA and all that jazz…
But what I believe should be obvious to people, and for some reason seems to be not so obvious, is that all these things have absolutely nothing to do with OSGi. None. Nada. Zip. Zero. And what’s really odd to me is that when I say things like that, people believe that I’m saying “things like asynchronous messaging, marker interfaces, METADATA, etc, are not important”. Naturally, because of this misunderstanding, I spend 99% of my time trying to point out the fact that – yes – I actually do believe these things are of crucial importance and that – yes – these issues need to be resolved with good solutions.
The point that never seems to get across is that I believe these issues are orthogonal to OSGi.
A couple of people have asked about the differences between the implementation I have for a distributed OSGi framework and the framework described by Jan Rellermeyer (see my previous entry on the subject). In a nutshell, the primary difference is that the way services are advertised and discovered is completely hidden in my implementation. The implications, though, that this small difference has for usability and integration with existing OSGi frameworks are rather profound.