Event Driven Autonomic Management – The Long Kiss Goodnight

(Part I – Preliminaries
Part III – The First Cut is the Deepest)

From my perspective, one of the major pitfalls in any project which sets out to produce a management infrastructure is that the project almost immediately starts focussing on the API layer rather than defining the large scale system behavior. In many ways, this is completely understandable, given that the API has the most immediate impact on the first users of the system – i.e. those hapless fools that form the brigade of developers who have to integrate their systems into your management infrastructure. In most large organizations, APIs become the mechanism that groups use to mediate their interaction – not just at the Java level, but in a visceral sense that governs the actual political interaction between the groups. Partly this is because APIs are concrete and form a nucleus that people can argue about concretely. But mostly it’s because most people are rather ignorant of how systems actually interact; the one thing they do know is that there are APIs, and so these concrete, universally understood handles become the battleground upon which system integration takes place. Or, put another way, APIs are the lowest common denominator that even managers can understand, and consequently they become the sole focus of pretty much every large scale project.
But the problem with this focus is that an API doesn’t define a system; rather, it’s the other way around. The way I think about it is that the APIs of a system are like the inner core of a sphere. Defining the surface of the system – i.e. what the system “looks” like – will provide enormous leverage on the internals of that system. And this leverage will simply force things into place – meaning that the API exists because it is literally the inevitable result of the forces that hold the system together.
But I digress.


As I was saying, focussing on the API really has perverse effects on developing systems, and in something as sweeping as a large scale management infrastructure, this myopic focus is simply devastating.
So, what are the overall properties of the system we’re trying to model here? What are we trying to accomplish, in the dynamic sense of system definition?
Luckily, the problem of management is pretty common, so there are lots of examples of real world systems we can examine to see what’s up. One resource I’ve found amazingly useful is the open source information repository found at infrastructures.org. Oddly, I’ve found that when I point this treasure trove out to people, their eyes just kind of glaze over. Over the years I’ve come to judge the information quality of a site (by my own definition of “good”) by how much a manager or high level project coordinator’s eyes defocus when I try to get them interested in that particular mine of information. Infrastructures.org is a site which will literally cause any manager to pull out their phone and start texting random people in a desperate attempt to look busy. Even better, the “big picture” personnel will literally start walking out of your office when you start diving into the site itself, exploring the various nuggets of collective wisdom. So you know that it has to be a good site. Check it out – it’ll be well worth your time if you’re interested in this stuff.
The purpose of a management infrastructure is to automate the process of change. One of the desired effects of a successful system is that the investment you have to make in terms of human labor goes down – i.e. you should be able to dedicate far fewer people and have them spend far less time dealing with the system. I know this sounds rather obvious when you see it here in the ASCII, but if you simply look at the available systems out there, you find that – perversely – precisely the opposite seems to have happened. Rather than reducing the amount of time someone has to dedicate to the system, the time required increases polynomially as more and more “functionality” is added to the system. Rather than simplifying the problem, the complexity grows with each and every feature as things interact in complicated and unforeseen ways. Instead of reducing the complexity of an already very complex problem, these systems increase it.
Now, the absolutely stunning thing to me is that end users and administrators don’t simply revolt. But I guess there’s a certain macho aspect to doing system administration. An aura of “priesthood” and a realm of complexity that marks one’s manhood. Or it could be that Asperger syndrome is far more common amongst IT personnel and they simply don’t notice the complexity because they are supermen who shrug off complexity the way a duck sheds water.
But where you can’t ignore these issues is on the bottom line of your expenses. People cost money. And smart people who can herd the cats you have in your data center cost a lot of money. So any time you have to inject a human into any process, you are spending a lot of money. Consequently, anything that removes humans from the process and lets you better leverage the smart ones you have will pay you back handsomely.
So what we are looking for, from the high level view of the system, is something that drastically reduces the complexity of managing a living and breathing system of perhaps tens of thousands of processes and all of their complicated interacting pieces.
Another mistake that I quite frequently find in the way projects approach system management is that they somehow believe that they have to solve the entire problem or they can solve none of the problem. So, for example, you see people literally start with bare metal provisioning and work their way up from there. Now I’m certainly not saying that we don’t need something – or some things – which cover the entire suite of issues, soup to nuts. But the problem with focussing on everything is that the problem really is quite stunningly huge.
It never ceases to amaze me, in my interactions with smart people who are doing cool things, that they simply don’t seem to comprehend that there is great value in not setting the initial bar at a height somewhere around the orbit of Pluto. I can’t tell you how many times I’ve had the conversation where I’m talking with a colleague about, for example, OSGi. When I suggested that he take baby steps and simply make the system we were discussing a single, large bundle, he kept asking what the point was. He wanted to immediately start dissecting the large system into a highly modular system with a correct factoring of interacting services.
When asked the question “How do you eat an elephant”, the correct answer is “bite by bite”.
It really is lunacy to think you can wolf down an entire elephant in one bite. And if you take it bite by bite, you’ll find that each bite is actually useful in and of itself. In the OSGi example with my colleague, I pointed out that simply having the system as a single bundle was actually a huge step forward. By having the system as a bundle, it came under the same provisioning mechanisms that all OSGi bundles are subject to. For example, I can start/stop/load/unload the bundle – something that we literally cannot do with the system in its current form. Further, we can now deploy the bundle to any OSGi container by referring to the location URL for that bundle. This gives us a pull model for deployment, which means we don’t have to worry about pre-provisioning the physical directory of the process that will host the system. These are non-trivial things.
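To make that concrete, here is a minimal sketch of what that lifecycle and pull-model deployment look like through the standard OSGi framework API. The bundle location URL and class name are hypothetical examples I’ve made up for illustration, not anything from the actual system under discussion.

```java
import org.osgi.framework.Bundle;
import org.osgi.framework.BundleContext;
import org.osgi.framework.BundleException;

// Sketch: installing and controlling a single, large bundle via the standard
// OSGi lifecycle API. The location URL below is a made-up example; in a pull
// model the container fetches the bundle from wherever that URL points.
public class MonolithDeployer {

    public void deploy(BundleContext context) throws BundleException {
        // Pull the bundle from its location URL – no need to pre-provision
        // the hosting process's directory with the artifact.
        Bundle monolith = context.installBundle(
                "http://repo.example.com/bundles/big-monolith-1.0.0.jar");

        // Lifecycle operations we simply don't get with the un-bundled system.
        monolith.start();      // bring the whole system up
        monolith.stop();       // take it down again
        monolith.update();     // re-fetch the artifact and refresh in place
        monolith.uninstall();  // remove it from the container entirely
    }
}
```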
And then look at where you now are with respect to the ultimate goal of having a well factored, modularized system composed of interacting services. If you now have the system running as an OSGi bundle, the system is now operating in an OSGi environment. Even if you do nothing with the system at this point, and merely leave it as a single bundle, any new functionality you add to this system – and believe me, this system is not static and is constantly having new functionality added to it – can now take full advantage of being in an OSGi framework. This means that you can stop using Singletons for your services and simply use OSGi services as god intended. Further, you can now provide the new functionality as OSGi bundles rather than pounding the original system, pushing the functionality into it as you normally would.
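As a rough illustration of what “OSGi services instead of Singletons” means in practice, here is a small sketch using the core OSGi service registry. The AuditService interface and its behavior are purely hypothetical stand-ins for whatever the singleton used to expose.

```java
import java.util.Hashtable;

import org.osgi.framework.BundleActivator;
import org.osgi.framework.BundleContext;
import org.osgi.framework.ServiceReference;

// Hypothetical service interface standing in for something the old code
// would have exposed through a static Singleton.
interface AuditService {
    void record(String event);
}

// Sketch: new functionality publishes and consumes OSGi services rather than
// reaching for a global Singleton instance.
public class AuditActivator implements BundleActivator {

    @Override
    public void start(BundleContext context) {
        // Publish an implementation into the service registry, keyed by interface.
        context.registerService(AuditService.class,
                event -> System.out.println("audit: " + event),
                new Hashtable<String, Object>());

        // A consumer (typically in another bundle) resolves the service
        // dynamically instead of calling SomeSingleton.getInstance().
        ServiceReference<AuditService> ref =
                context.getServiceReference(AuditService.class);
        if (ref != null) {
            context.getService(ref).record("bundle started");
            context.ungetService(ref);
        }
    }

    @Override
    public void stop(BundleContext context) {
        // Services registered through this bundle's context are unregistered
        // automatically when the bundle stops.
    }
}
```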
Taking a bite of the elephant and chewing it before you swallow and take the next bite has advantages even if the elephant is still there.
Again, sorry for the diversion, but the point was that by trying to solve the entire system management problem, we almost always paralyze ourselves into non-action because the problem is so huge that we don’t know where to start. Worse, because we’re trying to do everything, we find that we do nothing well at all. The resulting system then ends up increasing the complexity because its parts don’t fit well together, haven’t gotten the design attention they need, and are poorly implemented because of the sheer size of the work to be done.
So that’s why I focussed on a limited domain in my research. As I think you’ll see, simply because I’ve limited the domain I’m applying the solution to, I haven’t actually limited the scope of the problems the system can solve. The system is general enough to do a whole host of things. However, by trying to solve specific problems in a limited domain, I’ve created an environment which can grow to solve large sectors of the problem in an organic and managed fashion.
The trick is to create something that can easily fit into existing mechanisms, procedures and tools that administrators already use to manage these complex systems. The idea is to take over some of the complexity of the existing systems and reduce that complexity without interfering with the rest of the system. So, while a good deal of the total scope of the global problem still remains, we have still made things better by reducing the overall complexity by taking a couple of bites out of the elephant.
Even better, we end up with a process which can be applied to the rest of the system (or a large sector of it) in a systematic fashion to solve the complexity remaining.
Bite by bite.
