Well, that was fun. I just spent a decent chunk of my weekend swapping out the fundamental mechanism in the Prime Mover event driven simulation framework. Between taking care of the little one and the massive allergy attack I was suffering (thanks, Mom Nature!), I think I should get an award or something. Wait! Here’s a gold star I can put on my laptop case.
In any event, the changes to the underlying framework are actually quite cool. What had happened was that Prime Mover was finally getting used in anger – well, gently played with, really – and this exposed some serious shortcomings in the data flow analysis I was using to perform the byte code rewriting which made the magic happen under the hood. Over a few beers at Bucks on Friday afternoon, we mulled over a few different strategies for fixing the issue – none of which were particularly appetizing.
So, let me step back a bit and lay out what happened to the framework and why.
At the heart of the Prime Mover framework is a byte code rewriting system which allows Java to be used as the scripting language for event driven simulations. Now, event driven simulation frameworks are pretty cool and are very useful in modeling as well as gaming (as an aside, I find it terrifying to witness the complete lack of simulation frameworks in most gaming systems – i.e. they all focus on the graphics and the behavior is a distant second thought). The basic idea is quite simple. You have simulation entities which can only communicate through the exchange of events. In the simulation, only the events are processed, which means that eons of simulated time can pass in microseconds if there are no events to be processed. This is why modelers of all disciplines (from business analysts and economists to physicists and wireless network engineers) find event driven simulation so darn useful.
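To make the "only the events are processed" point concrete, here is a minimal sketch of an event loop. The names `SimEvent`, `Simulation`, `schedule` and `run` are made up for illustration and are not Prime Mover's API; the sketch only shows why simulated time can leap forward when nothing is scheduled in between.

```java
import java.util.PriorityQueue;

// Hypothetical names throughout; an illustration, not Prime Mover's API.
final class SimEvent implements Comparable<SimEvent> {
    final long time;        // simulated time at which the event fires
    final long seq;         // tie-breaker so same-time events stay in scheduling order
    final Runnable action;  // what the receiving entity does

    SimEvent(long time, long seq, Runnable action) {
        this.time = time;
        this.seq = seq;
        this.action = action;
    }

    @Override
    public int compareTo(SimEvent other) {
        int byTime = Long.compare(time, other.time);
        return byTime != 0 ? byTime : Long.compare(seq, other.seq);
    }
}

final class Simulation {
    private final PriorityQueue<SimEvent> queue = new PriorityQueue<>();
    private long now = 0;
    private long nextSeq = 0;

    long now() {
        return now;
    }

    // Schedule an action to run `delay` ticks after the current simulated time.
    void schedule(long delay, Runnable action) {
        queue.add(new SimEvent(now + delay, nextSeq++, action));
    }

    // Drain the queue; returns the number of events processed.
    int run() {
        int processed = 0;
        while (!queue.isEmpty()) {
            SimEvent e = queue.poll();
            now = e.time;    // jump straight to the next event, however far away
            e.action.run();  // the action may schedule further events
            processed++;
        }
        return processed;
    }
}
```

Scheduling one event at tick 5 and another at tick 1,000,000,000,000 costs exactly two loop iterations; the clock never walks through the ticks in between.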
However, the overriding problem in all these frameworks is the rather ugly way they are exposed to the end user – i.e. the modeler. Usually, some bizarro scripting language is created and that is interpreted. Another way of providing the framework is to create a simulation “kernel” and expose this to the modeler in the form of “posting” or “subscribing” to events. Pretty darn ugly – at best – and subject to not just a lot of confusion, but nasty issues with typing and above all efficiency.
So, as I described previously, Prime Mover does away with all that using the clever idea of making use of the underlying Java VM as the driver for the event framework. This paradigm for event driven simulation turns out to be precisely the same kind of stuff that many of us have been doing for a while. The underlying idea of using proxies and reflective calls has been around since the Smalltalk days of OO – and I’m sure before. The idea that void return methods are excellent models for one way events is also something that has been around since the dawn of time in distributed communication. The only thing that’s really different in this modeling paradigm is to use these techniques to implement event driven simulations rather than distributed communication architectures.
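The "void methods as one way events" idea can be sketched with the JDK's own `java.lang.reflect.Proxy`. This is the generic, well-worn technique the paragraph alludes to, not Prime Mover's mechanism; `Greeter`, `RecordingGreeter` and `EventProxies.asEvents` are hypothetical names, and the plain `Queue<Runnable>` stands in for a real event scheduler.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.Queue;

// Hypothetical entity interface: a void method models a one way event.
interface Greeter {
    void greet(String name);
}

final class RecordingGreeter implements Greeter {
    final StringBuilder log = new StringBuilder();

    @Override
    public void greet(String name) {
        log.append("hello ").append(name).append(';');
    }
}

final class EventProxies {
    // Wraps `target` so that void calls are queued instead of executed immediately.
    static <T> T asEvents(Class<T> iface, T target, Queue<Runnable> pending) {
        InvocationHandler handler = (proxy, method, args) -> {
            if (method.getReturnType() != void.class) {
                // Non-void calls (toString, hashCode, ...) go straight through.
                return method.invoke(target, args);
            }
            // A void call becomes an event: record it now, run it later.
            pending.add(() -> {
                try {
                    method.invoke(target, args);
                } catch (ReflectiveOperationException e) {
                    throw new IllegalStateException(e);
                }
            });
            return null;
        };
        Object proxy = Proxy.newProxyInstance(
                iface.getClassLoader(), new Class<?>[] { iface }, handler);
        return iface.cast(proxy);
    }
}
```

The caller writes `greeter.greet("world")` as if it were an ordinary method call; nothing happens to the real object until whoever owns the queue decides to run the pending event.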
The advantages to event driven simulation are analogous in that you now have what amounts to a transparent system that simply runs your code, albeit with some different semantics due to events being processed asynchronously from the caller – something quite typical in distributed systems.
In any event, the infrastructure that I had developed had followed the lines used by the JIST framework in that I analyzed the byte codes and created type substitutions for the actual references to the simulation entities and rewrote the call sites at the event sends in the calling methods. Granted, doing this was a herculean feat demonstrating my mad programming skillz, but the problem is inherently difficult and in many common cases poses an insurmountable barrier to success.
At issue is that in order to make these byte code transformations, I had to do an awful lot of data flow analysis. I was using the data flow analysis that’s provided in the ASM framework and this allowed me to do some very precise intra-method data flow analysis. But it was quite clear that in order to do the problem “right”, I would have to do some pretty gnarly cross-procedural call analysis that I couldn’t do with the ASM framework. Consequently, I was looking into data flow analysis frameworks (short conclusion: there aren’t very many of them available) and reading reams of papers on the black art of code analysis. I had basically come to the conclusion that it was going to be frickin’ painful and would see if I could “work around” any problems that came up with the current code base that I had created – after all, the goal of the framework wasn’t to do bitchin’ data flow analysis, rather it was to create a simulation framework so I could start working on the MMOG infrastructure. Diving further into the data flow analysis really felt like hovering on the edge of a black hole’s event horizon. The last thing I wanted to do was get sucked into that.
Sadly, as I mentioned at the beginning of the post, it became immediately obvious that even butt simple use cases required a lot more analysis than what I was doing and that the workarounds – if they existed at all for some cases – would be quite painful. So I was faced with doing a lot of debugging work to see if I could fix the particular cases that we were trying or to head off in a more radical direction.
I had been mulling over dumping the entire byte code transformation mechanism in favor of a straight proxy model. The obvious advantage of the proxy model is that you can encapsulate the behavior required to implement the event driven simulation machinery in the proxy, meaning that you don’t have to distribute the machinery via code rewriting to all the potential call sites. However, the proxy model has a number of disadvantages as well which kept me from simply sitting down and doing the work to change the underlying system to a proxy model.
Probably the biggest disadvantage to the proxy model is the favored choice of using strict interface based proxies. This is the common model of RMI and pretty much any proxy system. You simply provide your proxies based on interfaces and now you can replace any references to the original object with these proxies. This is great, as far as it goes, but it’s not really a natural model. There’s a lot of fudging which goes on, and it’s notoriously easy to break through the proxy barrier and find yourself working with the real object by accident. In addition, working with pure interfaces isn’t always a natural model for people who are not programmers. Sure, we all use interfaces like we were born doing it, but it’s a simple fact that most people who model do so with classes. I really wanted to preserve the ability to work with straight classes as my simulation entities and resisted this strong pull for interface based proxies.
Now, it’s relatively straightforward to make proxies which work for classes. CGLIB and other frameworks do a quite nice job of making proxies for concrete classes – it’s not exactly rocket science, after all. But the problem – from my perspective – of using these types of proxies is that you now have significant overhead due to the fact that your proxy has a bunch of unused instance slots taking up a lot of space. If the number of proxies is small in your systems, then this is no big deal – i.e. the space overhead really is minimal so no need to worry about it. But in the MMOG systems I’m trying to tackle with Prime Mover, the number of entities is huge – many, many millions of simulation entities. It’s like the absolute worst of all the fine grain distributed object nightmares rolled into one and amplified by several powers of ten.
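The wasted slots are easy to see with a hand-written stand-in for a class based proxy. Everything here is hypothetical (`Avatar`, `AvatarProxy`, `Slots`), and a real proxy would be generated by something like CGLIB rather than written by hand, but the layout problem is the same: a subclass inherits every instance field of the class it proxies, whether it uses them or not.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;

// Hypothetical simulation entity written as a plain class.
class Avatar {
    long x, y;
    int health;
    String name;

    void moveTo(long newX, long newY) {
        x = newX;
        y = newY;
    }
}

// Hand-written stand-in for a generated class based proxy.
class AvatarProxy extends Avatar {
    private final Avatar target;

    AvatarProxy(Avatar target) {
        this.target = target;
    }

    @Override
    void moveTo(long newX, long newY) {
        target.moveTo(newX, newY);  // delegate; the inherited x and y stay unused
    }
}

final class Slots {
    // Counts the instance fields an object of `type` carries, inherited ones included.
    static int instanceFieldCount(Class<?> type) {
        int count = 0;
        for (Class<?> c = type; c != null && c != Object.class; c = c.getSuperclass()) {
            for (Field f : c.getDeclaredFields()) {
                if (!Modifier.isStatic(f.getModifiers()) && !f.isSynthetic()) {
                    count++;
                }
            }
        }
        return count;
    }
}
```

Each `AvatarProxy` carries all four `Avatar` slots plus its own `target` reference, so every proxied entity pays for its state roughly twice. Multiply that by many millions of entities and the overhead stops being a rounding error.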
So while I could solve the immediate problem by using class based proxies, I knew that in the end my life would suck even more than it did now because the stated goal of the project – i.e. to provide a massively scalable online gaming system – would be thwarted by the simple fact that my storage requirements would easily double – if not more, depending on how the statistics of references worked out.
Thus, after a few beers on Friday I was resigned to changing the system to use the class based proxies and put off dealing with the space issues until a later date – treating the space overhead as an “optimization” issue which could be dealt with in some fashion. Magic pixie dust or leprechauns, I guess. Not the best solution, but hey. Maybe I couldn’t, in the end, produce the kind of system I was hoping to. It happens.
And so I dug into the problem Saturday, taking out the big knives and hacking away at the underbrush that composed the byte code rewriting. By late Saturday night I had most of the system transformed to use the proxy based system and was feeling pretty good about the work I’d accomplished. However, the phone rings and Stefan’s on the line – apologizing for calling late, of course – talking to me about the idea he had while working in his garden. The idea was something that I had briefly fantasized about but discarded for reasons which I’ll soon explain. However, the idea that Stefan described was simple: Why waste the slots in the proxy? Why couldn’t we unify the proxy and the simulation entity into one object?
As I said, I had briefly fantasized about this but had rejected it because one of the problems would be actually breaking the references between entities upon serialization. As I’ve mentioned, I’m going to be relying on Coherence to scale as well as provide continuous availability and that requires serializing the entities. Breaking the references between entities is required or else you get one massive multi-terabit blob of goo.
But as I was talking with Stefan, I came to realize that this was simply an issue with serialization. I could easily control what happens when I serialize a proxy from a referrer – i.e. write out the UUID which uniquely describes the entity, which for a standard 128-bit UUID takes precisely two longs. And I could just as easily use an alternate serialization for the entity qua entity which serialized the actual state of the entity. Once I realized that I could have two serialization schemes which coexisted side by side, I realized that I could, indeed, unify the proxy and the actual state of the entity and eliminate all the overhead I was fearing.
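A rough sketch of the two side by side schemes follows. It assumes the identifier is a `java.util.UUID` (128 bits, so it writes as two longs; the identifier actually used in Prime Mover may be a different type), and `SimEntity` and `EntityCodec` are hypothetical names. How this would plug into Coherence is not shown.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.UUID;

// Hypothetical entity: one object holds both the identity and the state.
class SimEntity {
    final UUID id = UUID.randomUUID();
    int health;
    SimEntity neighbor;  // reference to another entity; may form cycles
}

final class EntityCodec {
    // Scheme 1: the entity seen from a referrer. Only the identity is written.
    static void writeReference(DataOutput out, SimEntity e) throws IOException {
        out.writeLong(e.id.getMostSignificantBits());
        out.writeLong(e.id.getLeastSignificantBits());
    }

    // Scheme 2: the entity as itself. State is written in full, but any
    // entity it points at is cut down to a reference, so the graph is never followed.
    static void writeState(DataOutput out, SimEntity e) throws IOException {
        writeReference(out, e);
        out.writeInt(e.health);
        out.writeBoolean(e.neighbor != null);
        if (e.neighbor != null) {
            writeReference(out, e.neighbor);
        }
    }

    static byte[] referenceBytes(SimEntity e) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try {
            writeReference(new DataOutputStream(buf), e);
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return buf.toByteArray();
    }

    static byte[] stateBytes(SimEntity e) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try {
            writeState(new DataOutputStream(buf), e);
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return buf.toByteArray();
    }
}
```

Even when two entities point at each other, `stateBytes` for either one stays a fixed, small size, because the other end of the reference is written as an identifier rather than as another entity.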
So now, what I do is simply generate a subclass for each entity which implements the mechanics of the event driven simulation framework – e.g. asynchronous event processing and continuations. Then the only byte code rewriting I have to do is focused on the creation of the entities themselves. Analyzing the code to find the sites where the entity instance is created and rewriting them is a quite trivial and completely robust process, given that to create an instance of a type, you have to actually refer to the actual type. Of course, this doesn’t deal with creation via reflection, but hey. Everything has limitations and the domain of modeling and simulation isn’t the wild wild west of cowboy programming. And in any event, the transformation is quite trivial and straightforward to do. Considering that Prime Mover’s own bootstrapping has to use reflection, the use of reflection – if you have to – in the model is quite easy.
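In source form, the effect of the generated subclass plus the creation site rewrite might look roughly like this. It is a hand-written approximation under stated assumptions: `Door`, `DoorEntity` and `Model` are hypothetical, the real subclass is generated as byte code, a plain `Queue<Runnable>` stands in for the event machinery, and continuations are left out entirely.

```java
import java.util.Queue;

// The modeler's class: ordinary Java, no framework types in sight.
class Door {
    int openCount;

    void open() {
        openCount++;
    }
}

// Stand-in for the generated subclass. It is the entity (it inherits the state)
// and the proxy (it intercepts the event method) at the same time.
class DoorEntity extends Door {
    private final Queue<Runnable> pending;

    DoorEntity(Queue<Runnable> pending) {
        this.pending = pending;
    }

    @Override
    void open() {
        // Queue the event; the inherited behavior runs when the event is processed.
        pending.add(() -> super.open());
    }
}

final class Model {
    // As written by the modeler, the body would be:  return new Door();
    // After the creation site is rewritten, it behaves like this instead.
    static Door createDoor(Queue<Runnable> pending) {
        return new DoorEntity(pending);
    }
}
```

Callers still hold a `Door` and call `open()` on it. Because there is only one object, none of its slots are wasted, and the only code that had to change is the `new` expression.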
The end result of all this is that Prime Mover is now in far better shape to face the future and I feel much more confident in the ability of the system to maintain its semantics and deal with all the slings and arrows we’re going to virtually throw at it in the next phases of development. I’ve eliminated an extremely nasty data flow problem and a potentially crippling space overhead, and discovered a new kind of Proxy trick (I don’t think I’ve ever seen this technique anywhere, but the world is large) that has wide applicability beyond the little universe of Prime Mover.
Not bad for a weekend’s work.