Thursday, October 30, 2008

Mavenizaton - A real life scenario

The other day I posted a message on a certain vendor’s support forum asking about the possibility to include minimal POMs in their jars (so they will be easier to mass-import into local repository) and I got the following reply:

We want to move from ant to maven for a long time but never got a chance to do it. Our ant script is pretty big and complex so we are afraid it will take us a while to migrate to maven. Since we have other important things to work on, we keep delaying this kind of infrastructure things that have relatively less impact than bugs. If you know any way to migrate from ant to maven quickly, I would love to hear.

So I asked if I can get access to their Ant script and they kindly sent me that 3000-lines of a beast.

The Challenge

The product is comprised of several modules which have well-defined dependencies on one another and can be used in various configurations. Each configuration is described in a build.properties file by xxx-source-path, xxx-properties-path and xxx-output-path properties. The source-path contains both Java files and misc. resources (there are no tests). The xxx-properties-path contains property files only.

In the end, for each module the build produces the following artifacts:

  • JAR files – depending on properties, these can be:
    • Obfuscated with Zelix KlassMaster
    • Debug builds (still obfuscated, but with line-numbers)
    • Postprocessed by Retroweaver for Java 1.4 compatibility
  • Zipped sources
  • Zipped source stubs – public and protected method definitions and javadoc (no method bodies)
  • Zipped Javadocs
  • User manual in PDF format

In addition to these primary artifacts, the project:

  • Generates a couple of demo applications packaged as single-click JARs and JNLP.
  • Filters all property and some source files, substituting placeholders with values defined in the build.properties.
  • Compiles on the fly a doclet and then uses it to generate the source stubs.
  • Creates a bunch of bundles, containing different combinations of the aforementioned artifacts.

The vendor has a policy of releasing new demo-build every day which expires after 28 days.

The First Step – Fake It

In fact, we, the customers don’t care whether you are using Maven internally. All we care is to receive a POM with correct dependencies and (if possible) to be able to download the libraries from the official Maven repository (a.k.a. Repo1) or if not – your private repository (i.e. http://repo.your-company.com/demo or http://repo.your-company.com/licensed ).

In order to achieve this:

  1. For each artifact, create a pom.properties file containing 3 properties: version, groupId and artifactId.
  2. For each artifact, create a minimal POM containg the artifact group, name, version and dependencies information. Alternatively you can put placeholders and use Ant filtering to resolve the group/name/version from the pom.properties
  3. Integrate the new files into your build process, treating them as yet another resource file and placing them under the META-INF/maven/your/artifact/group in each artifact JAR.
  4. If desired use artifact:deploy task to push the artifacts to a Maven repository.

This is all you need to do for the ‘regular’ JARs. For the various other files, you would want to use the same artifact name and a classifier I would suggest using:

  • sources – standard Maven convention, since the full sources are not released to regular customers, it would make sense to have them either under a different classifier (e.g. fullsources) or same classifier, but from a private repo
  • javadocs – standard Maven convention
  • jdk14 – for retrozapped binaries
  • debug – for obfuscated binaries with line numbers

Then you Migrate

If we step back, the current build proces is something like this:

  1. Preparation Phase
    1. Set variables
    2. Compile custom doclet
  2. Generate Sources
    1. Filter particular files to resolve values from properties
  3. Compile
  4. Copy resources
  5. Package binaries (create jar file)
  6. Obfuscate
  7. Retrozap
  8. Javadocs
  9. Source Stubs
  10. Assemble distribution archives
  11. Build the demos

Here is how it would look after we convert the build to Maven:

  1. All the path variables go away, replaced by standard Maven directory layout (you should be able to reorganize the source trees for a few hours on a Friday afternoon.) As a last resort you can define custom directories in the build section.
  2. The company-name, version, etc. props go away replaced by the respective POM elements.
  3. The rest of the settings are specified in profiles defined on project or user level.
  4. The resource filtering is a built in feature for maven, but if you need to pre-process Java files, you would need to attach your own plugin to the grnerate-sources lifecycle phase.
  5. Compile and Package are standard Maven targets and you don’t need to do anything.
  6. The ZKM obfuscation would be tricky because there is no Maven plug in (or at least I couldn’t find any). I suggest that you use the AntRun plugin as a temporary hack.
  7. The Javadoc plugin would automatically generate your Javadocs, you need to explicitly specify a second execution for your custom doclet
  8. The assembly plugin should help you organize your different distributions.

This leaves out a lot of details, but ultimately some of the decisions are yours to make. I would suggest using a script to reorganize your sources, so you can do a number of dry-runs and verify that the result is as expected before you change the script to perform the move in the version control system.

Project Layout Considerations

Each Maven POM has exactly one parent, but can be part of multiple reactors. Both parent and reactors are pom-packaged artifacts, but their role is different. Most tutorials use the same pom for both purposes, but there is benefit in separating the functions (more about this in a bit).

POM Inheritance

The role of the parent is to provide defaults for the child POMs that inherit from it. The POMs are resolved by the normal Maven dependency resolution mechanism, which means that if you want the child projects to reflect a change in a parent-pom, you need to install/deploy the parent first. Alternatively, you can use the project/parent/relativePath element to explicitly specify the parent-pom’s path in the file layout (in this case, the changes are picked from the file-system.)

At the top-level parent you should use dependencyManagement to specify versions and transitive-dependency-exclusions for all artifacts used by the project. No child-artifacts should explicitly declare versions and exclusions.

You should use pluginManagement to fix the plugin versions and (optionally) specify common options for plugins. It’s ok (although not recommended) to override the plugin options in the children-pom’s, but you should not specify versions there. If you need the same overrides in multiple children, consider extracting them into intermediary-parent pom.

Finally, you should try to specify most of your profiles on a parent-pom level. This, combined with the pluginManagement allows you to define the bulk your build in the parent and leave the children DRY.

The children have to explicitly specify the master-pom version which makes it a bit of a pain to increment the version (on the other side, the master-pom usually changes much slower.)

Building in Dependency Order

The normal way for Maven to resolve dependencies between projects is through the local repository. This means that if project A depends on project B and we make a change in B, in order to test A against the new change, we need to first install or deploy B and then build A, so it would pick up the new changes (this is assuming we’re using SNAPSHOT versions).

The reactor projects take care about such closely related artifacts and automatically rebuild them in the correct order. Also they act as a grouping for the sub-projects (useful when you build assemblies). The reactors themselves are artifacts using pom-packaging (same as parents), but they explicitly list the relative path to each constituting module, and this is the reason that I recommend separation between parents and reactors:

Creating Distributions

The standard way of packaging distributions in Maven is using the Assembly plugin . It allows you to describe your distribution package in terms of project resources, artifactsIds and classifiers. A typical assembly descriptor would be:

  • Create a tar and zip files with classifier ‘distro’.
    • Add all files from /src/main/assembly
      • excluding **.xml, and **.sh.
    • Add all files from /src/main/assembly
      • that match **.sh
      • and set the unix permissions to 551.
    • Add the maven artifact
      • with group-id=com.company.product.modules and artifact-id=module1 and classifier=obfuscated
      • put it under lib.
    • Add all the transitive dependencies of this project (assuming this descriptor is in a reactor project)
      • and put them under lib/ext

You can also unpack and consolidate JARs, specify manifest-entries and signing for single-JAR distributions and so on.

Each artifact can produce more than one assemblies. If you want your assemblies to go into the repo, make sure you use the ‘attached’ goal. Most of the times, I use the reactor poms for distributions, but sometimes I just define the assembly in the JAR module’s POM.

Parents and Children vs. Reactors and Modules
Category Parent/Child Reactor/Module Merged Parent+Reactor
Direction of Coupling Children know the names of their parent, parents don’t know their children Reactors knows the locations of their modules, modules don’t even know whether they are being built as a part of a reactor Creates circular dependency between children/modules and parent/reactor artifacts.
Multiplicity Each child has exactly one parent, specified in the child POM. A module can be part of more than one reactors. You must keep it to one reactor. If you define alternative reactor, the parent will still be resolved to the main parent-POM (as specified in the children POM’s), which can lead to a great deal of confusion (and undefined behaviour for some plugins).
Location The parent is usually resolved through the repo (optionally through relative path) The modules are always resolved through relative path You need to physically layout your project, so the parent POM is accessible by relative path (which is not needed otherwise).
Change Drivers The parent changes when your build process changes. The reactor changes when you add a new module By the artificial coupling, your POM is subject to more change. This is particularly bad for the parent role, because you need to manually update the parent-versions of the children.

Given this table, I’d recommend a project layout along these lines:

-+-+- project-root/
 |
 +-+- modules/
 | |
 | +-+- module1/ 
 | | |
 | | +--- pom.xml       | inherit from modules parent; specify dependencies 
 | | |                  | and module details (name, version, etc.)
 | | +-+- src/
 | |   +-+- main/
 | |   | +--- java/
 | |   | +--- resources/
 | |   |
 | |   +-+- test/
 | |     +--- java/
 | |     +--- resources/
 | |
 | | 
 | +-+- module2/ 
 | | |
 | | ...
 | |
 | +-+- module3/
 | | |
 | | ...
 | |
 | +-+- reactor1/       | e.g. all modules - used for the CI build and pre-commit
 | | |
 | | +--- pom.xml       | no parent; specify the modules as ../module1, ../module2, etc.
 | |
 | +-+- reactor2/       | e.g. only module2+module3 - used by the module3 developers only
 | | |
 | | +--- pom.xml   
 | |
 | +--- pom.xml         | modules parent - inherits from the main parent pom and specifies 
 |                      | module-specific profiles and plugin settings
 |
 +-+- applications/
 | |
 | +-+- application1/ 
 | | |
 | | +--- pom.xml       | inherit from apps parent; specify dependencies, module details 
 | | |                  | and whatever custom steps necesarry for the application
 | | ...
 | | 
 | +-+- application2/ 
 | | |
 | | ...
 | |
 | +-+- reactor/        | e.g. all applications
 | | |
 | | +--- pom.xml       | no parent; specify the modules as ../application1, ../application2, etc.
 | |  
 | |
 | +--- pom.xml         | apps parent - inherits from the main parent pom and specifies 
 |                      | module-specific profiles and plugin settings
 |
 +--- main-reactor      | specify as modules ../applications/reactor and ../modules/reactor1
 |
 +--- pom.xml           | the main parent pom

A few Tips

When You getStuck with a Bad Plugin

For all operations that you can’t easily achieve with the existing Maven plugins, you can fall back to the Antrun plugin and reuse snippets of your current build script. Yes, it’s a hack, but it gets the job done and you can fix it when it breaks.

Try to keep these hacks in the parent pom.

Dependency Management

When you define dependencies, take care to specify proper scopes (default is ‘compile’, also you should consider ‘provided’ and ‘test’). You should also mark all optional dependencies like POI HSSF. For some artifacts you might need to specify dependency-excludes, so Maven doesn’t bundle half of the Apache Commons with you.

Evaluation Builds

You might want to release the daily evaluation builds as snapshots. Maven has specific semantic for snapshots that it always refreshes them based on timestamp. For example, if you release an artifact my.group:module:1.0.0, Maven will download it once to the local repository and never look for another version unless you delete it manually (that's why you should NEVER re-cut a release once it's been pushed out - just declare the version broken and increment).

If you use a snapshot version (i.e. my.group:module:1.0.0-eval-SNAPSHOT), Maven will check once a day (unless forced with '-u') and automatically any new version base on timestamp. To achieve this you might want to cut an eval-branch from the tag of each release and change the version using a shell script or something.

Saturday, October 25, 2008

Why ORM?

I like Spring JDBC Template - it is simple, clean and saves a lot from the JDBC drudgery. It provides nice abstraction, simplifying the usage and management of prepared statements and overall if all you want is to do an ocasional query, it doesn't get any simpler. Oh, and you have full control on the underlying JDBC connection, allowing you to exploit any vendor-specific functionality. The provided functionality only makes the data access easier and supports only very basic property to column mapping out of the box (or you can write your own mapper in Java). Because of its lightweight nature, JDBC Template does not allow for transparent caching, horizontal table partitioning, vertical sharding, etc. It also doesn't offer lazy fetching of relations, identity tracking and other instrumentation magic.

Another tool I like is iBATIS. In comparison with JDBC template, it offers more structure, organizing my SQL statements in a logical way and supports most common forms of mapping the result sets to beans (including graphs of objects and joins). Ibatis does support caching (only on a whole query level), still no sharding and partitioning, lazy fetching and identity tracking. IBATIS doesn't try to parse your queries.

The Abator tool can generate iBATIS CRUD mappings, DDL from object model, and object model from a database, but frankly - in its generated form all these suck. The DDL uses only a few datatypes, has no RI or semantic constraints, nor indexes (I use it sometimes, but only as a boilerplate which I extensively tweak afterwards). The generated beans are isomorphic with the underlying tables, which is great if you are doing CRUD, but I find that most of my applications do data analysis, so the queries tend to be quite hairy, half of them encapsulated in stored procedures (for transaction processing I'd rather use Spring or plain JDBC anyway). The bottom line is that Abator handles the trivial cases, but for most of my projects it doesn't help much, so I'm back to manual coding (which is not that bad anyway.)

The first time I looked at ORM was when I checked TopLink for a project around 2000. Back then I got the impression that it introduces a lot of new concepts in order to simplify something that's already simple enough - loading a couple of beans from a table. For that project I used plain JDBC, it took me perhaps 20 mins to write and I can't remember taking much time to maintain that code - the total cost should have been less than 3 man-hours.

The second time I tried Hibernate and Toplink last year (2006) on a prototype with the same dataset and relatively small domain model (for which I finally ended up using iBATIS). The first thing that stroke me was how slow everything was - the startup time was noticeably slower than with iBATIS (about 500-800%), spent in initializing the JPA entity manager and the queries were consistently 30-50% slower. It might have been that I didn't know how to tune it, but I just couldn't achieve satisfactory results. I researched the facilities for mapping to externally defined schema and it stroke me that both Hibernate and TL are way clunkier than iBatis. There were also a number of restrictions that suggested that I need to modify my schema just to enable certain features of the tool... in the end, the identity tracking is a leaky abstraction - the whole story with the attached and detached entities smells to me.

Now the thousand dolar question, given that most applications out there DO NOT use horizontal partitioning or sharding, caching in the application layer is usually more efficient anyway, identity tracking and the transparent (hidden) lazy fetching often causes more problems than it solves (and countless hours lost in app tuning with dubious results), why are so many people obsessed with the idea of cramming an ORM into their application?

The other day I wrote a small application that had four composite entities and a couple of fairly sophisticated queries, the the data access part took me about less than an hour (that's implementation, testing and tuning), so:

  • How long would it have taken me if I used JPA? (to get reasonably well performing, tested code)
  • If you are using ORM, how many entities do you have? How much time (percentage) do you spend working on your persistence logic? Are you using the database as operational storage or are you trating it as a long-term asset? Do you share the DB with other applications?
  • Are the productivity gains realized only when your persistent model is so big that it's unmanageable by other means? Is it to futureproof a product (what is a reasonable timeframe for ROI, any examples?)?
  • How important is for you the ability to scale your data access to multiple database servers?
  • Have you ever switched an application from one SQL server to another? How much effort was it? What does the application do with the DB?
  • What are the benefits of using ORM smart caching and partial queries in comparison with in-memory database like TimesTen or H2?

Please enlighten me...

Wednesday, September 10, 2008

On Scopes

Let's define component as a programming abstraction that exposes a number of interfaces and is instantiated and managed by a container. A component has implementation that sits behind the interface and provides the functionality, which could be a custom Java class, a proxy, a BPEL engine, a mock object, or even a mechanical turk solving captchas.

In the case of Java class components, they are usually composed of smaller Java classes like the ones in java.lang.* and java.util.* packages. All but the most simple Java objects are actually object graphs which are instantiated, wired togerher and orchestrated by the Java language and the JVM. The Java language specifies the composition structure and behavior, while JVM provides the means to invoke operations and share data. In contrast, the components are instantiated by a container and the container is responsible to wire them together (usually based on a blueprint in the form of a configuration or class annotations). The container can also provide a number of declarative services to its components: transaction management, access control, performance statistics and transparent distribution. Some popular containers are OSGi, Spring, EJB, ActiveX, MTS/COM+.

Even if a container can expose a local component over remote interface without requiring any code changes, practice has shown that taking a single-server application and distributing it in such fashion usually results in a disaster. This doesn't mean that the non-invasive distribution support is useless though. Given responsible usage, it can enable you to build distributed applications without having to deal with the transport API, using clean and reusable code (POJO). There are a number of new frameworks that are built around the concept and try to provide the right level of transparency and control (Mule ESB, Spring Remoting, Spring Integration are some popular choices).

Once you get too many components in the same contaner, autowiring starts doing most unexpected things, debugging becomes difficult and you start loosing track of your system. The solution is to add scoping. Spring supports limited scoping facilities in the form of parent-child application contexts. Although it solves the problem with configuration visibility, it does not provide facilities for controlling the public interface of the scope (explicitly exported objects). OSGi tackles this problem from a different direction using controlled classloaders, Spring OSGi is trying to be the best of both worlds, giving you scoped configuration and runtime autodiscoverable scoped components.

<interlude/>

I just finished reading Distributed Event Based Systems and they have an excellent chapter on utilizing scopes as means to structure a distributed event-driven application (chapter 6). They suggest a scoped architecture, where each component can belong to multiple scopes, scopes are linked in an arbitrary graph and notifications are governed by the different scope and component properties. The scopes described can declaratively provide a number of services (think AOP for messages). Some of these services are:

  • Containment - logical grouping of components and other scopes, allowing easy addressing for event dissemination and policy enforcement.
  • Scope Interface - defining which messages can cross the scope boundary and enforcing this policy
  • Transmission policy - determine which scope components will receive a certain message and/or which components have a right to publish/receive messages that cross the scope boundary.
  • Mappings - when a message crosses the scope boundary, enrich or transform it based on the source and destination scopes.

Scopes could be used to delineate groups with different QoS requirements, or to partition the application in geographical or security domains (or even both at the same time) and apply the appropriate policies. Integrating an external systems is also easier, because they are treated as yet another scope (as opposed to "enemy in the bee's nest") and you can provide rich access to the messages that the third party is entitled to without having to bolt on yet another authentication and authorization mechanism.

The book focuses in great detail and implementation strategies, analyzing different approaches ranging from simplistic (flooding) to very complex (integrated routing). Right now, the described architecture can not be implemented directly in any eventing system that I know of, though we can see some of the ideas in Jini (peer to peer with cooperative filtering), JMS (in implementation supporting topic hierarchies), OSGi (intra-JVM) or SCA (a lot of talk about scopes and composites, but I don't see them doing anything useful with them). The authors also propose a configuration language for managing scoped systems with the specified features. I can see AMQP as a flexible low-level specification that would allow to build something like this by implementing custom exchanges. We can use the scope configuration language as an intermediate representation, from which we can generate the custom exchange code.

Ultimately, I believe that in 10 years from now this scoping and routing facilities will become standard part of the messaging middleware (and perhaps hardware?) I'm looking in the general direction of AMQP, possibly SCA (despite their WS-* fetish) and new high level languages directly expressing architectural concerns like ACME and Einstein.

Tuesday, September 2, 2008

Dynamic components in Mule

Imagine that you are given to implement the following scenario:

  • A system is composed of a number of uniform entities, each entity publishes its status on a separate pub-sub topic and accepting instructions over separate control queue.
  • Your application receives a stream of instructions over a control channel, instructing it to monitor and control or stop controlling certain entity.

or you receive a stream of active orders and want to subscribe to the right marketdata to calculate a relative performance benchmark, or...

Usually I write about how to implement stuff with Mule, but this time I'm going to write why Mule might not be your best choice if your use-case is anything like above.

The most straight forward approach for implementing the system above is to create a simple POJO, encapsulating the management algorithm and have a service listen to the control channel and instantiate/dispose a new component for each managed entity.

The problem is that in Mule the components are somewhat heavyweight and it's not so nice when you start having hundreds (approaching thousands) of them. Another problem is that the code for manually creating and registering a componment along with the whole paraphernalia of routers, endpoints, filters and transformers is quite verbose.

The latter problem is easier to solve. There are a couple of JIRA issues that aim to simplify the programmatic configuration (MULE-2228, MULE-3495, MULE-3495). The Annotations project at Muleforge is also worth watching (right now there's no documentation, so you'd need to dig in Subversion).

The problem with the too many components is more fundamental. The main application domain for Mule is service integration. Though it also happens to be a fairly good application platform, it does fall short when it comes to managing a lot of components. One of the important things missing is the ability to create groups of components with shared thread pools. Another missing thing is better support for variable subscriptions (provide an API to easily add and remove subscriptions to the same component using some kind of template). Another notable improvement would be to elaborate the models into full blown scope-controllers, allowing to create composite applications with proper interfaces, routing policies, etc. All of these are addressable, but there is a long way to go.

There are also some improvements that can be done to the JMX agent - use JMX relation service to connect component to its endpoint and statistics, all components of certain type, user defined groups of components. Allow the component to exercise some control over the Mule-generated ObjectName so users can organize the different components in groups and ease the monitoring by EMS tools by allowing to query the parameters from a specific group.

Please share if you've had similar issues, how you dealt with them and whether you think {with the benefit of a hindsight) that the solution was worth the effort. Also I'd like to hear from users/developers of other integration frameworks or composite application containers (Newton?) how they handle these problems.

Monday, September 1, 2008

Transformers vs. Components - Take 2

In my previous article, I claimed that one of the main criteria for which Mule artifact to choose is the multiplicity of the inbound and outbound messages. I've been thinking about it and I want to add two more points:

  • Transformers are one instance per endpoint reference. This means that even if you use pooled components, all messages go through single transformer instance. In such case, you need to make sure that the transformer is threadsafe (which is not true if it is holding a JDBC connection or Hibernate session). The components do not have to be threadsafe, though if you are injecting some shared state you still need to properly synchronize the access to it.
  • Mule2 services allow you to configure different ways of instantiating the component (singleton, pooled, looked up from Spring/JNDI/OSGi/etc.). In contrast, all transformers defined using the mule-core namespace are prototype-scoped.
  • I actualy agree that direct database access belongs in a component. I still think that accessing a cache is kindof a gray area though.

Saturday, August 16, 2008

InfoQ, Firefox Hacks and Media Snobbery

I happen to be one of those people that don't like TV - it's not that I don't watch it when I get a chance, but I've made a conscious decision not to buy one and to get my news and movies from other sources. My primary issue is that TV as a media is too engaging, passive and slow. Radio is also passive and slow, but less engaging, hence better.

For example: in 10 mins browsing news sites, I can learn more than 10 mins of browsing newspapers, which still fare better than 10 mins of radio, which is just the same as 10 mins of TV, except that when I watch TV I don't do anything else.

I'm a much faster reader than listener. When I read I can always stop and mull over a sentence for a couple of minutes, go back or skip forward with natural ease. Not so with video and audio. Even though most streaming media these days have some kind of non-linear controls (skip-bar, etc.) it's still impossible to do the equivalent of scanning the headlines or speed-reading in the case of podcast or video content.

One of the reasons I like the InfoQ interviews is that they have full transcripts. One of the resons I hate them is that the transcript is enclosed in small box, 250px high in the middle of the screen and one has to click the open and close buttons and scroll like crazy. What I often end up doing is to fire DOM Inspector, cut off all the DOM nodes except the ancestors of #interviewContent, switch to CSS properties, delete the height and max-height props and this way I end up with full-screen view of the transcript, which I can print or browse at my convenience.

Today as I was debugging some stuff, I remembered that Firefox supports user stylesheets. By putting the bellow statements in my ${FIREFOX_PROFILE}/chrome/userContent.css, I was able to expand the transcript to be readable in a browser without inner scrolling and tweak the printable view to display only the relevant content (no empty squares in place of the flash player or vendor links I can't click on paper.)

/*
 * This file can be used to apply a style to all web pages you view 
 * Rules without !important are overruled by author rules if the author sets any. 
 * Rules with !important overrule author rules. 
 * 
 * For more examples see http://www.mozilla.org/unix/customizing.html 
 */ 
 
div#interviewContent, 
div#interviewContent * { 
    height: auto ! important; 
    max-height: none ! important; 
    overflow: visible ! important; 
} 
 
div#interviewContent div div div { 
    display: block ! important; 
} 
 
@media print { 
    * { 
        border: 0px ! important; 
        float: none ! important; 
    } 
 
    #content, 
    #container, 
    #content-wrapper, 
    .box .bottom-corners, 
    .box .bottom-corners div, 
    .box .top-corners, 
    .box .top-corners div, 
    .box .box-content, 
    .box .box-content-2, 
    .box .box-bottom, 
    .box .box-content-3 { 
        margin: 0 ! important; 
        padding: 0 ! important ; 
        float : none ! important; 
        width: 100% ! important; 
        line-height: 130% ! important; 
    } 
 
    /* these don't work on paper */ 
    object, 
    .comments-sort, 
    .vendor-content-box { 
        display: none ! important; 
    } 
 
    /* visual garbage */ 
    .tags3, 
    .comment-reply, 
    .comment-header > p, 
    .comment-footer, 
    #interviewContent img[alt='show all'], 
    #interviewContent img[alt='hide all'], 
    #content-wrapper > .box > h2, 
    #footer { 
        display: none ! important; 
    } 
 
    /* Fix comment headings */ 
    .comment-header * { 
        display: inline ! important; 
    } 
 
    div.comment-header { 
        border-top: 1px solid gray ! important; 
    } 
 
} 
Now the only thing left is to convince the InfoQ guys to start providing transcripts for the presentations and downloadable slides :-)

Wednesday, July 23, 2008

Transformers vs Components

When I started using Mule, for some time I used to wonder what's the big difference between a transformer and a component. Why don't we just stick with the plain pipes and filters model and we complicate it with the 'component' concept?

In general, transformers have conceptually simpler interface - they are supposed to get a payload and return a payload, possibly of different type. You should prefer transformers for reusable transformation tasks. On the other hand, transformers can not:

  • Swallow messages - to achieve this use filter or router.
  • Return multiple messages - use router.
  • Pause temporarily

In addition, despite their more complex configuration components provide some neat features:

  • Message queuing - if your message goes through multiple stages of enrichment, each taking variable time, you might be better off implementing the stages as components and separating them with VM queues.
  • Statistics - automatically measures the processing time and volume through each in or outbound endpoint.
  • Lifecycle control - you can start, stop or pause component multiple times (while a component is not running the messages will queue up.) The transformers are only initialized and disposed once.
  • A componen allows you to tack on multiple filters, transformers, routers, etc. A transformer is one part of this chain.

To put it together, here is an overview of Mule's component roles in a simple message routing sequence (I'm trying to explain the roles, not the exact sequence.)

  • The message enters through an inbound endpoint, where the transport extracts the payload and the message headers and bundles them in a message. (The message and some other objects are bundled in EventContext, but it is still a mystery to me why do we need it.)
  • After the message is created it passes through a filter chain, where the filters are applied one after another and each of them votes with true or false a full consensus is required or otherwise the message is rejected. Depending on the config the rejected messages can be routed somewhere else. The filters are not supposed to modify the message (in practice they can, but it's a very bad idea.)
  • If a message passes all filters it goes through a transformer chain, where each transformer is applied to the output of the preceding one. A transformer is not able to gracefully reject a message, but it can throw an exception.
  • We might have an interceptor wrapped around the component, which can log stuff before or after invocation or measure latency, etc.
  • Based on the payload and on the component type, Mule resolves the method it should call and if necessarry transforms the payload to something that fits. The return value of the method is used as a payload to a single outgoing message.
  • The outgoing message is passed to an outbound router, that can split it in multiple messages or just drop it. The router then usually sends the message(s) to arbitrary number of outgoing endpoints. The outgoing router uses the filter chains of the outgoing endpoints to decide which ones to use. The outgoing endpoints can be Mule internal endpoints or periphery endpoints, conecting the application to external system.

In general, I would recommend that you start with transformer and refactor it to a component if you realize that you want to add any of the aspects mentioned above. You can even easily use transformers as POJO components.

Monday, July 21, 2008

Testing Mule applications

Antoine Borg has posted an intersting article at Ricston Blog about the feasibility of using mock objects for testing Mule applications. The bottom line is that in integration scenario is usually difficult to capture the application specification in mocks and often it's easier to write stub applications simulating the external systems (outside of your application process).

I've found that often I don't really test the whole application, but instead I test single components with a few attached transformers, filters and routers. I start by writing a simple Mule configuration - a single component using VM endpoints with queuing. Then I use MuleClient to feed in canned input data and assert the output from the outbound endpoint(s). As the application takes shape, I add transformers and routers as needed to approximate the real usage.

I could imagine that the next step would be to reuse my production configuration and extract the perimeter endpoint definitions in a new file (separating them from the model, connectors and internal endpoints) and pass the two configuration files to the config builder. This would allow me to create an alternative perimeter endpoints file using VM endpoints, so I can instantiate it in test case and use MuleClient for testing.

The benefits of the approach are that you are testing your component and routing logic and are not exposed to the peculiarities of the external system. Ideally we still want to have a full integration tests, including ones covering crash-failure, failover, connectivity loss and overload scenarios. We can achieve parts of this by stubbing the external system, but so far I usually find it difficult to reproduce the behaviour faithfully enough (especially when we have limited understanding about the external system).

No bulletpoints this time.

Monday, June 16, 2008

Build, Buy or Steal

This post is based on a text I wrote at my day job. It is edited to reflect my personal opinions and does not represent my employer’s view in any way.

The classic 'build vs. buy' dilemma is further complicated by the new wave of quality open source components.

Even if we assume that the open-source alternatives are generally inferior to their closed-source counterparts, their low price makes them an attractive choice for getting started, provided that there is an easy migration path to more powerful solutions. Additionally, many open-source tools are backed by commercial companies which can provide support, consulting and guidance on demand.

Open-source components are often (not always) higher quality than the internally developed libraries. This is because of the usually pretty-good community QA (many users; no time pressures) and the small-vendor mentality of the backing firms (they need to be much better than the industry standard in order to convince a client). Also, a solid OS project provides a decent documentation and community support, often has books written about it and, in general, makes it easier to recruit people already familiar with it.

We do want to build components in-house when we believe that we can provide and need more value than their OS and commercial counterparts can provide. We always need to keep in account that building good library is a continuous investment in new features, bug fixing and documentation. Providing a sub-optimal internal component is the worst of all worlds.

Another big question when to contribute our changes back to an open source project. Conventional wisdom says that you shouldn't give our work for free. Still, given the previous paragraph, I would claim that it's in our best commercial interest to contribute back any changes to the external module's core. In such cases, the initial investment is small, the maintenance is high (as we need to reapply changes with every release) and the business value of the code is low. Alternative is to fork the open-source component, which brings us to the problem in the previous paragraph.

We might also decide to contribute or open source internally developed extensions of Open Source library, gaining free QA and possibly bugfixes. We should not open source code that gives us [our product] significant advantage over alternative solutions on the market. We must not share non-generic code, capturing: business processes, algorithms, site-specific logic, etc.

And now is the time for the mandatory list at the end of the post. I'm going to enumerate a few build vs buy decision anti-patterns:

  • Not Invented Here (NIH) Syndrome - you know, when we write your own thing because we are too lazy to do our research or read the docs.
  • The Wrapping Party - every external library is wrapped in order to integrate into the proprietary architectural framework. Though it might look like a good decision, it often leads to difficulties in the debugging, inability to apply best practices and tools, and generally inefficient use of the library
  • Nobody Got Fired for Buying Expensive Stuff - when technical decisions are taken by managers without enough information (or understanding). Sometimes this is rationalized as that all the products look the same on paper, so at least this one comes from a reputable company we can sue. Problem is that often the very expensive do-it-all products require a staff of rocket scientists (or vendor consultants) in order to deliver anything after that.
  • The First One is Free - some vendors try to promote their products as open source, while capturing your data and interfaces in proprietary formats and protocols and then selling services around them. The SOA RAD tools and BPM tools are particularly bad offenders here. The problem is that this limits the ways for evolving your platform. The way to prevent this is to always be aware what part of the solution is platform specific: configuration (is it documented file format), POJO vs proprietary interface components, can you get the whole solution as a bunch of text files, what protocols are used for communication, what is the data storage, can we plug our custom infrastructure, where?

Tuesday, June 3, 2008

The Shape of Complexity

Some things are complex and there's no way around it. Still, even if we can't remove the complexity, often we can shape it into different forms. Some applications have shallow and wide complexity - many simple things with relatively few dependencies, but the overall system has mind-boggling emergent behavior (my favorite 25kloc perl scripts example). Other applications have narrow and deep complexity - small code base, but using so much infrastructure amd metaprogramming, that to understand what's going on, you need to be expert in the platform (think about Hello World written with EJB2 or any application using Ruby on Rails).

The horizontal complexity is pretty easy to deal with - find two similar things and create abstraction for them, rinse and repeat. It is important to stop and look every now and then for similar abstractions and for ugly usages. Eliminate the similar abstractions by extracting common functionality in helpers or superclasses; deal with the ugly usages by splitting the abstraction.

By definition, when we introduce abstractions, our complexity becomes 'narrower' and 'taller'. If we put too much stuff into the abstract classes or we nest them too much, we might transform the bunch of simple classes that we couldn't understand when taken together into a somewhat smaller bunch of more complex classes that we can't understand even in isolation. If go overboard in the other direction, e.g. adding too many facades and convenience methods, the API becomes too big without providing enough benefit to learn it (canonical example - I know a project that has a Strings class, containing constants for empty string, single digits, punctoation marks, single letters and other. Apart from being pointless, more verbose and difficult to apply consistently, this also couples all classes in that project to the util package).

So, how do we end up with code base that is neither too tall, neither too wide, but just the right shape?

  • Acknowledge that the right shape depends on the individual - some people can cope with more abstraction, while others can remember more facts. A metric for abstraction efficiency can be defined as delta-loc/nlevels-of-indirection (where n is a constant bigger than 2).
  • Keep in mind the choice of tools - for example IntelliJ IDEA excels at navigating well-factored code, while Vi people often prefer decoupled classes that can be changed with low risk of impacting other areas of code.
  • Consider the infrastructure maturity - using XA transactions with JMS is simple, debugging buggy JMS (caugh.. activemq.. caugh) is entirely different issue.
  • The experience of the team is important - for some people JMS is the most obvious way to send a piece of data once and only once to another system; others treat JMS as black magic and resort to FTP and cunning rename+move schemes, involving multiple directories and recovery scenarios.
  • And finaly, even if different pieces of infrastructure can provide similar functionality, sometimes they vary by the way they do it - in my previous article about component granularity, I mentioned the fat component vs fat Mule configuration scenarios. In this case, we can move the complexity between the Java code and the Mule code. Mule provides out of the box abstractions for threading, routing and transforming, but it is not great as a general-purpose process definition language. Java provides general purpose language, but does not provide high level primitives for threading and routhing. Another example: consider running OLAP against OLTP schema and dedicated warehouse schema - it works, but the difference in speed and CPU utilization can be orders of magnitude.

Tuesday, May 27, 2008

Pushing Files

In the next couple of posts I'll try to capture a couple of scenarios that I think Mule (and in general ESB) fits in. The target audience are people like my friend who once asked me once 'Well, I got it - it's all about transports, transformers, routers and components. In the end, what would you use it for?'

When I started reading about Mule, the first thing that caught my eye was the huge collection of transports and third party modules. It allows you with a relatively little configuration to automate scenarios like sucking a file, uploading it to FTP and sending an email if it failed or moving it to a audit dir on success. Add a few lines and you can have Mule try to upload, if it fails, try to post to JMS queue and if it still fails automatically create a JIRA issue.

Still, even though Mule is easy to use, it's not simple. Actually even for basic use of Mule, you need to know XML, have idea about Enterprise Integration Patterns and be familiar with Java server applications. Suffice to say that many of my colleagues do not meet these requirements. Compare this with writing a bash or Perl script - it's much easier and quite often it is good enough. I wouldn't be a stretch to say that the TCO of a Perl script for the previous scenario is 10 times less than the TCO of an application built using Mule or any other free or commercial integrated solution.

Things change a bit when you have a big number of integration jobs. If the tasks are similar, a smart scripter would abstract away the processing, capturing it into a parameterized sub-routine or shell script, with multiple simple wrappers with different arguments. Unfortunately, much too often the strategy is copy-and-modify (especially when the guy who wrote the original script has left the company, and the guy who did the first 30 clones also). Cloning is a pragmatic, low-risk strategy since it minimizes the chance that you break stuff that works. Such developers are usually praised for their get-it-done, goal oriented attitude and management likes their low-risk approach. Still, the dirty secret that nobody likes to talk about is that the system grows in complexity until the moment when nobody knows what is it doing anymore (I think Skynet started that way).

Even we assume that the TCO of a single Perl script is 10 times less than the one of a Mule implementation, a single Mule instance with relatively little custom development can easily handle the job of all the 170 scripts that I counted in one directory on a production server I use. If we assume that these scripts were derived from 10 archetypes (actually it's less) this still gives us a cost reduction of 170%. Even more important - the Mule and Spring configurations give you a roadmap, which while not trivial is much easier to comprehend and audit than the 27,227 lines of repetitive Perl code.

To be fair, in this example I am comparing bad scripting code, written by multiple people over the time, without proper governance to an hypothetical Mule solution written by decent developers. Let's assume that you have a team of good scripting developers, writing well structured code and taking care to provide maintenance documentation and guidelines for extending the system. Actually, this would work fine. Effectively these guys will be defining a platform quite similar to what you get from an ESB (in the end it's all Turing-complete languages). The only catch is that in my experience, in a typical enterprise it's much less probable to meet such Perl coders than the above mentioned Java guys. The argument that Perl programs don't have to be messy is quite similar to "guns don't kill people".

Even if we assumed that the mythical Perl programers did exist (and you had all three of them in your IT department), there are still a couple of points that make an Mule an attractive solution.

  • Mule gives you a proven architectural framework, so you don't have to invent your own. This includes guidelines, some documentation (getting better) and abstract base classes for many of the EIP patterns.
  • The business code written against Mule is usually framework agnostic (POJO).
  • Various transports, transformers and out-of-the box components.
  • Handles advanced stuff as XA transactions and threading models that are just not possible using Perl.
  • Since the bulk of your application will be built on a platform, it will be easier to find people whi have some experience with it, which will shorten the learning curve.
  • Extensive JMX monitoring and management.

And here are some good reasons to keep cloning scripts:

  • Naturally resilient to regressions.
  • Allow you to hire cheaper developers (even if you hire expensive high-qualified ones, they will be just as good as the cheap ones).
  • If you plan to scrap the whole thing in the near future (i.e. because the whole system is being replaced)
  • If the job is not critical, doesn't change often and forking a script is so easy that there's no point investing in anything better.
  • leg·a·cy [ˈle-gə-sē] - code that actually works.

Monday, May 26, 2008

Component Granularity Musings

This article was written based on my experience with Mule ESB, but the general principles should apply for other similar products as well.

Short description of an ESB application

A typical ESB (Enterprise Service Bus) application starts with an endpoint, which is an abstraction for a way to receive some 'event' which would make us do some work. The event carries some information, which could be structured (i.e. JMS message or HTTP request) or simple (the timer fired) - we'll call this information 'payload'. The endpoint passes the event (containing the payload) to a 'component', which does some kind of work and optionally emits a single output event. The output event goes to an outbound endpoint which can send the result back to the client, write it into a database, relay it to an external system, or do something else.

This was a high-level description - the benefits seen from here are that we get to reuse the communication logic between applications and since the endpoints are abstract, we can easily swap the actual transports. Still, a typical ESB provides some additional functionality. In the example above, when we receive an event it was relayed to a predefined component, but that doesn't have to be the case. We can have multiple components configured in the ESB and use 'routers' to determine the component that will process this particular event.

The routers could determine the actual component depending on the payload properties (in which case we call them CBR - content based routers) or based on anything else (e.g. we can have a support call router which routes to different endpoints based on the time of day). When router receives an event, it optionally emits one or many events (unlike a component that optionally emits one). Also, components are supposed to contain exclusively business logic, while routers are mostly concerned with mediating the events in proper manner. Other useful types of routers include aggregating, splitting, resequencing and chaining routers.

For the sake of completeness there are also 'transformers' which can be plugged between endpoint and component to apply some transformation on the event payload, converting the data or adding additional information (enriching); and 'filters' which can be used to discard the messages based on some criteria.

Since the routers, filters and transformers allow us to conditionally specify the components and endpoint, we can choose whether we want to implement our logic inside the component or outside (in the ESB configuration).

Ultra-fine grained routing (anemic components)

When each component contains no control-flow statements (straight-line imperative code), all the application logic is implemented in the ESB configuration. Some would argue that a system implemented in this fashion is more flexible and allows for faster turnaround and better introspection. While these statements are true, we need to consider the other side as well.

First, you are mixing business logic and communication logic at the same level, which is bad since it forces the maintainers to understand both (ideally they should be separated as much as possible). Not only this, but your code becomes platform-dependent (the configuration is code and in this case the ESB becomes more of a platform and less of a glue). This in turn increases the chance that upgrades of the ESB software would break your stuff (especially if you use undocumented or experimental features). We also need to keep in mind that the domain complexity has not changed - we are just expressing it using a different language.

Second, chances are that you have more developers knowing Java than the ESB config language. Also, during development it's much easier to step through the code of a single component rather than trace a message as it goes through multiple queues and thread pools.

From operations point of view, there are more knobs to turn, which increases the chance to turn the wrong ones. The monitoring can take advantage that there are more inspection points, but then you need the consider the usefulness of this information. Do you really care to know how many executions have you got for sell orders and how many for buy orders?

Many vendors (IONA, TIBCO, Progress) offer proprietary process modeling and metadata management facilities, allowing to express your business rules, routing and transformations in graphical notation and enforcing integrity based on metadata. These are expensive products and well worth their money if your project is big enough and you want to accept the vendor lock in. In that case, make sure that you make the most of it and take the time to learn how to use them instead of putting a half-assed simplification layer on top of them (just an example - so far I've seen two wrappers around Spring (in different companies) aiming to make it 'easier' to use. No need to say that none of them had any documentation).

Ultra-coarse grained routing (fat components)

In this scenario we have one component only. Multiple inbound endpoints go in (possibly passing through transformers and filters); and a single stream of events goes out. We use a content-based router to dispatch each event from the output stream, to one of the multiple outbound endpoints.

This is a code-centric approach (you won't make much use from a box-and-arrows editor). It allows one to use Java and standard Java tools and debuggers to implement, unit-test and debug the bulk of the application, while still abstracting the communication code and mundane stuff like transformations. If you want to expose application details for monitoring you have to do it by manually registering your custom MBeans or using the Spring JMX exporter.

One sign that you should consider splitting your component is if you start doing threading. This includes maintaining worker pools, doing half-sync/half-async dispatching, messing with locking, synchronization and notifications.

Sometimes there might be better abstractions for some pieces of code. For example, if your component implements a simple, generic transformation and you are using Mule, consider extracting it into a Transformer. On the other hand, if the transformation is simple but you feel it is not generic enough, you have a choice: a) move it to a transformer and have your component focus on the business logic; or b) leave it in place and reduce the number of classes you maintain.
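
As an illustration of the 'extract a Transformer' option, here is a minimal sketch assuming the Mule 2.x API (org.mule.transformer.AbstractTransformer); the transformer and its class name are made up, and package and method names differ between Mule versions.

    import org.mule.api.transformer.TransformerException;
    import org.mule.transformer.AbstractTransformer;

    // Hypothetical example: a small, reusable conversion pulled out of a component
    // so the component can focus on business logic.
    public class CommaSeparatedToArrayTransformer extends AbstractTransformer {

        public CommaSeparatedToArrayTransformer() {
            registerSourceType(String.class);
            setReturnClass(String[].class);
        }

        @Override
        protected Object doTransform(Object src, String encoding) throws TransformerException {
            // Straightforward, generic conversion - no business logic in here.
            return ((String) src).split("\\s*,\\s*");
        }
    }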

The Fine Line between Fat and Voluptuous... Components

...this time it is not about J Lo's booty. Actually, as much as I've looked, I haven't found a good set of recommendations on how to structure components in an ESB. There's some material from the SOA crowd, but it rests on the ill-founded assumption that all invocations between services have to be remote and marshaled through XML, which is not necessarily the case for a single application using an ESB as a platform.

I am in no way an authority on EAI, but here is my attempt at defining some guidelines (take them with a big lump of salt):

  • Multiple components make your application complex. A good starting point for a new application is to stick everything in one component and extract components as necessary, using the guidelines below. In general, I've found it easier to split components than to merge them (Java's method-invocation semantics are less expressive than ESB routing).
  • If you need to checkpoint your processing, consider splitting the stages into different components separated by durable queues. The queues could be either JMS queues or Mule VM queues with persistence enabled. With proper use of transactions, this allows you to survive an application crash or do a failover (you still need to take care of any non-transactional endpoints). An alternative is to keep a single component and checkpoint into a distributed transactional cache configured with redundancy.
  • If you need to control the resources allocated to a certain part of the processing, you can extract it into a separate component and change the number of workers, following a SEDA approach (check also this article, as well as the Scatter-Gather and Composed Message Processor patterns). Another valid approach is to use a compute grid like GridGain, which gives you more functionality but increases the total complexity of the application. We should also mention GigaSpaces, which provides a platform based on JINI and JavaSpaces with good integration capabilities.
  • When parts of the processing have different state lifecycles. This is best illustrated by example: we receive a stream of events on an inbound endpoint (let's call it the 'control stream'). Each event has a payload consisting of a condition and a 'data stream' endpoint address. When certain conditions are met, we want to start receiving and processing the events on that data stream in a way that requires independent state. The implementation would be one component that monitors the control stream and keeps track of the registered processors, creating new ones as necessary (see the sketch after this list). The processors are concerned only with their own data stream of events, and each of them has separate state. A clumsy alternative (antipattern?) is to have a single component subscribed to all data streams and use a cache to look up the state based on some key derived from the incoming event. A possibly better alternative for this specific case is to embed an ESP engine like Esper.
  • If you want to extract a component in order to reuse it, consider extracting it as a domain object first. If it encapsulates processing and is ESB-aware, then perhaps it's better off as a transformer, router or agent (using Mule terminology here). So far I haven't had the need to reuse a business-logic component between projects.
  • If two components are tightly coupled and do not take advantage of any ESB-supplied functionality on the connection between them, consider hiding them behind a single Java facade. It's easier to unit test a POJO than to integration-test a component.
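
To make the state-lifecycle guideline above more concrete, here is a plain-Java sketch of the control-stream idea; all names are hypothetical, and the actual endpoint subscription and routing, which would be done through the ESB, is only hinted at in comments.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical sketch: one component watches the control events and
    // creates/disposes per-stream processors, each owning its own state.
    public class ControlStreamMonitor {

        /** Per-data-stream processor with independent state. */
        public static class DataStreamProcessor {
            private long eventsSeen; // state local to this data stream

            public void onDataEvent(Object payload) {
                eventsSeen++;
                // ... stream-specific processing ...
            }
        }

        private final Map<String, DataStreamProcessor> processors =
            new ConcurrentHashMap<String, DataStreamProcessor>();

        /** Called for every event arriving on the control stream. */
        public void onControlEvent(String condition, String dataStreamAddress) {
            if ("START".equals(condition)) {
                processors.putIfAbsent(dataStreamAddress, new DataStreamProcessor());
                // in a real system we would also register an inbound endpoint for
                // dataStreamAddress and route its events to the new processor
            } else if ("STOP".equals(condition)) {
                processors.remove(dataStreamAddress);
            }
        }
    }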

As always, any opinions are welcome.

Sunday, May 25, 2008

Transformations using Domain Adapters

Actually, this entry started as a reply on the Mule-dev mailing list.

...

What caught my attention was your transformation challenge. Specifically, how you decided to have a less anemic domain model and move transformations there instead of dedicated transformers (hope I didn't misinterpret it). Could you shed more light on this move? This could be an interesting pattern for some cases.

Andrew

Well, the idea is that your domain uses adapters to wrap the source data and the accessors perform the transformation in place. Since it's usually a straight mapping, we haven't found the need to cache the values.

The mutators also store directly to the underlying source bean, except in the cases where the value is derived from multiple input fields and updating them would break some consistency rules (in such a case it is actually better to avoid providing a mutator at all, if you can). A third approach is to have a map of changed properties and have all accessors check there first and all mutators write there. This way you don't have to do a deep copy when you move the message over the VM transport.
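
A minimal sketch of such an adapter, assuming the source data arrives as a map of strings (the real source would be a vendor bean); the class, field and key names are hypothetical. The accessor converts in place and consults the changed-properties overlay first, while the mutator writes only to the overlay.

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical domain adapter: accessors translate straight from the wrapped
    // source data, mutators go to an overlay map of changed properties, so no
    // deep copy of the source message is needed.
    public class OrderAdapter {

        private final Map<String, Object> sourceMessage;                          // raw source data
        private final Map<String, Object> changed = new HashMap<String, Object>(); // overlay

        public OrderAdapter(Map<String, Object> sourceMessage) {
            this.sourceMessage = sourceMessage;
        }

        /** Accessor: check the overlay first, otherwise convert in place. */
        public long getQuantity() {
            Object override = changed.get("quantity");
            if (override != null) {
                return ((Number) override).longValue();
            }
            return Long.parseLong((String) sourceMessage.get("ORDER_QTY"));
        }

        /** Mutator: never touches the underlying source data. */
        public void setQuantity(long quantity) {
            changed.put("quantity", quantity);
        }
    }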

The technical part is that there is a transformer which has its source and output classes configured in the Mule configuration (I had to add a custom setter for the source class). During initialization, the transformer resolves a constructor of the output class that takes a single instance of the source class as an argument. The transformation itself simply invokes that constructor with the payload. Note that the specified output class has to be a concrete class in this case. Perhaps I could have done something similar using expressions, but I like the type safety of this approach (if one of the classes is missing, it blows up right away when the transformer is initialized).
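
Here is a hedged, Mule-agnostic sketch of what such a transformer could look like - just the reflection part; the real one would extend a Mule transformer base class and hook these methods into the Mule lifecycle. All names are hypothetical.

    import java.lang.reflect.Constructor;

    // Hypothetical sketch: the source and output classes are injected via setters
    // from the configuration, the constructor is resolved once up front, and each
    // payload is passed to it.
    public class ConstructorInvokingTransformer {

        private Class<?> sourceClass;
        private Class<?> outputClass;
        private Constructor<?> constructor;

        public void setSourceClass(Class<?> sourceClass) { this.sourceClass = sourceClass; }
        public void setOutputClass(Class<?> outputClass) { this.outputClass = outputClass; }

        /** Resolve the constructor once; a missing class or constructor fails here, at startup. */
        public void initialise() throws NoSuchMethodException {
            constructor = outputClass.getConstructor(sourceClass);
        }

        public Object transform(Object payload) throws Exception {
            return constructor.newInstance(payload);
        }
    }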

Pros:
  • You can easily trace why the data is the way it is.
  • Adding a new field requires changes to only one class (the adapter).
Cons:
  • At least the first layer of adapters is coupled to your source objects (if you use regular transformers, the transformer clearly decouples the source and output models). I would advise putting these in separate packages.
  • Needs better regression testing. Usually one catches a good number of breaking data changes in the transformation step. Since we transform on demand, you either need bigger unit tests or risk having problems go unnoticed until integration testing.
  • You lug a lot of data around, and I can imagine that the serialization and cloning overhead could become prohibitive. In such cases you can have a method like Adapter.pruneStuffIDontNeed() that removes the parts of the input message that have not been used so far (you also need to track them).

Saturday, May 24, 2008

An Integration Story or 5 Ways to Transform a Message

It all started when we decided to replace Moxie with Devissa*. Moxie was a decent system and it had aged well, but its years had started to show: the rigid data schema, the inflexible order representation, the monolithic C++ server... Don't get me wrong, it was and still is working great, but with time we realized that we needed something more. Something that would let us define the way we do business instead of having us change the business to fit its model.

* All names have been changed to protect the innocent

The global roll-out of Devissa looked like a good opportunity to bring in a more capable trading system. Devissa itself was a huge beast, composed of hundreds of instances of several native processes running with a variety of configurations, held together by TCL code, cron jobs and a templated meta-configuration.

The Moxie communication protocol was simple - fixed-length records sent in one direction, a 32-bit status code in the other, over a TCP socket (actually two sockets - uplink and downlink). Devissa was much more complex - the messages were framed using an XML-like, self-describing hierarchical format (logically it was the standard map of strings to arrays of maps... ending up with primitive values at the leaf nodes). The session-level protocol was simple, and luckily there was a Java library for it (I'll bitch about it some other time). On top of the sessions sit a bunch of application-level protocols, each with different QoS and MEP. There is also a registry, an authentication service and a fcache/replicator/database/event-processor thingie that sits in the center, but I am digressing.

I actually started this article to share some interesting stuff I learned while we migrated the order flow from Moxie to Devissa. Phase zero was to do a point-to-point integration from Devissa to Moxie using the FIX gateways of the respective products, routing orders entered into Devissa to Moxie, so the traders could work them in the familiar Moxie interface. It allowed us to receive flow from other offices which were already on the Devissa bandwagon, and it was great because we didn't have to code data transformations or behaviour-orchestration logic - it all 'just worked'.

The next task was to make sure that we could trade on Devissa and still produce our end-of-day reports from a single point. At the time all reporting was done from Moxie, so what seemed to make most sense was to capture the reportable events from Devissa and feed them back into Moxie. I'll spare you the BA minutiae for now.

As we were looking for a suitable base for creating a platform on which to build various applications around Devissa, I shortlisted a couple of ESB solutions (although it's an interesting topic, I won't talk about "what's an ESB and do I need one"). I looked at Artix, Tibco, Aqualogic, ServiceMix and Mule. I found that Artix ESB was great and Artix DS looked like a good match for our data-mapping needs; the only thing I was concerned about was the cost. Before getting in contact with the vendor, I asked my managers about our budget - they replied, almost surprised, that we didn't know; if it was good and worth the money we might try to pitch it to the global architecture group - in other words, a commercial product was not really an option. This ruled out pretty much everything, leaving ServiceMix and Mule (if I were starting now I would also consider Spring Integration). I read a bit about JBI. I tried to like it, I really did... still, I couldn't swallow the idea of normalizing your data on each endpoint and being forced to handle all these chunks of XML flying around. At that time Mule looked like the obvious answer for an open-source ESB.

The first thing I had to do was to build custom transports for Moxie and Devissa. That took about 2-3 days. They didn't have any fancy features (actually they barely worked), but I was able to receive a message from one and stuff a message into the other. Over the following year both transports evolved a lot, ending with a full rewrite last month that ported them to Mule 2 and added goodies like container-managed dispatcher threading, half-sync support, support for all Devissa application protocols and more.

The second phase was to build a neutral domain model, as described in Eric Evans's "Domain-Driven Design", which I had read recently. Then I wrote two transformers - Devissa2Domain and Domain2Moxie - implemented a simple POJO with about 15 lines of real code, and voila - all our Devissa orders and executions appeared in Moxie. Forking the flow to a database was really easy, since I could use the Mule JDBC connector, and it took only 10 lines of config. Storing the messages as XML was also easy with the Mule XStream transformer and the Mule File connector. The world was great.
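
For illustration, here is a compressed sketch of that double-translation shape - inbound format to neutral domain object to outbound format. The field names and the fixed-length layout are invented, and the Mule wiring and the real Devissa/Moxie types are left out.

    // Hypothetical sketch of the double-translation pipeline:
    // proprietary inbound format -> neutral domain object -> proprietary outbound format.
    public class DoubleTranslationExample {

        /** Neutral domain object, shared by both transformers. */
        public static class Order {
            public String symbol;
            public long quantity;
            public String side;
        }

        /** Devissa2Domain: maps the inbound representation to the neutral model. */
        public static Order fromDevissa(java.util.Map<String, Object> devissaMessage) {
            Order order = new Order();
            order.symbol = (String) devissaMessage.get("INSTRUMENT");
            order.quantity = Long.parseLong((String) devissaMessage.get("QTY"));
            order.side = (String) devissaMessage.get("SIDE");
            return order;
        }

        /** Domain2Moxie: renders the neutral model as a fixed-length record. */
        public static String toMoxie(Order order) {
            return String.format("%-12s%010d%-4s", order.symbol, order.quantity, order.side);
        }
    }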

Not really. It turned out that the DB storage and the file-based audit were not real requirements, so we cut them really quickly (or perhaps they made the first release). Soon, during UAT, it turned out that even though the BAs had created quite detailed requirements, they didn't match what the business wanted. Even worse - the business itself wasn't sure what it wanted. We were going through a few iterations a day, discovering more data that needed to be mapped, formats that needed to be converted, vital pieces of information that were present in one model and not in the other and had to be either looked up from a static table or calculated from a couple of different fields, sometimes ending up stuck in a field that had a different purpose and that we were not using at the time.

During all this time the domain model kept growing. Each new piece of information was captured clearly and unambiguously in a Java bean with strongly typed properties, validation and so on. We went live on December 14th. The next day the system broke. We kept tweaking the business logic for quite some time, and for each tweak there were always three places to change - the domain model, the inbound transformer and the outbound transformer.

One day I decided to see what it would be like if we dropped the domain model altogether, replaced the inbound transformer with an isomorphic conversion from the Devissa data classes to standard Java collections, and then used a rule engine to build the outgoing Moxie message. Enter Drools. The experiment was a success - in a couple of days I was able to ditch my domain model (which had grown so specific to the application that it wasn't really neutral any more). Drools was working fine, though I had the feeling that something was wrong... I never asserted nor retracted any facts in my consequences - I was abusing the Rete engine. Actually, all I was doing was a glorified switch statement.

While I was at it, I decided to ditch Drools as well and use MVEL - one of the consequence dialects of Drools, which turned out to be a nice, compact and easy-to-embed language. MVEL is designed mainly as an expression language, though it has control-flow statements and other goodies. With MVEL, the whole transformation fitted on one screen and had the familiar imperative look and feel, but without the cruft. I was able to plug in some Java functions using the context object, which allowed me to hide some ugly processing; and the custom resolvers allowed me to resolve MVEL variables directly from the Devissa message and assign them directly to the properties of the Moxie message beans.
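
A minimal sketch of the direct-mapping idea, assuming MVEL 2.x (org.mvel2) and using plain maps instead of the real Devissa and Moxie types; the custom variable resolvers and the plugged-in Java functions mentioned above are omitted.

    import java.util.HashMap;
    import java.util.Map;

    import org.mvel2.MVEL;

    // Hypothetical example: the whole transformation is a short MVEL script and
    // the input/output objects are exposed as context variables.
    public class MvelMappingExample {

        public static void main(String[] args) {
            Map<String, Object> devissa = new HashMap<String, Object>();
            devissa.put("INSTRUMENT", "VOD.L");
            devissa.put("QTY", "1500");

            Map<String, Object> vars = new HashMap<String, Object>();
            vars.put("in", devissa);
            vars.put("out", new HashMap<String, Object>());

            // The script would normally live in a file next to the Mule configuration.
            String script =
                  "out['symbol'] = in['INSTRUMENT'];"
                + "out['quantity'] = Integer.parseInt(in['QTY']);"
                + "out";

            Object moxieFields = MVEL.eval(script, vars);
            System.out.println(moxieFields); // e.g. {symbol=VOD.L, quantity=1500}
        }
    }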

Some time after that, for a different project building on the same foundation, I decided to see if I could infer an XML schema from the XML serialization of the Devissa messages. After some massaging I used that schema to generate the domain model with JAXB and tried to see how it felt. It was a disaster. A typical Devissa message has more than 50 properties (often more than 100), of which you usually need 10-20. Also, the generated property names were ugly. Even after conversion from CONSTANT_CASE to camelCase, they were still ugly. The automatically generated beans were practically unusable, the XML did not look human-editable, and the XSD added no real value since it lacked any semantic restrictions, so the whole thing felt like jumping through hoops. In the end I dropped the whole JAXB idea and went with MVEL again.

Third time lucky: at the beginning of this March I started a new project. This time I decided to try yet another approach - in the inbound transformer I wrap the raw Devissa message in an adapter, exposing the fields I need as bean properties but carrying the full dataset of the original message. It works well. One particular benefit is that you can always look at the source data and see if there is anything there that might be useful.

In conclusion I'll try to summarize:

  • A neutral model plus double translation can yield benefits when the domain is well known, especially if it is externally defined (i.e. a standard). On the other hand, it's a pain in the ass to maintain, especially if the domain objects change frequently.
  • Rule engines are good when you have... ahem, rules. Think complex condition, simple consequence. In the original Rete paper, the consequences are only meant to assert and retract facts. Changing an object in the working memory or doing anything else with side effects behind the engine's back is considered bad practice at best or (usually) plain wrong. Even when using fact invalidation (truth maintenance), it has a big performance impact.
  • Direct mapping using an expression language works well, especially for big and complex messages. The scripts are compact and deterministic, which makes them maintainable. You might need to write your own variable resolvers and extend the language with custom functions. Debugging can be a nuisance, but if you keep the control flow to a minimum and use plugged-in Java functions, it's quite OK.
  • Adapters are a middle ground between double translation and direct mapping. They tend to work well for providing an internal representation for the application, and you can also stuff some intelligence in them without worrying that somebody might regenerate them. With a bean-mapping framework like Dozer you can even automate the transformation to the output datatype, though for many cases that would be overkill (sometimes 200 lines of straight Java code are more maintainable than 50 lines of XML or 10 lines of LISP).
  • XML works well if your output format is XML, or if you need to apply transformations with XSLT or render documents with XSL-FO. As we know, you can run XPath over bean and collection graphs using JXPath (see the sketch below); any expression language can provide similar capabilities.
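
As a small illustration of the JXPath point above, here is a hypothetical bean graph queried with an XPath-like expression; the classes, properties and values are made up.

    import org.apache.commons.jxpath.JXPathContext;

    // Hypothetical illustration: XPath-style queries over plain bean and
    // collection graphs, no XML involved.
    public class JXPathExample {

        public static class Trade {
            private final String symbol;
            private final double price;

            public Trade(String symbol, double price) {
                this.symbol = symbol;
                this.price = price;
            }
            public String getSymbol() { return symbol; }
            public double getPrice() { return price; }
        }

        public static class Blotter {
            private final java.util.List<Trade> trades =
                java.util.Arrays.asList(new Trade("VOD.L", 142.5), new Trade("BP.L", 510.0));
            public java.util.List<Trade> getTrades() { return trades; }
        }

        public static void main(String[] args) {
            JXPathContext ctx = JXPathContext.newContext(new Blotter());
            // Select the price of the trade whose symbol is VOD.L
            Object price = ctx.getValue("trades[symbol='VOD.L']/price");
            System.out.println(price); // 142.5
        }
    }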

Next time, I'll write about component decomposition, content-based routing vs coarse-grained components and how to decide whether to do the transformation in a component or in a transformer.
