Dimitar's Blog

Sunday, May 25, 2008

Transformations using Domain Adapters

Actually, this entry started as a reply on the Mule-dev mailing list.

...

What caught my attention was your transformation challenge. Specifically, how you decided to have a less anemic domain model and move transformations there instead of dedicated transformers (hope I didn't misinterpret it). Could you shed more light on this move? This could be an interesting pattern for some cases.

Andrew

Well, the idea is that your domain uses adapters to wrap the source data and the accessors perform the transformation in place. Since it's usually a straight mapping, we haven't found the need to cache the values.

The mutators also store directly to the underlying source bean, except in the cases where the value is derived from multiple input fields and updating them would break some consistency rules (actually, in such a case it would be better to avoid providing a mutator at all if you can). A third approach is to have a map for changed properties and have all your accessors check there first and all mutators write there. This way you don't have to do a deep copy when you move the message using the VM transport.
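To make the accessor/mutator idea a bit more concrete, here is a minimal sketch of such an adapter using the "map of changed properties" variant. The class names (`SourceOrder`, `OrderAdapter`) and the field keys are made up for illustration; the real source beans looked different.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the raw source message: a bag of string fields.
class SourceOrder {
    final Map<String, String> fields = new HashMap<>();
}

// Adapter exposing only the fields the application needs as typed properties.
class OrderAdapter {
    private final SourceOrder source;                             // full original payload, never modified
    private final Map<String, Object> changed = new HashMap<>();  // overlay holding modified properties

    OrderAdapter(SourceOrder source) {
        this.source = source;
    }

    // Accessor: check the overlay first, otherwise transform the source value on demand.
    long getQuantity() {
        Object value = changed.get("quantity");
        return value != null ? (Long) value : Long.parseLong(source.fields.get("ORD_QTY"));
    }

    // Mutator: writes to the overlay only, leaving the source untouched.
    void setQuantity(long quantity) {
        changed.put("quantity", quantity);
    }

    // Derived from two input fields, so no mutator is offered for it.
    String getAccount() {
        return source.fields.get("BOOK") + "/" + source.fields.get("TRADER");
    }
}
```

Because the source object is never written to, several adapters can share it and only the small overlay map differs between them.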

The technical part is that there is a transformer, which has its source and output classes configured in the Mule configuration (I've had to add a custom setter for the source class). In the transformer initialization, it resolves a constructor of the output class that takes a single instance of the source class as its argument. The transformation itself is invoking the constructor with the payload. Note that the specified output class has to be a concrete class in this case. Perhaps I could have done something similar using expressions, but I like the type safety of this approach (if one of the classes is missing it blows up at runtime).
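Stripped of the Mule plumbing, the mechanism is roughly the following. This is a plain-Java sketch, not the actual transformer: `ConstructorTransformer`, `RawMessage` and `MessageAdapter` are hypothetical names, and the real class would extend the Mule transformer base class.

```java
import java.lang.reflect.Constructor;
import java.lang.reflect.InvocationTargetException;

// Builds the output object by calling outputClass(sourceClass) reflectively.
class ConstructorTransformer<S, T> {
    private final Class<S> sourceClass;
    private final Constructor<T> constructor;

    // Resolve the constructor once, at initialization, so a bad configuration fails early.
    ConstructorTransformer(Class<S> sourceClass, Class<T> outputClass) {
        this.sourceClass = sourceClass;
        try {
            this.constructor = outputClass.getDeclaredConstructor(sourceClass);
            this.constructor.setAccessible(true);
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException(
                outputClass.getName() + " has no constructor taking " + sourceClass.getName(), e);
        }
    }

    // The transformation itself: invoke the constructor with the payload.
    T transform(Object payload) {
        try {
            return constructor.newInstance(sourceClass.cast(payload));
        } catch (InstantiationException | IllegalAccessException | InvocationTargetException e) {
            // InstantiationException is what you get when the output class is abstract.
            throw new IllegalStateException("Cannot adapt " + payload, e);
        }
    }
}

// Hypothetical source type and its adapter, only here to exercise the transformer.
class RawMessage {
    final String text;
    RawMessage(String text) { this.text = text; }
}

class MessageAdapter {
    private final RawMessage raw;
    MessageAdapter(RawMessage raw) { this.raw = raw; }
    String getText() { return raw.text.trim(); }
}
```

With this in place, `new ConstructorTransformer<>(RawMessage.class, MessageAdapter.class).transform(payload)` yields an adapter wrapping the payload, and a payload of the wrong type is rejected with a `ClassCastException`.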

Pros:
  • You can easily trace why the data is the way it is.
  • Adding new field requires changes to only one class (the adapter).
Cons:
  • At least the first layer of adapters is coupled to your source objects (if you use regular transformers, the transformer clearly decouples the source and output models). I would advise putting these in separate packages.
  • Needs better regression testing. Usually one catches a good number of breaking data changes in the transformation step. Since we transform on demand, this means that you either need bigger unit tests or might have problems go unnoticed until integration testing.
  • You lug a lot of data around; I can imagine that the serialization and cloning overhead could become prohibitive. In such cases you can have a method like Adapter.pruneStuffIDontNeed() that removes the parts of the input message that have not been used until now (you also need to track them).

Saturday, May 24, 2008

An Integration Story or 5 Ways to Transform a Message

It all started when we decided to replace Moxie with Devissa*. Moxie was a decent system and it had aged well, but its years had started to show. The rigid data schema, the inflexible order representation, the monolithic C++ server... Don't get me wrong, it was and still is working great, but with time we realized that we needed something more. Something that would let us define the way we do business instead of having us change the business to fit in its model.

* All names have been changed to protect the innocent

The global roll-out of Devissa looked like a good opportunity to bring in a more capable trading system. Devissa itself was a huge beast, composed of hundreds of instances of several native processes running with a variety of configurations, held together by TCL code, cron jobs and a templated meta-configuration.

The Moxie communication protocol was simple - fixed length records sent in one direction, 32 bit status code in the other, over a TCP socket (actually 2 sockets - uplink and downlink). Devissa was much more complex - the messages were framed using an XML-like self-describing hierarchical format (logically it was the standard map of strings-to-arrays of maps... ending up with primitive values at the leaf nodes). The session level protocol was simple and luckily there was a Java library for it (I'll bitch about it some other time). On top of the sessions sit a bunch of application level protocols, each with different QoS and MEP. There is also a registry, an authentication service and a cache/replicator/database/event processor thingie that sits in the center, but I am digressing.

I actually started this article to share some interesting stuff I learned while we migrated the order flow from Moxie to Devissa. Phase zero was to make a point-to-point integration from Devissa to Moxie using the FIX gateways of the respective products, routing orders entered into Devissa to Moxie, so the traders could work them in the familiar Moxie interface. It allowed us to receive flow from other offices which were already on the Devissa bandwagon and it was great because we didn't have to code data transformations and behaviour orchestration logic - it all 'just worked'.

The next task was to make sure that we could trade on Devissa and still be able to produce our end-of-day reports from a single point. Right then all reporting was done from Moxie, so what seemed to make most sense was to capture the reportable events from Devissa and feed them back to Moxie. I'll spare you the BA minutiae for now.

As we were looking for a suitable base for creating a platform on which to build various applications around Devissa, I shortlisted a couple of ESB solutions (although it's an interesting topic, I won't talk about "what's an ESB and do I need one"). I looked at Artix, Tibco, Aqualogic, ServiceMix and Mule. I found that Artix was great and that Artix DS was a pretty good match for our data mapping needs. Still, when I asked about our budget, I got that typical surprised look, which made it clear that we didn't plan on spending on commercial licenses. This ruled out pretty much everything, leaving ServiceMix and Mule. I read a bit about JBI. I tried to like it, I really did... still I couldn't swallow the idea of normalizing your data to XML on each endpoint and being forced to handle all these chunks of XML flying around. At that time Mule looked like the obvious answer for an open-source ESB. Right now, if I was starting new development, I would also consider Spring Integration.

Again, I won't focus on Mule (let me know if you want me to). The first thing was to build custom transports for Moxie and Devissa. That took about 2-3 days. They didn't have any fancy features (actually they barely worked), but I was able to get a message from one and stuff a message in the other. During the following year these evolved a lot, ending up with a full rewrite last month, porting them to Mule2 and adding goodies like container-managed dispatcher threading, half-sync support, support for all Devissa application protocols and others.

The second phase was to build a neutral domain model as described in Eric Evans's "Domain Driven Design", which I had read recently. Then I wrote two transformers - Devissa2Domain and Domain2Moxie, implemented a simple POJO with about 15 lines of real code and voila - all our Devissa orders and executions appeared in Moxie. Forking the flow to a database was really easy, since I could use the Mule JDBC connector and it took only 10 lines of config. Storing the messages in XML was also easy with the Mule XStream transformer and the Mule File connector. The world was great.

Not really. It turned out that the DB storage and the file-based audit were not real requirements, so we cut them really quick (or perhaps they made the first release). Soon, during UAT, it turned out that even though the BAs had created quite detailed requirements, they didn't match what the business wanted. Even worse - the business itself wasn't sure what they wanted. We were going through a few iterations a day, discovering more data that needed to be mapped, formats that needed to be converted, and vital pieces of information that were present in one model and not in the other. These had to be either looked up from a static table or calculated from a couple of different fields, and sometimes ended up stuck in a field that had a different purpose, which we were not using right now.

During all this time, the domain model was growing. Each new piece of information was captured clearly and unambiguously in a Java bean with strongly typed properties, validation and stuff. We went live on December 14th. On the next day the system broke. We kept tweaking the business logic for quite some time and for each tweak, there were always three places to change - the domain model, the inbound transformer and the outbound transformer.

One day I decided to see what would happen if we dropped the domain model altogether, replaced the inbound transformer with an isomorphic conversion from the Devissa data classes to standard Java collections and then used a rule engine to build the outgoing Moxie message. Enter Drools. The experiment was a success - in a couple of days, I was able to ditch my domain model (which had grown to be so specific to the application that it wasn't really neutral any more). Drools was working fine, though I had the feeling that something was wrong... I never asserted, nor retracted any facts in my consequences - I was abusing the RETE engine. Actually, all I was doing was a glorified switch statement.

While I was at it, I decided to ditch Drools as well and use MVEL - one of the consequence dialects of Drools, which turned out to be a nice, compact and easy to embed language. MVEL is designed mainly as an expression language, though it has control-flow statements and other stuff. With MVEL, all my transformation fitted on one screen and had the familiar imperative look and feel, but without the cruft. I was able to plug in some Java functions using the context object, which allowed me to hide some ugly processing; and the custom resolvers allowed me to resolve MVEL variables directly from the Devissa message and assign them directly to the properties of the Moxie message beans.

Some time after that, for a different project building on the same foundation, I decided to see if I could infer an XML schema from the XML serialization of the Devissa messages. After some massaging I used that schema to generate the domain model using JAXB and tried to see how it felt. It was a disaster. A typical Devissa message has more than 50 properties (often more than 100). Usually you need 10-20 of them. Also, the generated property names were ugly. Even after conversion from CONSTANT_CASE to camelCase, they were still ugly. The automatically generated beans were practically unusable, the XML looked not-human-editable, the XSD was not adding any real value since it lacked any semantic restrictions, so the whole thing felt like jumping through hoops. In the end I dropped the whole JAXB idea and went with MVEL again.

Third time lucky: at the beginning of this March, I started a new project. This time I again decided to try a new approach - in the inbound transformer, I was wrapping the raw Devissa message in an adapter, exposing the fields I need as bean properties, but carrying the full dataset of the original messages. It works well. One particular benefit is that you can always look at the source data and see if there is anything there that might be useful.

In conclusion I'll try to summarize:

  • Neutral model plus double translation can yield benefits when the domain is well known, especially if it is externally defined (i.e. standard). On the other hand it's a pain in the ass to maintain, especially if the domain objects change frequently.
  • Rule engines are good when you have... ahem, rules. Think about complex conditions and simple consequences. Actually, in the original RETE paper, the consequences are only meant to assert and retract facts. Changing an object in the working memory or doing anything else with side effects behind the engine's back is considered a bad practice at best or (usually) plain wrong. Even when using fact invalidation (truth maintenance), it has a big performance impact.
  • Direct mapping using an expression language works well, especially for big and complex messages. The scripts are compact and deterministic, which makes them maintainable. You might need to write your own variable resolvers and extend the language with custom functions. Also, debugging could be a nuisance, but if you keep your control flow to a minimum and use plugged-in Java functions, it's quite OK.
  • Adapters are a middle ground between double translation and direct mapping. They tend to work well as an internal representation for the application, and you can also stuff some intelligence in them without worrying that somebody might regenerate them. With a bean mapping framework like Dozer you can even automate the transformation to the output datatype, though for many cases that would be overkill (sometimes 200 lines of straight Java code are more maintainable than 50 lines of XML or 10 lines of LISP).
  • XML works well if your output format is XML, or if you need to apply transformations with XSLT or render it using XSL-FO. As we know, you can run XPath on bean and collection graphs using JXPath; also any expression language can provide similar capabilities.

Next time, I'll write about component decomposition, content-based routing vs coarse-grained components and how to decide whether to do the transformation in a component or in a transformer.

Saturday, December 1, 2007

Obfuscating the GUI

When I was working on mobile applications, obfuscation was a mandatory part of the build. When every byte counts, you cannot afford to have long variable names or carry extra stuff if it's not critical to the application functionality. In fact we didn't really care about the actual obfuscation (it's quite difficult to take an app out of the phone anyway, and even then, the success of a mobile game usually does not depend on some top-secret algorithms). Back then it was all about jar size.

The other day, I got the task to obfuscate an application that we wanted to ship to an external client. The app was an SWT GUI, making use of reflection, runtime generics and runtime annotations. Also, the idea was to merge all libraries in the app jar and wrap everything in a native launcher. First I tried to merge the JARs. There was a small issue with the order of merging, since one of the libraries needed some file in META-INF which existed in more than one jar, but overall no major problems (good that we didn't use OSGi).

The next step was the obfuscation. Obfuscating a mobile app is pretty straightforward - you define all the library interfaces as seeds and let the obfuscator do the rest... errr, I guess that wasn't very clear, perhaps I should step back and take a look at ProGuard (my weapon of choice when it comes to free obfuscators), but the principles should apply to most of the products on the market.

The ProGuard obfuscation consists of a couple of stages:

Shrinking

Starting from the specified seed classes or methods, analyze the control flow and remove all the unreachable code. The different obfuscators have different ways of specifying the seeds, the simplest ones being "keep everything which is not part of my source" (this is actually enough for a mobile application) or "keep everything". You also need to include here any class which is accessed by reflection only (think plugins and DI), native methods, classes accessed exclusively from native code, classes used as default values of annotation attributes unless you always specify a proper value, etc.

The shrinking also removes all the attributes from the classes, fields and methods. If this doesn't mean much to you, you are not alone - one usually doesn't think about what's in a class until things start breaking (and break they did).

The first problem was that none of the stack traces contained line numbers. That was actually easy to fix - just keep the LineNumberTable attribute and replace the SourceFile and SourceDir with a fixed string - both are quite easy with ProGuard.

The next problem was that the DI container could not read the generic type arguments of the collections and was sticking strings inside them instead of URLs. Again, the fix was to keep an attribute - the Signature attribute contains the information used by the runtime generics reflection.
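To see what the Signature attribute buys you, here is a small sketch of the kind of lookup a DI container might do. The `Config` bean and `GenericProbe` helper are made up for illustration; only the `java.lang.reflect` calls are the real API.

```java
import java.lang.reflect.Field;
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.net.URL;
import java.util.List;

// Hypothetical bean that a DI container would populate.
class Config {
    List<URL> mirrors;
}

class GenericProbe {
    // Returns the declared element type of a collection field,
    // or Object.class when no generic information is available.
    static Type elementType(Class<?> owner, String fieldName) {
        try {
            Field field = owner.getDeclaredField(fieldName);
            Type type = field.getGenericType();
            if (type instanceof ParameterizedType) {
                return ((ParameterizedType) type).getActualTypeArguments()[0];
            }
            return Object.class; // raw type: this is all that is left without Signature
        } catch (NoSuchFieldException e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```

With the attribute in place, `GenericProbe.elementType(Config.class, "mirrors")` returns `URL.class`. Once it is stripped, `getGenericType()` only sees the raw `List`, so a container has nothing better to go on than the string it read from its configuration.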

Then I found that none of my runtime annotations were kept. After some time spent staring dumbly at the JVM spec (Chapter 4), I learned that the annotations are kept in another set of attributes - namely RuntimeVisibleAnnotations and RuntimeVisibleParameterAnnotations. The annotation default values are kept in an AnnotationDefault attribute of the corresponding method in the annotation class (or interface if you prefer) - you can strip these if you specify explicit values for all annotations.
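For reference, this is the sort of code that depends on those attributes. The `Plugin` annotation and the classes around it are invented for the example.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Hypothetical runtime annotation: stored in RuntimeVisibleAnnotations on the annotated class.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface Plugin {
    String name();
    int priority() default 5; // the default lives in AnnotationDefault on this method
}

@Plugin(name = "audit") // priority not given, so the default is needed at runtime
class AuditPlugin {
}

class AnnotationProbe {
    static String describe(Class<?> type) {
        Plugin plugin = type.getAnnotation(Plugin.class);
        return plugin == null ? "no annotation" : plugin.name() + ":" + plugin.priority();
    }
}
```

If RuntimeVisibleAnnotations is stripped, `getAnnotation` simply returns null and the class looks unannotated, which is exactly the silent failure I ran into.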

There were also some attributes related to enums, but it looks like they are not used at run time.

Optimization

Not really sure what it does exactly. I have seen it reduce the number of methods, but it has really only two settings - "optimize" (yes/no) and "number-of-passes". I guess that each pass does one level inlining if a method meets certain criteria, but it might also do many other whole-program optimizations.

One thing which might be interesting is a profile-guided optimization like the one in the Intel C++ compiler, where the optimizer would first instrument your classes, adding probes to your bytecode. Then you would run your app a couple of times to generate execution profiles and then optimize your app using them. Of course that's partly what HotSpot already does, but not everybody uses HotSpot and in any case it wouldn't hurt if the code takes the right branch without a jump in the majority of the cases.

Another possible profile-guided optimization would be to identify the order of loading classes and separate them by that - the early loaded in one jar, the later loaded in another and the barely used ones in a third - it can reduce the classloading time (if you put them on the classpath in the right order) and combined with the Java Modules proposal can help one create slimmer applications where you can start the app with the minimal jar and the rest is streamed as you work.

Obfuscation

The goal of the obfuscation process is making your code more difficult to decompile. Please note that I didn't say "impossible" - although decompiling obfuscated code exposes much less information and usually does not produce runnable Java code, it is perfectly possible for a motivated person to reverse-engineer obfuscated bytecode - it's just going to take longer. In the end it boils down to the cost/benefit perception - if somebody thinks it will be cheaper to hack your product, they will. Obfuscation raises the bar, but if you really care about the bottom line, you might be better off with openly available source and a suitable legal agreement (NDA, NCA and in some cases even patents might make sense).

Class/Method Renaming

The goal of the renaming is to make the classes and methods illegible. Usually this is achieved by changing the names to short identifiers (usually one or two letters). This also reduces the size on disk and the perm-size by reducing the constant pool. If you specify the option to use lower and upper case letters for different class names, you can make the jar impossible to extract on case-insensitive file systems as half of the classes would overwrite the other half. Another trick is to specify a dictionary of recommended identifiers, which contains all Java keywords. Since the keywords have meaning only in Java, but not in the bytecode, a naive decompiler might produce funny uncompilable code (imagine for (int for=if; for<while.length; for++) else.add(while[for]);) - of course JAD handles this by recognizing and renaming the members, so I really consider this a wasted effort.

Note: again, you will want to preserve the public interfaces, which is quite similar to the specification for the shrinking phase.

Flow Mangling

Since the decompilers recognize certain bytecode patterns as the result of a Java statement, the obfuscator can reorder these, yielding semantically equivalent bytecode which is impossible to map 1:1 to Java (JAD handles these with labels and goto). Also, I've seen obfuscated code using loops, breaks and exceptions to simulate IFs, but I'm not sure which obfuscator does these. (Zelix?)
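To give a feel for the "loops, breaks and exceptions instead of IFs" style, here is a hand-written source-level imitation. Real obfuscators do this on the bytecode and much more thoroughly; this `FlowDemo` class is only my illustration of the idea, not the output of any particular tool.

```java
class FlowDemo {
    // What the programmer wrote.
    static String plain(int n) {
        if (n < 0) {
            return "negative";
        } else {
            return "non-negative";
        }
    }

    // Same behaviour, expressed with a one-shot loop, a break and an exception.
    static String mangled(int n) {
        String result = "non-negative";
        do {
            try {
                if (n >= 0) {
                    break; // leaves the loop, keeping the initial result
                }
                throw new IllegalStateException(); // the "then" branch, in disguise
            } catch (IllegalStateException e) {
                result = "negative";
            }
        } while (false);
        return result;
    }
}
```

Both methods return the same value for every input, but the second one takes noticeably longer to read, which is the whole point.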

String Encryption

I think it was Zelix KlassMaster that could substitute each string with an encrypted version and insert code to decrypt them at runtime. This is a very effective measure as the strings usually give away a lot about what the code is doing (especially logging statements).
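The principle can be sketched in a few lines. I don't know what scheme Zelix actually uses; the single-byte XOR below is deliberately trivial and the `StringVault` class is my own invention.

```java
import java.nio.charset.StandardCharsets;

class StringVault {
    private static final byte KEY = 0x5A; // toy key, assumed for the example

    // Done at build time by the obfuscator: the literal is replaced by these bytes.
    static byte[] encrypt(String plain) {
        byte[] data = plain.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < data.length; i++) {
            data[i] ^= KEY;
        }
        return data;
    }

    // Inserted into the obfuscated class and called wherever the literal was used.
    static String decrypt(byte[] data) {
        byte[] copy = data.clone();
        for (int i = 0; i < copy.length; i++) {
            copy[i] ^= KEY;
        }
        return new String(copy, StandardCharsets.UTF_8);
    }
}
```

A statement like `log.info("Order rejected")` would then become something like `log.info(StringVault.decrypt(new byte[] {...}))`, so the text no longer shows up when somebody dumps the constant pool.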

Stack Map Generation (Preverification)

J2ME JVMs feature a simplified class-loading mechanism which requires each method to declare how much stack space it is going to use in the worst case. The J2SE JVMs are smart enough to do this at runtime, but this still slows down the classloading. ProGuard can generate the correct StackMap attributes for the obfuscated code, so for a slightly larger disk footprint you would get faster loading.

So that's about it. I figure that here is the place to throw in a couple of URLs:

  • ProGuard - free and quite decent. Nothing fancy.
  • yGuard - some people like its XML syntax. I don't think it's much different than ProGuard. The company producing it requires that you use it if you use their core product. It makes sense for them to want to take care about the actual protection of their IP.
  • JoGa was another tool that I used for J2ME, focused on bytecode optimization and had a nice GUI with many tweaks and gadgets. Unfortunately, it looks like the site is down.
  • Zelix KlassMaster - commercial - implements string encryption and advanced flow obfuscation (this is what JetBrains use for IntelliJ IDEA).

So all in all it took me about 6 hours to get everything obfuscated and in one jar. I had to disable the optimization and shrinking phase because I couldn't hunt down all the SWT JNI dependencies, still the resulting size was 2/3 of the original and the app was starting up noticeably faster.

The final touch was to wrap the single jar in a Launch4J binary launcher, so the user would need a resource editor to even get to the jar. Launch4J provides some small but nice features like JRE detection (from registry), JRE version checking, custom icon and Windows metadata.

Saturday, November 3, 2007

Build Tools

I had started writing a post explaining how the Maven repository, artifact resolution and build lifecycle work, but I figured that I'm repeating the Maven Book and the Bullet-point Guide. Instead, I think it would be more interesting to go down memory lane and talk about how my build tools have changed through the years.

Why Build?

I started programming on an Apple ][ using Basic, and at that time I stored my programs as source on a 5¼" diskette. Every time I wanted to run a program, I typed 'load MyProg <ENTER> run<ENTER>'. No compilation, no packaging, no complicated dependencies. It all Just Worked™.

Integrated Development Environments

Then, new machines came about, with bigger keyboards, bigger hard drives, bigger screens and (oh, horror) no built in Basic. At that time somebody told me that the Real Programmers don't use Pascal - somehow I failed to get the tongue-in-cheekness - it all made a lot of sense to me. Finally, after I couldn't find a Fortran compiler for IBM XT (actually Pravetz16), determined to become a Real Programmer I settled for QuickBasic. QB was very powerful - it had functions (which I thought were something like GOSUB but with a name) and you didn't have to put line numbers (still I didn't trust it, so I usually typed them in just in case). It also had a number of new commands, and almost none of the Apple II ones. Some time around 1992, I can't remember what happened, but I abandoned QBasic and joined the quiche-eating side of the power - I switched to Turbo Pascal 4.0.

Turbo Pascal was a big jump for me - it had interesting new abstractions like units and scoping (the latter being a useless feature that only stops you from seeing your own variables), but one notable feature was that one program could be spread over multiple files. At that time text UIs were all the rage, so I had my own library for drawing animated windows, menus, etc. The whole thing was one file and I wrote a couple of toy apps, each of them having its own copy of the Library (notice the capital letter here). Every day I wanted to show my mom and dad "what the computer can do" and I tried very hard to convince them to start using my expense-tracking app (needless to say, my attempts were futile... My sister had much better success at trying to get them to eat from her first cake).

At that time I didn't realize how much the IDE was doing for me - all I knew is that I press Ctrl+F9 and a couple of seconds later I get an EXE in the output directory. There was no packaging and I couldn't figure for the life of me why would anybody want to compile outside of the IDE.

Make?

As time went by, the IDEs changed (Turbo Pascal 5-7, Turbo/Borland C++ 2-4, Visual C++ 5-6), but my attitude stayed the same. Come summer 1999, I was working part-time as a developer in a small company and all of us ~20 developers were happily building release binaries with Visual Studio. There was a lone guy who tried to propose using an obscure utility called make. It looked like you had to write yet another program that would do what the IDE does, but you needed to use obscure syntax, call the "compiler" and "linker" directly, specify every command-line parameter and list filenames manually - it was a lot more work. The benefits that he tried to put forward weren't very convincing either: "you can build from the command line!" - countered by "and why would you want to do that?" or "people that don't use Visual Studio can build the project", retorted by "are you crazy? Everybody uses Visual Studio." Well, that guy was actually using VI... I think he didn't last very long there.

When I joined my next company I had to use Java. They were not using Visual Studio and in fact they didn't have a standard Java IDE. At that company they were building using make. Recursive make. Every directory had its own makefile, most of them containing only boilerplate code, including a toplevel template (quite annoying when you have to debug a build issue and count the number of '../' in the include), and to make it more interesting, some of the makefiles were not using the template, having their own goals, invoking OS commands, etc... Overall it worked (except when it didn't). Most of the problems we had were related to incorrectly set environment variables and missing external programs. It was difficult to reason about the build process as the build files were spread all over the directory tree. In the end, one of the developers rewrote the whole build using Ant.

Ant

Ant worked. Much better than make. Looking back, I can say that this was because:

  • Ant is much less dependent on environment variables. In that case there was a build.properties file that everybody had to customize once and that was it.
  • Ant does not use OS commands. Everything an Ant build needs is either provided by the distribution or shipped with the source files (you don't have the habit of plopping random jars in your $ANT_HOME/lib directory, do you?)
  • Ant's syntax is much more restricted than make's. A syntactically invalid Ant script wouldn't run; a syntactically invalid makefile can erase your hard drive.
  • Ant was designed for Java, handling many common tasks right out of the box.

Many detractors say that Ant is too verbose and they are right. I personally don't have big problems with this as my editor usually autocompletes the tasknames and the attributes for me and warns me when I make a mistake. The modern Ant (1.6+) also allows you to factor your build fairly well by using includes, presetdefs and macrodefs. Actually Ant's biggest problem is that it is Turing complete. The target's dependency resolution, combined with the if and unless attributes is often abused to simulate control-flow statements, which pollutes the target namespace and complicates the dependencies. Too often the targets don't have good names because their only purpose is to hold a piece of code reused in some other targets (actually this use-case is served better by macrodefs, but many people still use targets). The assign-once-ignore-following semantics of the Ant properties is good for implementing overriding, but when we use Ant as a language given the lack of scoping, the namespace gets polluted really quickly and you might end up having strange interactions between unrelated targets.
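The assign-once-ignore-following behaviour is easy to model if you haven't met it before. This is just a toy model in Java (the `OnceProperties` class is mine, not part of Ant), showing why it suits overriding and why it bites when you treat properties as variables.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model of Ant-style properties: the first assignment wins, later ones are ignored.
class OnceProperties {
    private final Map<String, String> values = new LinkedHashMap<>();

    // Returns true if the value was stored, false if the property was already set.
    boolean set(String name, String value) {
        return values.putIfAbsent(name, value) == null;
    }

    String get(String name) {
        return values.get(name);
    }
}
```

Set `build.dir` from the command line or a user properties file first and the default in the build script quietly loses, which is the overriding use case. On the other hand, two unrelated targets that both happen to use a property called `result` will step on each other, since there is only one global namespace and the second assignment is silently dropped.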

As an experiment I've tried using Ant tasks from Jython. It works great - you have real variables, real control structures, the code is much more concise and you can use any other Java library you want. One downside is that for straightforward builds (compile a bunch of files, package them in a jar and zip them with some scripts) Ant is arguably easier to read, as there are fewer things one needs to be aware of. But the real dealbreaker is that you don't get any tool support - no IDE autocompletion, no syntax checking on-the-fly, no integrated build runners, nothing!

One thing I didn't mention is that Ant is very easy to customize - extend a class, provide some getters and setters, write your logic in the execute method and you are done! To use your custom task, you need to ship your jar with the build script and add a one-line definition to your build. If you have more tasks you can package them together with a simple descriptor and import them all at once using a namespace (this is called antlib).
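The convention looks roughly like this. To keep the sketch free of dependencies it does not use Ant at all: a real task would extend org.apache.tools.ant.Task, and `EchoUpperTask` and `TaskRunner` below are stand-ins I made up to mimic how attributes end up in setters before execute is called.

```java
import java.util.Locale;
import java.util.Map;

// Stand-in for a custom task: one setter per XML attribute plus an execute method.
class EchoUpperTask {
    private String message = "";
    private String result;

    public void setMessage(String message) { this.message = message; }

    public void execute() { result = message.toUpperCase(Locale.ROOT); }

    public String getResult() { return result; }
}

// Simplified imitation of the build tool: maps attribute names to setters, then runs the task.
class TaskRunner {
    static void run(Object task, Map<String, String> attributes) {
        try {
            for (Map.Entry<String, String> attribute : attributes.entrySet()) {
                String name = attribute.getKey();
                String setter = "set" + Character.toUpperCase(name.charAt(0)) + name.substring(1);
                task.getClass().getMethod(setter, String.class).invoke(task, attribute.getValue());
            }
            task.getClass().getMethod("execute").invoke(task);
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot configure task", e);
        }
    }
}
```

So an element with a `message="hello"` attribute boils down to a `setMessage("hello")` call followed by `execute()`, which is why writing a task feels like writing a plain bean.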

Make!

A few years later, I was porting Java games for mobile phones from Doja to EzAppli and VSCL. I had common scripts for each of the platforms and every time I started a new port, I just had to tweak a template script containing an import statement and a couple of properties. If I needed to port the same game for another platform, all I needed to change was the include statement. That was nice.

One day I got a port for a new system - it was called BREW and the API was in C. Initially I considered writing some Ant tasks to handle the native toolchain, but after some consideration, I read a couple of articles (see Recursive Make Considered Harmful) and decided to give make another try.

One of the useful make features is the dependency inference rules. This way, you can say that a *.c file generates a *.o file. Then you just specify which *.o files your binary is comprised of and make will automatically guess your source files. If the source file is not newer than the object file, make is smart enough not to recompile it. The Java compiler does this by default (when using wildcards).
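In Java terms, the decision make takes for each inferred pair is roughly the following. The `UpToDate` class is only a sketch of the rule described above, not how make is implemented.

```java
import java.io.File;

class UpToDate {
    // Inference rule "*.c generates *.o": derive the object file name from the source name.
    static String objectFor(String source) {
        return source.replaceAll("\\.c$", ".o");
    }

    // Rebuild when the target is missing or the source is newer than the target.
    static boolean needsRebuild(long sourceModified, boolean targetExists, long targetModified) {
        return !targetExists || sourceModified > targetModified;
    }

    // Same check against real files, using their modification times.
    static boolean needsRebuild(File source, File target) {
        return needsRebuild(source.lastModified(), target.exists(), target.lastModified());
    }
}
```

Everything else make does is bookkeeping around this comparison: walk the list of object files, map each one back to its source and run the compile command only where the check says so.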

Make does not care about what commands you put in its goal definitions. That's why, out of the box it doesn't deal with transitive dependencies. To deal with this, most C compilers can generate a dependency listing in make format, which you can include in your makefile and regenerate when the dependency graph changes. In Java, the same thing can be achieved by using Ant's dependset task and some IDEs (like IntelliJ IDEA) can track all your dependencies (including transitives) as you type and recompile all impacted files.

In the end, I had a pretty well factored build system using make, requiring minimum configuration (much like the Java one), allowing for cross-compilation targeting x86 and ARM architectures, using different toolchains and everything. If I compare the Ant/makefile approach with the IDE, I'd say that the build scripts take more time to pay off. If you work on one project and your build is not complex and you don't need repeatable builds (because you work alone and your customer doesn't care), then the IDE might be a better proposition.

Shells, Perls and Pasta

Once again, I started a new job and it turned out that nobody in my department used a build tool. Everybody was building in their IDEs and copying straight to production, or using ad hoc shell or Perl scripts to build from the sources directly on the production box (the latter was rationalized as "this way we can fix bugs faster").

In the end, all scripts were simple compile+jar, sometimes even skipping the 'jar' step. They did get the job done and the business was happy. There were a number of things missing, like reproducible builds, reliable roll-back, etc., but it is a tradeoff whether one wants to spend the necessary time studying and implementing a build system or spend the same time implementing new functionality or fixing application bugs. There's nothing wrong with either way.

Maven

After spending some time working on an application with an Ant Build from Hell, I was dreaming of a brave new world, where each application would be laid out in modules and packages with controlled dependencies, each module's build script would be simple and clean and one could focus on the actual application functionality.

Enter Maven (actually Maven2). After being burned by Maven1, I still thought that the ideas were good, and it was the actual implementation that sucked so bad. Maven2 is a new start, and a new chance to reinvent the wheel. The project developers have taken the working concepts from Maven1, pruned the ones that turned out to be a bad idea, and reimplemented everything from scratch. It's still not clear why they decided to use their own DI container and classloader management (instead of, say, Spring and OSGi), but it works.

Maven has the chance to hit the sweet spot between a build-scripting tool and an IDE-style pure declarative build. At the core of Maven is the build lifecycle, which is just an abstract sequence of steps (phases). Then, in your POM you define (or inherit) a packaging. The packaging defines a set of default plugins and executions. You can think about the plugins as a bunch of Ant tasks (or 'goals' in mavenspeak), which are versioned together. The executions define parameters for the actual goal and are bound to a lifecycle phase.
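
To make the phase binding a bit more concrete, here is a sketch of a POM fragment that binds one extra goal to the package phase, on top of whatever the packaging already provides. The execution id and the echoed message are made up for illustration, and the target element assumes a reasonably recent antrun plugin (older releases used a tasks element instead).

```xml
<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-antrun-plugin</artifactId>
      <executions>
        <execution>
          <!-- hypothetical id, any unique name will do -->
          <id>announce-package</id>
          <!-- the lifecycle phase this execution is bound to -->
          <phase>package</phase>
          <goals>
            <!-- the antrun:run goal -->
            <goal>run</goal>
          </goals>
          <!-- parameters for this particular execution of the goal -->
          <configuration>
            <target>
              <echo message="Packaged ${project.artifactId}"/>
            </target>
          </configuration>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
```

Running mvn package would then walk the lifecycle up to the package phase and run this execution together with the ones the packaging bound by default.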

Most of the parameters in a goal are optional, using sensible defaults. The defaults are either constants or references to different parts of the POM. E.g. the compiler:compile goal would get the source directory from the POM reference ${pom.build.sourceDirectory} and use the constant "false" for its fork parameter. All the POMs (or their parent POMs) in Maven2 inherit from a common "Super-POM". The common POM specifies many defaults (e.g. the directory layout), so you don't need to, if you keep to the Maven conventions. An important part is complying as much as possible with the standard Maven Directory Layout - it makes your life much easier.
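
As an illustration of overriding such defaults, a fragment along these lines could go into the POM. The values are arbitrary examples picked for this sketch; any parameter left out keeps its default.

```xml
<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-compiler-plugin</artifactId>
      <configuration>
        <!-- override the constant default ("false") -->
        <fork>true</fork>
        <!-- language level to compile for (example values) -->
        <source>1.5</source>
        <target>1.5</target>
      </configuration>
    </plugin>
  </plugins>
</build>
```

The source directory is not mentioned here at all, so it still comes from the Super-POM.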

There are some areas in Maven that are still rough. The release plugin is still quite limited (although there is work under way to implement features like release staging and binary promotion). There are a couple of annoying bugs in the assembly plugin, which are fixed on the head, but the fixes have not been released for more than a year. Some issues (like the explicit support for aggregator plugins) are being postponed for Maven 2.1 (which will probably ship around Q2-3 of 2008). But overall, I think it is an improvement.

Conclusions

So, I'm planning to use Maven2 for the time being and perhaps write a plugin or two for some tasks which it does not handle well (right now I'm still cheating, using the antrun plugin). I'm still (ab)using Ant for common scripting tasks like restarting a remote server through an SSH connection, deleting files on a remote machine, setting up a database table or deploying configuration files in a remote environment. All these things do not fit in the build lifecycle and wouldn't benefit much from having Maven plugins written for them. The main benefit in this case is that I get a simple, completely cross-platform scripting language, providing many common commands lacking from the normal Unix environment (btw did I mention that expect sucks?)

And finally, here are some more tools that I'm planning to check out:

  • scons and rake - build tools using Python and Ruby respectively, each of them using the underlying platform and some clever code for doing build stuff
  • buildr - another Ruby tool that builds on rake, allows you to use Ant tasks, designed as a drop-in replacement for Maven2 (hopefully allowing for mixed environment).

Saturday, October 27, 2007

On Maven2

Maven is a tool with an interesting history dating back to 2001. In its first years it got a deservedly bad reputation for being an unstable, poorly documented and more or less experimental piece of work. The release of version 2.0 in 2005 fixed many of the early quirks and set right many of the short-sighted design decisions. After having some bad experience with Maven 1, I was wary of getting on the M2 bandwagon, but when I moved to a new job in 2006 I decided to give it a go. So far there have been ups and downs, but I'm fairly happy with it. I still haven't abandoned all my Ant and shell scripts, but I find that I'm using Maven as the primary build tool for most of my projects.

The core proposition of Maven is that one should be able to declare what they are building in some sort of manifest file and the build tool should be able to figure out how to build it. The manifest should contain only the information that is specific to the project and all the build procedures should be implemented as plugins. Each build should be related to exactly one artifact of a certain type. The artifacts are stored in repositories (more about this later.)

In Maven parlance, the manifest file is called the POM (which stands for Project Object Model). If a project adheres to a predefined filesystem layout, the actual XML one has to write can be very small. Here is a minimal example:

<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.acme.foo</groupId>
  <artifactId>foobar</artifactId>
  <version>3.14-alpha</version>
</project>
This configuration is enough to enable Maven to build, package, install, deploy and clean your project. It would even generate the reports site if you run mvn site. That's the way I usually start my projects. Later, you can tack on more elements as needed.

The top-level element declares the default schema for the POM. The schema has a version which has to match the modelVersion element below. While the schema is not strictly necessary, it makes editing the POM much easier if you are using an XML-aware editor. It is well constrained and there are few extensibility points, so the autocompletion works very well. It also features annotations for each element, which means that if you use IntelliJ IDEA you can just press Ctrl+Q on any element and get instant documentation. The schema design is a bit annoying, but it is very regular and easy to understand:

  • No attributes - everything is an element.
  • No mini-languages - everything is an element and any custom textual notations are avoided as much as possible (though in many places they use URLs). This allows for simple parsing and processing.
  • If an element can be repeated, it is enclosed in a container element that can appear only once. This ensures that all elements of one type are textually next to each other. Each container can contain only elements of the same type.
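
The dependencies section is a typical instance of these rules. The fragment below is only a sketch (the two libraries are arbitrary picks for this example): the repeatable dependency elements sit inside a single dependencies container, and every value is a child element rather than an attribute.

```xml
<dependencies>
  <!-- the container appears once and holds only dependency elements -->
  <dependency>
    <groupId>junit</groupId>
    <artifactId>junit</artifactId>
    <version>3.8.1</version>
    <scope>test</scope>
  </dependency>
  <dependency>
    <groupId>log4j</groupId>
    <artifactId>log4j</artifactId>
    <version>1.2.14</version>
  </dependency>
</dependencies>
```
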
Overall, the POM bears a resemblance to an IDE configuration file and they serve the same function. Both Maven and IDEs run a predefined build process, parameterized by the information in the project file (or POM). One major difference is that Maven is designed to be run from the command line and to encapsulate all environment-specific factors in the POM and the settings.xml files. You can use Maven to generate project files for IDEA, Eclipse and Netbeans based on the information in the POM.

The machine-specific and user-specific configuration is specified in the settings.xml files. There are two of them. The machine-specific settings are stored under the $M2_HOME/conf directory and apply to all the users on the machine. Usually the contents are standardized within the team (internal repositories, proxy settings, etc.) In our company, this file is posted on the wiki where everybody can download it. Alternatively we could have built our internal Maven distribution with the file pre-included. The second settings.xml file resides under ~/.m2 and contains user-specific settings overriding the machine-specific ones. One can use the user-specific settings to keep login credentials, private keys, etc. On a Unix machine, this file should be readable only by the user.
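
For illustration, a user-specific ~/.m2/settings.xml holding credentials might look roughly like this. The server id, user name and password are placeholders invented for this sketch; the id is assumed to match the id of a repository declared in the POM or in the machine-wide settings.

```xml
<settings>
  <servers>
    <server>
      <!-- hypothetical id, must match the repository id it applies to -->
      <id>internal-repo</id>
      <!-- placeholder credentials -->
      <username>jdoe</username>
      <password>changeit</password>
    </server>
  </servers>
</settings>
```

Since the password sits there in plain text, this is exactly the kind of file you want to keep unreadable for other users.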

Though the POM is very flexible and can be tweaked to accommodate a number of different scenarios, it is strongly recommended to refrain from overriding the defaults and to use the Maven conventions as much as possible. This way, new developers on the project can get up to speed faster and (importantly) it's much less likely that you get bitten by an untested plugin 'feature'.

One of the things that new users tend to dislike most is the standard directory layout. In brief, you have pom.xml in the root, and your files go under a directory called src. By default, all files (artifacts) generated during the build go under a directory called target, which makes for an easy cleanup. Note that there is no 'lib' as all the libraries reside in the local repository (more about this in another post.)

So far so good, but then under src we usually have main and possibly test, integration and site directories, then under them we have java and resources, and only there do we put the actual source files. This means that we have at least 3 directory levels used for classification above our sources, and if you jump between them using Windows Explorer or bash it makes for a lot of clicking/typing. On the other hand, this is the price one pays for Maven's magic - each directory level means something to the plugins that build your project. E.g. the unit test goal knows that it should run the tests under test and not the ones under integration. All the files in the resources directory are copied into the final JAR, while the ones under classes are not, and so on and so forth.

This post became rather long, so I'll finish it here. Next week I'm going to cover Maven's dependencies management and repository organization. Again I'll try to talk more about "why's" and less about the "what's" that are already covered pretty well in the following tutorials:

Friday, October 26, 2007

Ninjava Presentation

First Post!

A month ago I volunteered to give a presentation at Ninjava (a Java User Group in Tokyo). At first it looked easy - the topic was "Java Project Lifecycle" and that's something I've been doing for the last two years, so I should have a lot to say, right? Well, this actually turned out to be my first problem - after putting my thoughts on paper, I realized that there were too many things and there was no way that I could fit the whole talk in one hour.

Since everything looked relevant to the main topic and everything looked important, I figured that I needed to tighten the scope. I thought that I could focus on a minimal process and how we can use tools to automate chores like building, releasing and configuration management. In the end, I put together this plan to build a small application, starting with a domain layer (actually that might be too big of a word for a class multiplying two BigDecimals), test, add a command-line interface, release, add a simple GUI, release, show how the whole thing can be split into 3 modules and packaged in different configs - GUI only, CLI only or both. Finally I was going to add a feature on the head and backport it to a stable branch. I planned to use the following tools:

  • Subversion - version control system similar to CVS.
  • Maven2 - project automation tool from Apache. The tool is built around the POM (Project Object Model). The POM is an abstract model, but in practice, it is usually an XML document which describes information about the project. The POM does not describe what actions can be performed on a project.
  • Artifactory - repository management server for Maven2. Usually it's configured as a caching proxy in front of the internet repository, syndicating its contents with the internal repo. The deployment can be done either through WebDAV, HTTP PUT or a convenient web UI for manual deployment.
  • TeamCity (check out the EAP for the 3.0 release) - a continuous integration (CI) server from JetBrains. Apart from the usual watch-and-build functionality, it also features some interesting stuff like conditional commit (send the changes to the server which will commit them only if the build passes), a duplicates analyzer and a static analysis tool working on the AST level, plugins for many different IDEs, etc.
  • Jira - an issue tracker. Actually I should clarify - the best issue tracker I've used (especially with the Green Hopper plugin and the JIRA Client GUI)
  • Confluence - a damn good wiki. In our department it also doubles as a reporting tool.
  • Fisheye + Crucible - a server indexing your version control system, with web interface for browsing changesets, files, revisions, filtering by committer and full text search. Crucible is an add-on to Fisheye, that lets you select a bunch of files and create a code review ticket, adding inline notes. Multiple people can collaborate, posting replies to the notes and proposing changes until the ticket is resolved.

The main factor for choosing these tools was that I am currently using them either at work or in the development of the Mule JMX Transport (BTW, all of them have free licenses for open-source). The first rehearsal was one evening when I stayed after work and went through all the steps on my workstation - it took me about 40 minutes. I spent perhaps 10 of those looking up documentation on the internet, so I thought it was not too bad.

On the presentation day I took a day off from work, so I could have the time to set up Daniela's laptop as my primary demonstration machine (as I had already installed all the server apps on my poor Kohjinsha) and everything was going fine, until I started the dress rehearsal. After I chased Daniela out of the apartment, I went through the already familiar steps, this time talking as I would at the presentation. Oops... it turned out that 15 minutes were gone and I was still at the very beginning (and I didn't even manage to explain in detail what I was doing). Attempt #2 - this time I consciously tried to only say what I was doing and not what I was achieving or why I was doing it. This time the speed was much better, but I found out that I was getting these ~30 sec pauses when switching active files or when searching for a symbol, which were ruining the whole flow.

A quick look in Sysinternals' ProcessExplorer and ProcessMonitor showed that the working set of java.exe was 300 MB and it was paging like crazy. I stopped some services, closed JiraClient and all the other applications and tried again. This time it was better, but it was also clear that this was not going to work. I was at step 13 and I was well past 1 hour into the presentation... The wall clock was showing 4pm... It was too late to change the scope, so the only thing that seemed to make sense was to just go out, start talking and try to provoke questions, which would help me focus on something.

I showed up at Cerego around 7:30 (after meeting Nikolay at Shibuya station) and I was still thinking about which kind of excuse I should use when Peter proposed that I could use one of the machines in the conference room. There was nothing to lose, so I set out to set up an environment in the remaining 30 mins - I downloaded Maven, IDEA, JiraClient, JDK, set up environment variables and just in case checked out the JMX Connector. I tried to set up Jira Client against the JIRA instance running on the Kohjinsha, but for some reason it didn't resolve the host name, so I decided not to bother. By the time I finished there were about 10-15 people in the room.

I started with the usual stuff about how some tools work well in some situations and suck in others (except SourceSafe, which sucks in every situation); how the primary criterion in choosing tools is that they should help us get the job done in less time and with less risk and blah, blah... (more about this in a later post) and then I moved on to my current environment and to Maven. Then Curt had a question about the POM, then another and then just one more... well, I guess I have to thank him for this. Overall, we talked a lot about Maven, went quickly through TeamCity, just mentioned Crucible and didn't even touch most of the other stuff.

Now I realize that even my cut-down scope was huge and it was impossible to cover everything in one hour with enough detail. The other fundamental problem was that I started the presentation without having "something to sell". That's why the whole talk was lacking direction, and only when I started "selling" Maven did it start taking shape. It would have been better if I had picked Maven from the very beginning, so I could prepare more examples and not waste time with general stuff. In the end, people said they liked it and maybe some really did.

Yesterday it came to me that a blog would probably be a much better place for all the ramblings I wanted to share, so here we are. This is the first post. I'll try to publish at least one post weekly (hopefully this time I won't lose interest after the first few weeks).

About Me: check my blogger profile for details.

About You: you've been tracked by Google Analytics and Google Feed Burner. If you feel this violates your privacy, feel free to disable your JavaScript for this domain.