Friday, October 22, 2010

Groovy Classpath Scanner

I just wanted a simple classpath scanner in Groovy - no libraries, no extra jars, no callback interfaces. I couldn't find one, so I wrote my own. I'm posting it here in the hope that it will be useful to somebody. It is licensed under the established and permissive MIT License (if that precludes you from using it, let me know).

import java.util.zip.ZipFile
 
/**
 * <pre><code>
 * def cps = new GroovyClasspathScanner(packagePrefix: 'com.company.application')
 * cps.scanClasspathRoots(classLoader1) // optional
 * cps.scanClasspathRoots(classLoader2) // optional
 * ...
 * List<Class> classes = cps.scanClasses { Class it ->
 *    Event.isAssignableFrom(it)   ||
 *    Command.isAssignableFrom(it) ||
 *    it.isAnnotationPresent(MessageDescriptor)
 * }
 * </code></pre>
 */
class GroovyClasspathScanner {
  String packagePrefix = ''
  List<File> classpathRoots
 
  @SuppressWarnings("GroovyAssignabilityCheck")
  List<File> scanClasspathRoots(ClassLoader classLoader = null) {
    if (!classLoader) classLoader = getClass().classLoader
 
    def prefixPath = packagePrefix.replace((char) '.', (char) '/') + '/'
 
    // Collect the URLs from this class loader and all of its parents.
    List<URL> urls = []
    for (URLClassLoader cl = classLoader; cl; cl = cl.parent) {
      urls.addAll cl.URLs
    }
 
    def roots = urls
      .each { assert it.protocol == 'file' }
      .collect { new File(it.path) }
      .each { File it -> if (it.isFile()) assert it.name =~ /.*\.(?:jar|zip)$/ }
      .findAll { File it ->
        (it.isDirectory() && new File(it, prefixPath).exists()) ||
        (it.isFile() && new ZipFile(it).entries().find { it.name == prefixPath })
      }

    // Accumulate the roots, so that several class loaders can be scanned in turn
    // (as in the class-level usage example above).
    classpathRoots = ((classpathRoots ?: []) + roots).unique()
    return roots
  }
 
  List<String> scanClassNames() {
    if (!classpathRoots) scanClasspathRoots()  // populates classpathRoots using the scanner's own class loader
 
    def classNames = []
    // Extracts the fully-qualified class name from a directory path or zip entry name.
    def collect = { it, String pathProp ->
      def normalizedPath = it[pathProp].replaceAll('[\\\\/]', '.')
      def packageRegex = packagePrefix.replace('.', '\\.')
      def classRegex = "\\.($packageRegex\\..+)\\.class\$"
 
      def match = normalizedPath =~ classRegex
      if (match) classNames << match[0][1]
    }
 
    classpathRoots.each {
      if (it.isDirectory()) {
        it.eachFileRecurse             { collect it, 'canonicalPath' }
      } else {
        new ZipFile(it).entries().each { collect it, 'name' }
      }
    }
 
    return classNames
  }
 
  List<Class> scanClasses(Closure predicate = { true }) {
    // Classes that fail to load are reported and skipped before the predicate is applied.
    return scanClassNames()
            .collect { try { Class.forName it } catch (Throwable e) { println "$it -> $e" } }
            .findAll { it }
            .findAll { predicate(it) }
  }
}
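
For example, here is how the scanner might be used to list all Serializable classes under a package (the package name below is just a placeholder):

def scanner = new GroovyClasspathScanner(packagePrefix: 'com.company.application')
scanner.scanClasspathRoots(Thread.currentThread().contextClassLoader) // optional; the first scan falls back to the scanner's own class loader
List<Class> found = scanner.scanClasses { Serializable.isAssignableFrom(it) }
found.each { println it.name }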

Saturday, September 18, 2010

Concurrent development

Background Story

The project was in its fourth year and, after a few successful deployments in the US/LATAM regions, the company was trying to push it to EU and APAC. The original developers were all in New York and over the previous few months had been frantically working to adapt the product to the EU's requirements (Asia was supposed to follow). The scope was defined as 'whatever it takes'; the methodology was a mix of 'trial and error' and bullying the users into contorting the requirements to fit the delivered functionality. By that time, the EU team had realized that their time was better spent staying on top of the changes and making sure the end product met a minimum standard, so they were not really doing much development.

Time went on, requirements grew, scope shrank. The project slipped past two deadlines and finally the Asia managers decided they needed to take things into their own hands and hire a local development team, both to avoid the communication gap that plagued the EU rollout and to regain some control over the schedule. It was the first time more than one team was touching the code: the core team was structured so that each component had an owner and the owner could do whatever they wanted. If you had a problem or needed a change - ask the owner. The problem was not only that we were in a different geographical location and an inconvenient timezone, but that we were working on the same code, implementing requirements specified by separate BA teams, chasing schedules devised by separate project-management teams, and it all eventually converged in a common program-steering committee. I could go on, but suffice it to say it was quite a mess - the bottom line is that moving from a centralized, sequential development model to a distributed, concurrent one imposes a huge burden, and the best advice one can give you is "don't do it!".

Probably the biggest issue was that many people in the core team just refused to change their way of working to accommodate our existence. Every second morning the trunk would not compile, changes were often checked in that prevented the servers from starting, our changes were routinely overwritten because somebody's local modifications conflicted and they were unwilling to merge - you name it, we had it. The management layer was protecting them, as "due to the years of accumulated experience, the productivity of the core team was much higher than ours, and the productivity hit they would suffer by addressing our petty complaints could not be justified in business terms". Luckily, there were some sensible guys and gradually we got things to improve; still, I consider it one of the biggest organizational faults that for a long time the management effort was focused on suppressing our complaints rather than backing our suggestions for fixing the environment.

As the first QA delivery was approaching and the trunk was showing no signs of getting more stable, we tried to think what we could do to stabilize the codebase. Some people said we should branch; others were wary of the cost of merging. The EU team had branched a few months earlier and all EU implementation work was done on the branch and eventually (read 'sometimes') merged to the trunk. When the product was released in EU, they ended up with the problem of how to merge back to the trunk. From what I heard, it had been a terrible experience, including a lot of functionality rewrites, introduced bugs and regressions.

Knowing the EU problems, and knowing that on one hand the trunk was still changing rapidly while on the other hand our requirements depended on code that was supposed to be delivered by the US, we decided to branch but keep developing on the trunk. All merges would go in the trunk-to-branch direction, which would save us from the dreaded criss-cross merge conflicts. Since most of our problems to that date had been with work-in-progress checkins for changes that we eventually wanted, we decided we could treat the branch as a stable release line and the trunk as unstable, bleeding-edge code.

Unstable trunk + Release branch

I was tacitly elected 'merge-master' and quickly found myself following the same routine:

  1. Every morning I would pull a list of all the unmerged commits and review them in a text editor. Then I would sort each commit into one of these categories:
    • WANTED - changes that are required for, or prerequisites of, implementing our business functionality. These should always be merged.
    • BLOCKED - changes that we DO NOT want. These should always be marked as merged (no actual merging, just the mark, so they will not appear in the list next time).
    • IRRELEVANT - changes that won't hurt us, but that we don't strictly need. We merged these in the initial stages, as keeping the branch close to the trunk makes merging easier; as we got closer to the release, we flipped the policy to improve stability.
  2. When I merged or marked as merged the WANTED/IRRELEVANT/BLOCKED groups, I would put the category as the first word in the commit message. This made it easier to pick out the changes that were done directly in the branch (which should be kept to a minimum and, if necessary, ported manually to the trunk). I didn't bother separating the individual changes, since the branch was not meant as a merge source - this saved me some time. Overall it was taking between 1 and 3 hours a day.
  3. There would be a number of changes that didn't fit any of the categories. For these, I would contact the committer and follow up. Often it was work in progress; sometimes, after clarification, they would be categorized the next day. Usually I would post this communication as a tagged blog entry in our wiki. There was a page displaying all the entries tagged in this way.

I found that sorting the changes first by user and then by date simplified the review significantly. It turned out that TextPad macros can be a very powerful tool for things like this.
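
For illustration only, here is a minimal Groovy sketch of what step 1 could look like with Subversion; the tool and the repository URLs are assumptions, not necessarily what we used. It lists the revisions eligible for merging and pre-sorts them by committer and date:

// Sketch only: Subversion and the repository URLs are assumptions for illustration.
def trunk  = 'https://svn.example.com/repo/trunk'
def branch = 'https://svn.example.com/repo/branches/asia-release'

// Revisions present on trunk but not yet merged (or marked as merged) into the branch.
String listCmd = "svn mergeinfo --show-revs eligible ${trunk} ${branch}"
def eligible = listCmd.execute().text.readLines()*.trim().findAll { it }

// Pull the author, date and message of each eligible revision...
def entries = eligible.collect { rev ->
  String logCmd = "svn log --xml -${rev} ${trunk}"
  def log = new XmlSlurper().parseText(logCmd.execute().text)
  [rev: rev, author: log.logentry.author.text(),
   date: log.logentry.date.text(), msg: log.logentry.msg.text().trim()]
}

// ...and sort by committer, then by date, before dumping the list for review.
entries.sort { a, b -> a.author <=> b.author ?: a.date <=> b.date }
       .each { println "${it.author}\t${it.date}\t${it.rev}\t${it.msg.split('\n')[0]}" }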

The release branch worked well for some time, until a major feature for the next release was implemented on the trunk. We blocked it, and ever since then every commit that touched this dreaded component had to be hand-merged. Often, merging a one-line change resulted in tens of conflicts, so we resorted to rolling back the file in question and porting the change manually. The worst thing was that while the trunk was tested extensively, the changes in our release branch received only cursory examination until they reached QA.

Furthermore, once we reached the second phase of the Asia roll-out, our team split and started working in parallel on three staged releases, which were supposed to deliver unrelated functionality within 2 months of each other, starting 6 months from that date. This meant that we needed a better mechanism for dealing with a divergent codebase and big changes in progress.

Exchange-trunk + Development & Release branches per stream

After taking a step back, we came up with a new branching scheme that satisfied all our requirements. For each pending project phase we would create two parallel branches - development and release (we called the combination of the two branches a 'stream'). In addition, we devised the following policies and procedures:

  1. Developers always commit their code changes in the dev-branches.
  2. Any code committed to the dev-branch MUST compile. If the branch is broken, people should not commit further unrelated changes until the CI says it's fixed.
  3. Each commit in the dev-branch should contain work for a single feature. If some code pertains to two features, we pick one of them as primary and mark the other one as dependent in the issue-tracking system. Then all the shared code goes to the primary feature and we know that we cannot release the dependent one on its own. It is not necessary that the whole feature is committed in one go or that the code committed to the dev-branch actually works.
  4. When we need some code from a different stream, we wait until they publish it to trunk and only then merge from trunk to the dev-branch. Cross-stream merges are prohibited. We called this 'picking up' the feature. Pick-up changesets should be marked as such in the commit message.
  5. Each time we pick up a feature, after we do the minimum conflict resolution so the code works, we commit the changeset immediately (that's the pick-up changeset). This way, any additional enhancements, fixes, etc. will be committed in separate changesets, so it will be easier to merge them back to trunk later.
  6. Once a feature is complete and dev-tested on the dev-branch, all related changesets for that feature are merged as one consolidated changeset into the release branch. We call this 'feature promotion'. This practice makes creating release notes relatively easy and allows us to do cool things such as rolling back the whole feature with one command (a rough sketch of steps 6 and 7 follows this list).
  7. When we promote a feature that has been picked up from trunk, we immediately mark this rel-branch commit as merged into trunk to prevent double-merge conflicts. We would then check whether we had made any fixes on our branch and consolidate them into a single enhancement/bugfix changeset to be merged directly from the dev-branch to trunk (since in the rel-branch we consolidate the pick-up and enhancement changesets).
  8. If QA found that the feature did not work, we would add further bugfix changesets to the rel-branch, but we would strive to keep them to a minimum.
  9. When a release has passed QA, we would merge each feature-level commit that originated in this stream from the release branch to trunk ('publishing'). There it would be ready to be picked up by the other streams (which would merge it into their dev-branch, promote it to their release branch, and so on).
  10. For each release we would tag the release branch, since it was already stabilized. Bugfix releases were just further tags on that same branch. For urgent production changes, we would create a bugfix branch from the tag (happened only a few times).
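
To make points 6 and 7 more concrete, here is a rough Groovy sketch of a feature promotion; Subversion is an assumption, and every URL, path, revision number and feature ID below is made up:

// Sketch only: the tool (Subversion), URLs, paths and revision numbers are all assumptions.
def devBranch = 'https://svn.example.com/repo/branches/asia-phase2-dev'
def relWc     = '.'   // working copy of the matching release branch

// Small helper: run a command (given as an argument list) and print its output.
def run = { List<String> cmd -> println cmd.execute().text }

// 6. Promotion: merge all of the feature's dev-branch changesets into the release-branch
//    working copy and commit them as one consolidated changeset.
run(['svn', 'merge', '-c', '1201,1215,1230', devBranch, relWc])
run(['svn', 'commit', relWc, '-m', 'PROMOTED: feature ABC-123, consolidated from dev-branch'])

// 7. If the feature was originally picked up from trunk, record (without re-applying) the
//    promotion commit as merged into a trunk working copy, so it never causes a double merge.
run(['svn', 'merge', '--record-only', '-c', '1240',
     'https://svn.example.com/repo/branches/asia-phase2-rel', '/path/to/trunk-wc'])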

Overall it worked well for us. A few months after we adopted this scheme I moved to another company, but I really hope the process is still in use and being improved. An interesting thing is that every time I explain it, the first reaction is along the lines of "does it have to be that complicated?" And while I can agree that complicated it is, I have yet to find a simpler strategy that could work at this scale. Any ideas?

Friday, January 15, 2010

Setting up PuTTY to use public keys for authentication

I looked on the internet for a quick step-by-step guide on how to get PuTTY to use public key authentication with the OpenSSH daemon, and it took me some time to figure it out. I'm posting these instructions in case anybody else has the same need.

Prerequisites

Make sure that your OpenSSH configuration (usually /etc/ssh/sshd_config) contains the following line:

PubkeyAuthentication yes

In my case (CentOS 5.4) it was disabled by default. Remember to restart (or reload) the sshd service after changing the configuration.

Also, you will need the full PuTTY suite, which can be downloaded from here (get putty.zip).

Generating the key

Here is how to generate the key with PuTTY's key generator. Alternatively, you can generate it with OpenSSH's ssh-keygen tool and convert it to PuTTY format.

  1. Start PUTTYGEN.EXE
  2. In the parameters box at the bottom of the window, choose 'SSH-2 RSA' as the type of key and set the bit size to 2048.
  3. Click the Generate button and move the mouse over the blank area until the progress bar fills up.
  4. Enter your notes in the comment line (this is displayed to you when you use the key; you can change it later).
  5. Enter a pass-phrase; make it long and complex, write it down in a secure place or print it and hide it somewhere in your freezer.
  6. Save the private key (*.ppk) in a reasonably secure filesystem location. Even if somebody gets access to your private key, they will still need your passphrase to use it.
  7. Copy the text from the box labelled 'Public key for pasting into OpenSSH authorized_keys file:' and paste it as one line into a new file called authorized_keys (we'll use that later). The file should contain a single line terminated by a Unix-style newline, with no empty line after it.
  8. Close PUTTYGEN.EXE

Associating the key with your Unix account

  1. Login to your unix account
  2. Create a .ssh directory under your home if it does not exist
  3. Copy the authorized_keys file there
  4. Do chmod 700 ~/.ssh ; chmod 600 ~/.ssh/authorized_keys

This needs to be done for each machine you are connecting to. In this case it helps if your home directory is NFS-mounted.

Using the key directly

  1. Start PuTTY
  2. Specify user@host in the 'Session > Host Name' field.
  3. Specify the path to your private key file in the 'Connection > SSH > Auth > Private key file' box.
  4. Click the 'Open' button at the bottom of the PuTTY settings dialog.
  5. When prompted, enter your private-key pass-phrase and you will be logged in without entering your Unix password.

Setting up Pageant to cache the decrypted private key

Let's look at what we have done. The good thing is that our password no longer travels over the wire and is not susceptible to man-in-the-middle attacks. The bad thing is that where we used to enter the short and easy password of our Unix account, we now have to enter the long and difficult pass-phrase of our key every time we establish a new connection. To avoid this, we can use PuTTY's Pageant, which is an SSH authentication agent (the Unix equivalent is ssh-agent).

  1. Start PAGEANT.EXE
  2. Click on the computer-with-hat icon in your system tray.
  3. Choose the Add Key option and pick your private key (*.ppk)
  4. Enter your pass-phrase
  5. Close the pageant dialog

From now on, when establishing an SSH session, PuTTY will first try the decrypted keys held by Pageant and then fall back to password authentication if none of the keys match.

You can create a shortcut that starts Pageant and passes the paths to your keys as arguments. This will start Pageant and load the keys in one step, but you will still need to enter the pass-phrase every time you do this (typically after a system restart).

Keep in mind that Pageant holds the private key in memory unencrypted. If anybody captures a heap dump of the process, they can get access to your private key without knowing the pass-phrase. That's why you might want to stop Pageant if you are not going to use it for a while or if you share the machine in a multi-user environment.

If you use Pageant, you might also check the PuTTY option 'Connection > SSH > Auth > Allow agent forwarding', which will let you use your key from the remote machine you are logged into.

How fast an SSD drive do you need?

If you need an Intel X25-E 32GB SSD for 70% of the cheapest listed price (shipping extra), please let me know.

For a long time I thought that the bottleneck of all builds was the HDD, so when I got my new notebook, the first thing I did was add a spiffy extra Intel X25-E SSD to it. As I expected, the builds went much faster. To my surprise, the disk throughput stayed fairly low during the build, which suggests that the benefit of SSDs kicks in early and that buying higher-grade drives doesn't make much difference, as the bottleneck moves to the CPU quite quickly. All this makes sense, considering that a typical application has hundreds to thousands of files and a spinning drive spends a lot of its time seeking rather than reading.

When I cleared the IntelliJ IDEA caches and opened the IDEA Community project (900MB total, 66k files), one of the cores stayed pegged during the indexing; the read throughput did not exceed 4 MB/s for 4 minutes, then for 1 minute it rose to a peak of 25 MB/s, averaging I guess around 15 MB/s. The write throughput never exceeded 10 MB/s; for the most part it was below 4 MB/s, and in the last minute it was between 5 and 7 MB/s.

During the initial compilation, the 2 cores of the CPU (2.53GHz T9400) were quite busy, staying above 80% the whole time, while the disk read throughput stayed below 4 MB/s with occasional peaks at 6 MB/s. The write peaks were 5 MB/s, most of the time below 1 MB/s.

At the end of the compilation, the index update took 50 seconds, with an average read throughput of ~10 MB/s, peaking at 30.5 MB/s; the writes peaked at 7 MB/s. During that time the CPU utilization dropped to around 50%, which suggests that IDEA's indexing does not use both cores.

The bottom line: it's not worth buying expensive SSDs for consumer use - cheaper ones are just as good for the average home and software-development workload. Most applications do not transfer huge volumes of data, and the slowness of spinning-platter HDDs is mostly down to seek times and fragmentation. Expensive SSDs are warranted if you work extensively with media files or process huge amounts of data on disk.


This work is licensed under a Creative Commons Attribution 3.0 Unported License.