April 2005 Archives

It's *your* photo, Ralph...

From the sublime to the ridiculous. It's not bad enough that record labels are trying to lock up CDs so you can't media shift. Now it seems you may not even have unfettered access to your own digital photos.

It seems that Nikon's proprietary NEF image format (their own flavour of RAW) actually encrypts photo metadata. Their explanation for this approach seems couched in marketroid terms, mostly devoid of rational reasoning.

They say that this encryption approach "protects the uniqueness of the file". What on earth does that actually mean? For some reason they need to ensure the "preservation of [Nikon's] unique technology".

But wait a minute. I buy a camera. I take a photo. The photo is mine. The data is mine. The aperture, shutter speed, white balance, whatever settings I used to take the photo - all mine. So why is Nikon so intent on locking away this information in a proprietary format?

The files can only be read using a proprietary SDK. You have to apply to Nikon to obtain it, and must be authorised to use it. Otherwise - you can't read your own photos in the NEF format. (Yes, you can use JPG but the quality is reduced.)

How is this good for the consumer in any conceivable way? Only authorised "bona fide" developers are able to obtain the SDK. But what is the license? Will I be able to develop Free/Open tools (like a Gimp plugin) with it? I doubt it. What platforms does the SDK run on? Will I be able to port it to my own preferred operating system?

It seems that Nikon are simultaneously restricting the consumer's choice about tools and platforms, while locking them in to using certain "blessed" proprietary software to read their own data.

It's your photo, Ralph.

Update: It looks like a clever developer has reverse engineered the NEF white balance encryption. But as the article mentions, there is the potential threat of the DMCA now (which is what put Adobe off trying this themselves). It will be interesting to see how Nikon reacts to this.

Bug of the Year

I discovered the weirdest bug the other day...

I had built a new version of the imaging system, tested it and was ready to deploy it. It was working great on the development box, and it seemed ready to go. I downloaded the new version on the target machine, and fired it up, expecting to have only a few minutes downtime.

Floating point exception

Oops.

But it wasn't my fault, honest! Actually my first inclination was to blame one of the new packages I had installed. One of the new features was to plot some statistics, and I was ready to put it down to a conflict or some weirdness with the new plotting package. I checked everything three ways, and it was the right version, had all the right dependencies... It had to be in my code - but where?

I cranked up the debug level in all modules (and was very glad at this point that I had decided to use the logging framework liberally, and with some granularity, and leave it in the production system). I re-ran it and it seemed like the code was failing in a drawing routine for a custom widget. This was extra weird, as I hadn't changed that module for a while, so it was the same as was running perfectly before.

I then resorted to the age-old technique of adding print statements at various stages of the code, to see how far it was getting. Sure enough, in the middle of the on_draw() method, in amongst some OpenGL calls, it was raising the SIGFPE. I should mention at this point that this module was coded in Python, so any floating point errors would raise a real Python exception and display a nice stack trace showing me precisely where my error occurred. The fact that the error was the most generic message, obviously coming from the C runtime (and calling abort() rather rudely) made me wonder...
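For contrast, here is a minimal sketch of how floating point trouble normally surfaces in pure Python code: as an ordinary exception you can catch and trace, not as a process-killing signal. The helper name describe_fp_error is made up for illustration and is not part of the imaging system.

```python
import math


def describe_fp_error(operation):
    """Call a zero-argument function and return the name of the exception
    it raises, or None if it completes normally."""
    try:
        operation()
    except ArithmeticError as exc:  # covers ZeroDivisionError and OverflowError
        return type(exc).__name__
    except ValueError as exc:       # math domain errors, e.g. sqrt of a negative
        return type(exc).__name__
    return None


if __name__ == "__main__":
    for label, op in [
        ("divide by zero", lambda: 1.0 / 0.0),
        ("overflow", lambda: math.exp(1000.0)),
        ("domain error", lambda: math.sqrt(-1.0)),
    ]:
        print(label, "->", describe_fp_error(op))
```

A bare "Floating point exception" with no traceback is therefore a strong hint that the fault is in native code underneath the interpreter.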

I stuck a premature return in the drawing code, and everything worked fine from then on. But certain other widgets, also using OpenGL, could be made to trigger the error under certain circumstances. Weirdness. Time to fire up gdb.

I ran Python under the debugger, and ran the program (which also includes a bunch of custom Pyrex modules, which were not above suspicion). The first thing that happens is a dump from a SIGFPE, but this was in the i830_dri.so module, in some __driCreateScreen() call. I had actually noticed this once or twice before, but it had never caused a problem and so I had blissfully ignored it. But now some alarm bells were going off.

I reverted the production machine to the previous release (thank goodness for Subversion and half-decent configuration management!) and went back to the office to track the problem down in detail on the test rig (which is identical to the production machine).

It turns out that the i830 video driver raises some floating point exceptions under certain OpenGL operations. AFAICT this is a bug, but it hadn't caused a problem before, and hadn't been a problem on the dev box (which, as it turns out, uses the i810 driver, which is why I hadn't noticed it before). But why was it failing now and not before? The plotting library matplotlib I had installed needs a linear algebra package to run, and it can use either Numeric or numarray. I chose the latter because it was newer. It seems that numarray actually installs an FP signal handler, so when I added it to the mix, it was picking up on the driver FPEs as well.

After much binary partitioning (and a lot of hackery) I narrowed down the problem to the essentials, then wrote the most basic code that demonstrates the problem. It is just a simple OpenGL program that displays a triangle, but if you so much as import the numarray module, you get an FPE in the drawing code. Now I have to figure out where to report the bug! (I suspect it is squarely the fault of the X i830 driver...)

The solution for the time being was to use Numeric for the linalg stuff, and I could put the plotting features back in. This would give me time to chase up the driver bug.

The moral of this story: always test on a test rig that is the same as the production machine!

Oh, and also - logging is your friend! It is definitely worth the extra few cycles to have granular, detailed logging available at the flick of a switch when things start going wrong. Use a proper logging framework, not just printf, so you can log to a file, syslog, or whatever.
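As a rough illustration of what I mean by granular logging, here is a minimal sketch using Python's standard logging module. The function name configure_logging and the module names "imaging.widgets" and "imaging.capture" are invented for the example; the point is only that each module gets its own named logger, so debug output can be switched on for one module without drowning in everything else.

```python
import logging
import logging.handlers


def configure_logging(logfile, verbose_modules=()):
    """Send INFO and above to a rotating log file, and enable DEBUG
    only for the named modules. Returns the file handler."""
    root = logging.getLogger()
    root.setLevel(logging.INFO)

    handler = logging.handlers.RotatingFileHandler(
        logfile, maxBytes=1_000_000, backupCount=3
    )
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
    )
    root.addHandler(handler)

    # Per-module switch: only these loggers let DEBUG records through.
    for name in verbose_modules:
        logging.getLogger(name).setLevel(logging.DEBUG)
    return handler
```

Each module then does log = logging.getLogger("imaging.widgets") (or whatever its name is) and calls log.debug(...) freely. Swapping the file handler for logging.handlers.SysLogHandler would send the same records to syslog instead.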

Optimize this...

I'm working on an industrial image processing system, and it's coming together pretty well. I hadn't tried optimising anything up until now, as I wanted to get all the features implemented and do some profiling. "Premature optimisation is the root of all evil", as they say...

But as of the last few days, the system has reached the point where most of the key features are working on the imaging unit, certainly enough for meaningful performance tests. And since it is a near real-time system, performance has always been a key concern. A good excuse to start hunting for some optimising opportunities...

Classifier

The target for a full processing cycle is 100-120ms, although I'm aiming for more like 90ms. Without any optimising whatsoever, the processing cycle was taking far too long at something over 200ms. One stage, the classifier, was taking just over 30ms per frame, which seemed excessive considering its primary role. I started looking at the main loop, and it became clear that there was a lot of extra code calculating stuff only used during calibration. Making that optional saved 5ms per frame.

Looking at the guts of the inner loop, there staring at me was the next obvious candidate: a function call, used to retrieve the YUV pixel (or YCbCr CCIR-601 if you prefer) components from the packed YUV 4:2:2 frame. I rewrote the pixel access as a macro, and got it down to 15ms. Nice, but I'm sure I can do more.

At this point I realise that I haven't even tried the obvious thing of adding '-O3' to the compiler flags. I rectify this oversight, and this gets us down to 10ms.

I restructure the loop some more, and remove even the test for the calibration code from the inner loop, so all it is doing is the lookup table. I also remove a histogram calculation that I didn't end up needing to refer to, and now get it down to around 4ms.
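To give a feel for what "all it is doing is the lookup table" means, here is a minimal sketch in Python (the real inner loop is compiled code, and this is not it). Several things here are assumptions made for illustration: the UYVY byte order (U, Y0, V, Y1), a table indexed by chroma only, and the names build_table and classify_frame.

```python
def build_table(is_target):
    """Precompute a 256x256 chroma table: table[(u << 8) | v] is 1 when
    is_target(u, v) is true, else 0. Done once, outside the frame loop."""
    return bytearray(
        1 if is_target(u, v) else 0 for u in range(256) for v in range(256)
    )


def classify_frame(frame, table):
    """Classify a packed 4:2:2 frame with one table lookup per pixel pair.

    Assumes UYVY byte order: each 4-byte group is (U, Y0, V, Y1), and the
    two pixels in a group share the same chroma. Returns one class byte
    per pixel.
    """
    out = bytearray(len(frame) // 2)
    for i in range(0, len(frame) - 3, 4):
        label = table[(frame[i] << 8) | frame[i + 2]]
        j = i // 2
        out[j] = label
        out[j + 1] = label
    return out


if __name__ == "__main__":
    # Hypothetical rule: "reddish" chroma counts as the target class.
    table = build_table(lambda u, v: v > 160 and u < 120)
    frame = bytes([100, 50, 200, 60, 128, 70, 128, 80])
    print(list(classify_frame(frame, table)))
```

All the expensive decision logic lives in build_table, so the per-frame loop has no function calls per component, no calibration test and no histogram, just index arithmetic and a lookup.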

I was reasonably happy at this point, as the code was still perfectly readable and modular, and I didn't have to resort to assembler, obfuscation or kinky tricks. There's one more thing I can do that might shave off a few cycles, but I had to move on to the next target.

Capture

We are using Firewire cameras, specifically the Basler 601fc. The state of IEEE 1394 support under Linux is somewhat in flux at the moment, and I wrote a layer to abstract away the details of Firewire camera control. This not only simplified the layers above, it made us insensitive to which particular underlying library we would ultimately use (currently libdc1394). It also made it nice and easy to write a Python wrapper (using Pyrex of course).

When I first wrote this fwcam layer, I used the simpler of two capture models, and once it was working, left it at that. But when I added a high resolution timer to profile the image capture code, I discovered that just the capture was taking over 40ms. This seemed excessive too, so I went back to re-read the docs and code and see how I could speed things up. It turns out that there was a DMA version of the capture function, and although it required some extra buffering, it wasn't much effort to change the fwcam layer to take advantage of it.

I was rather shocked to see, once I had rebuilt the library, the capture time was down to under 2ms! I was expecting a big difference, but even this much was a surprise. Another nice feature is a non-blocking variant of the DMA capture call that could return immediately if no frame was available. This would prove useful in the async triggering protocol.

But wait - there's more...

There are many more opportunities, I'm sure. Through some careful analysis, strategic placement of some high-resolution (microsecond) timers, and restructuring and refactoring of the code, I managed to shave around 26ms per frame off the classifier (from just over 30ms down to around 4ms), making it about 8 times faster. But the most important thing is that the code is still readable and maintainable, not some mess of macros and funky shortcuts nobody else but me can grok.
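The timers themselves are nothing fancy. Here is a minimal sketch of the idea in Python, using time.perf_counter from the standard library; the names stage_timer and report are invented for this example and are not the ones in the real system.

```python
import time
from contextlib import contextmanager


@contextmanager
def stage_timer(name, results):
    """Time the enclosed block and append the elapsed milliseconds to
    results[name], even if the block raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        results.setdefault(name, []).append(elapsed_ms)


def report(results):
    """Return the mean time in milliseconds for each named stage."""
    return {name: sum(times) / len(times) for name, times in results.items()}


if __name__ == "__main__":
    results = {}
    for _ in range(5):
        with stage_timer("classify", results):
            sum(x * x for x in range(10000))  # stand-in for real work
    print(report(results))
```

Wrapping each stage of the processing cycle this way gives a per-stage breakdown, which is what pointed me at the classifier and the capture call in the first place.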

Of course, I have been looking at the MMX extensions in GCC recently, so... :)

Why I love/hate Gtk+

There are many nice features in Gtk+, and recent versions are particularly noticeable for the good collection of clean, stylish widgets.

The API is pure C, chosen so bindings for other languages (such as Python, Haskell, Scheme, etc) are as easy as possible. (This would be more difficult with C++.) So the code is written in an "objecty" style, even though it is plain C. Since v2.0, there is even GObject, which has made possible all sorts of improvements.

The Python Gtk+ bindings are particularly well done (thanks to JamesH) and make writing Gnome apps in Python a breeze.

But there are certain anachronistic aspects of Gtk+ where some X cruftiness has been exposed. Based on my experience thus far, it seems to be mainly concentrated around the imaging model. X has a notorious imaging model, in part due to its client/server model (which rocks) but also due to various creeping features over the years. This can be both a good and a bad thing, and unfortunately the documentation on this is not particularly helpful.

Since most of my work is in the area of image processing, I need to do a lot of graphics work, fast screen updates and that sort of thing. And every time I have to do graphics in Gtk+ I cringe, due to all the faffing about you have to do just to achieve something simple.

For example, to draw a coloured box, you need to create a GtkDrawingArea widget, then do the drawing in its expose-event handler (why not "draw" instead of "expose"?). Inside the handler, you need to get a context on the widget, then create a colour, then allocate the colour in the widget context, and only then can you actually draw something. So colour management is my first beef; it is way more effort than it should be. Ideally you should just get a context then set the active colour; if some allocation has to go on underneath, hide it from us! It is fairly safe to assume a 24-bit display these days, and if you don't have one, X should handle the dithering.

Then there's images. Or rather, there's Pixmaps and Images, and Drawables. Getting data into and out of these things is obscenely harder than it should be. We have been blessed with libtiff, libjpeg and libpng for ages now, but the docs seem to all embed ASCII-fied XPM bitmaps into the source - who does that anymore? Pixmaps and bitmaps have slightly different APIs, you have to convert between things to draw, and other stuff 'n' nonsense.

It should be fairly straightforward to allocate an image of a given resolution and bit depth, and be able to bit-twiddle directly in the buffer. Then when you're ready, blit it to the screen as fast as the hardware permits.

I've found it is far easier to use OpenGL for my custom drawing (specifically GtkGLDrawingArea) than deal with the messiness of the above. But even getting this working with Gtk+ is fraught with problems. For a start, you frequently need a fixed size canvas when doing imaging. But you can't allocate a widget with a fixed size in Gtk+ - you can request a size, or set a minimum, but a widget's parent is responsible for sizing its children, so you have to do some funky shuffling to get things sized and laid out properly.

Everything went Pear shaped

Last week, the fruit imaging system went live for the first time. It took a few days to iron out all the little deployment problems, and by Tuesday we had it up and running in the factory.

By the end of the second day, it had processed over 110,000 units. Judging by the stats I was collecting, the percentages look like they are tracking the expected values within 5-10%. Given the heuristics are still not finely tuned, it is going surprisingly well from what I can tell.

So a huge congratulations and thanks to Ray, James and Denys, the electronics and control systems wizards behind the project.

That is the first major milestone; a few more to come!

Validity Constraints

I was doing this stuff a few years ago: Using Annotations to add Validity Constraints to JavaBeans. And it wasn't really new then either, as best I recall.

We were building a simulation system for water flow, and various modules had parameters with different units, valid ranges, constraints and all sorts of things. So inspired by an earlier project I had worked on, I decided to add tags in the javadoc blocks to declare these constraints and so on. We then wrote a javadoc plugin to process them and generate extra code to handle the validation.

Gee, maybe I should have patented it?

It can be an incredibly useful way of doing things. I am a big fan of using declarative styles where possible. And now that Python has decorators, the possibilities are amazing...
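Here is a minimal sketch of what the declarative style could look like with a Python decorator. This is not the javadoc-based system described above; the decorator name constrain, the function mean_velocity and the ranges are all invented for illustration.

```python
import functools
import inspect


def constrain(**ranges):
    """Declare an inclusive (low, high) range for each named argument.
    Calls with out-of-range values raise ValueError."""
    def decorator(func):
        sig = inspect.signature(func)
        unknown = set(ranges) - set(sig.parameters)
        if unknown:
            raise TypeError("no such parameter(s): %s" % ", ".join(sorted(unknown)))

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            bound = sig.bind(*args, **kwargs)
            bound.apply_defaults()
            for name, (low, high) in ranges.items():
                value = bound.arguments[name]
                if not low <= value <= high:
                    raise ValueError(
                        "%s=%r is outside [%r, %r]" % (name, value, low, high)
                    )
            return func(*args, **kwargs)

        # Keep the declaration around so tools (or a GUI) can inspect it.
        wrapper.constraints = dict(ranges)
        return wrapper
    return decorator


@constrain(flow_m3s=(0.0, 500.0), area_m2=(0.1, 100.0))
def mean_velocity(flow_m3s, area_m2):
    """Mean flow velocity in m/s for a given flow rate and cross-section."""
    return flow_m3s / area_m2
```

The constraints sit right next to the function they describe, and because they are kept as data on the wrapper, the same declaration could drive validation, documentation or a parameter editor.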

Does not compute

I love reading stuff like this: 13 things that do not make sense (New Scientist).