01 July 2009

Interesting Web Service Bug

The other day we found an interesting bug in our use of a SOAP web service. Of course, we only found it after it had totally fouled things up for two days, leaving the whole team two more days of cleanup.

I'll tell you up front what was wrong (but skip to the next paragraph if you want a mystery). The two ends of a web service (client and server) had different ideas about the allowable range of an integral value passed back and forth.

The key thing you need to know is that our application makes a call to an operation on a web service at another company. We defined the service and had them implement to our WSDL. A list of data elements builds up in their system until we ask for it. The web service provides us with a block of those elements that we then process.

When we make a request, the data comes in a block along with a 'handle.' The handle is defined as an xsd:int element in the schema. We process the data. It takes about 5 minutes, usually. Then we call another operation on their web service, passing back the handle, to tell them that the block they gave us with that handle is done. They can then remove it from the list of elements that are waiting to come to us.

Then we loop around, request the next block of data, and go through the same process again. When there is no more data to get, they return an empty block. When we see the empty block, we know there's no more data for today. So, we wait until tomorrow and start over.
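The request, process, acknowledge cycle described above can be sketched roughly as follows. This is only an illustration: the names Block, BlockService, fetchNextBlock, acknowledge, DailyPull and FakeBlockService are all made up for the sketch, since the real WSDL operations are not shown here, and the in-memory fake merely stands in for the other company's system.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical names throughout; the real WSDL operations are not shown in this post.
final class Block {
    final int handle;            // the handle is held in a Java int on our side
    final List<String> elements; // stand-in for the real data elements

    Block(int handle, List<String> elements) {
        this.handle = handle;
        this.elements = elements;
    }

    boolean isEmpty() {
        return elements.isEmpty();
    }
}

interface BlockService {
    Block fetchNextBlock();       // returns an empty block when nothing is waiting
    void acknowledge(int handle); // tells the server the block with this handle is done
}

final class DailyPull {
    // Runs one day's cycle and returns every element that was processed.
    static List<String> run(BlockService service) {
        List<String> processed = new ArrayList<String>();
        while (true) {
            Block block = service.fetchNextBlock();
            if (block.isEmpty()) {
                return processed; // empty block: no more data for today
            }
            processed.addAll(block.elements);  // stand-in for the real processing
            service.acknowledge(block.handle); // server may now drop this block
        }
    }
}

// In-memory stand-in for the remote service, for illustration only.
final class FakeBlockService implements BlockService {
    private final Deque<Block> waiting = new ArrayDeque<Block>();

    FakeBlockService(List<Block> blocks) {
        waiting.addAll(blocks);
    }

    public Block fetchNextBlock() {
        Block next = waiting.peekFirst();
        return next != null ? next : new Block(0, new ArrayList<String>());
    }

    public void acknowledge(int handle) {
        Block next = waiting.peekFirst();
        if (next != null && next.handle == handle) {
            waiting.removeFirst();
        }
        // An unrecognized handle is silently ignored, so the same block is served again.
    }
}
```

Note that in this sketch nothing on the client side checks that an acknowledgement was accepted. If the handle sent back does not match, the loop simply receives the same block again.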

The bug was rooted in a misunderstanding about what the allowable values were for the handle. When they sent us a block, they got a unique number out of their database. I don't know what range of integral values that data element allowed, but it was larger than the Java 'int' that we were using to hold the handle. Further, I haven't looked up whether an xsd:int in a schema in a SOAP WSDL has a defined range anyway. So I don't know who had the range wrong, if anyone. Truly, I don't care, but don't tell anyone.

So, everything ran fine for years. Literally!

Then, one day, they sent us a number larger than the largest allowable positive Java 'int'. Their datatype had been converted to a string in the XML of the SOAP message. It was just a sequence of digits. Our web service framework (Axis 1.3) converted it to a Java 'int' and came up with a negative value. (The framework code converting it did one last mathematical operation that overflowed the signed 32-bit number.)
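To make the wraparound concrete, here is a small sketch of how a run of digits can turn into a negative 'int'. The naiveParse method is a hypothetical digit accumulator written for this illustration. It is not the Axis code, which I have not reproduced here; it only shows how one last multiply-and-add can overflow a signed 32-bit value.

```java
final class HandleOverflowDemo {
    // Hypothetical digit accumulator with no overflow check.
    // This is NOT Axis's code; it only illustrates how 32-bit arithmetic wraps.
    static int naiveParse(String digits) {
        int value = 0;
        for (int i = 0; i < digits.length(); i++) {
            // The final multiply-and-add silently wraps once the value passes 2^31 - 1.
            value = value * 10 + (digits.charAt(i) - '0');
        }
        return value;
    }

    public static void main(String[] args) {
        String handle = "2147483648"; // Integer.MAX_VALUE + 1, as decimal text

        System.out.println(naiveParse(handle));           // wraps to a negative number
        System.out.println((int) Long.parseLong(handle)); // a narrowing cast wraps too

        try {
            Integer.parseInt(handle); // the JDK parser refuses instead of wrapping
        } catch (NumberFormatException e) {
            System.out.println("Integer.parseInt rejects " + handle);
        }
    }
}
```

The contrast with Integer.parseInt is the interesting part: a conversion that throws would have stopped the first bad block, while one that wraps quietly hands back a plausible-looking number.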

When we sent the value back to them, they didn't recognize it as any existing handle, so they ignored it. (I suppose they might have logged an error or even started the red lights flashing all over the developers' cubes. I don't know.) The block of elements they sent us was not removed from the list. When we asked for the next block, we got the same elements again.

This repeated processing of the same elements happened repeatedly and repeatedly and repeatedly for 17 hours at 10-minute intervals before we got it stopped. (Things like this never happen until you've just left the office for the day.) It left us lots to clean up. (I'm just glad it didn't start happening on Friday evening.)

You might wonder about the results. Well, some people got confused for a few hours when they saw duplicate data. We quickly got things fixed so that the users no longer saw any bad data. Pretty soon we got our integer ranges aligned with the other company and the problem quit happening. A few days later it was just a bad memory.

Anyway, it seemed interesting to me that a simple thing like a misunderstanding of the data range of integral values in a web service hid away for years and then rose up to bite us in such a big way.

It just underscores the need to define more than the interface for a web service. You have to define what the data means and what the allowable values are if you want to cover all the bases, so to speak.
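One way to pin down "allowable values" in code is to check the handle against an explicitly agreed range before trusting it. The sketch below assumes, purely for illustration, a contract of 1 through Integer.MAX_VALUE. The real agreed range is not stated here, and HandleRange and parseHandle are hypothetical names. (For reference, XML Schema's built-in int type is defined with the same signed 32-bit range as a Java 'int'.)

```java
import java.math.BigInteger;

final class HandleRange {
    // Assumed contract for illustration only: handles run from 1 to Integer.MAX_VALUE.
    static final BigInteger MIN = BigInteger.ONE;
    static final BigInteger MAX = BigInteger.valueOf(Integer.MAX_VALUE);

    // Parses the handle text and fails loudly if it falls outside the agreed range.
    static int parseHandle(String text) {
        BigInteger value = new BigInteger(text.trim()); // NumberFormatException if not a number
        if (value.compareTo(MIN) < 0 || value.compareTo(MAX) > 0) {
            throw new IllegalArgumentException("handle out of agreed range: " + text);
        }
        return value.intValue(); // safe: the range was checked above
    }
}
```

Parsing through BigInteger means the check itself cannot overflow, whatever the other side sends.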

It also underscores the difficulty in testing for this sort of inter-system bug. I'm not sure how one would create a "unit" test that would just test this sort of interface. There are two systems. Two sets of developers. Two companies. (Two countries, actually.) And little shared information on either system's internal workings.
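A partial answer, covering only our own side of the wire, might be a boundary-value check: push handle text at and beyond the 32-bit limit through whatever conversion we use, and flag any value that comes back changed. This is a sketch under assumptions. HandleRoundTripCheck and HandleCodec are hypothetical names, the boundary list is my own choice, and it says nothing about what the other system does with the values.

```java
import java.util.ArrayList;
import java.util.List;

final class HandleRoundTripCheck {
    // Stand-in for the conversion under test: wire text -> in-memory value -> wire text.
    // Hypothetical; plug in the real conversion path here.
    interface HandleCodec {
        String roundTrip(String wireText) throws Exception;
    }

    // Returns the boundary values that come back changed after a round trip.
    static List<String> failures(HandleCodec codec) {
        String[] boundaries = {
            "0", "1", "2147483647", "2147483648", "4294967295", "9223372036854775807"
        };
        List<String> failed = new ArrayList<String>();
        for (String wire : boundaries) {
            try {
                if (!wire.equals(codec.roundTrip(wire))) {
                    failed.add(wire); // silently changed: the dangerous case
                }
            } catch (Exception e) {
                // A loud rejection is treated as acceptable in this sketch,
                // because it cannot lead to acknowledging the wrong handle.
            }
        }
        return failed;
    }
}
```

For example, passing a codec such as `s -> String.valueOf((int) Long.parseLong(s))` reports every boundary value above Integer.MAX_VALUE, while a codec that keeps the value in a long reports none.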