Saturday, March 6, 2010

I hate daylight savings time

I just found one of the most mystifying bugs I have had to deal with recently.

Once in a veeery long while, in very rare cases, the analysis of a Wikipedia page would break for no clear reason after going through a few thousand revisions.
One of these was the page on Captain America.

I looked at the logs, and what did I see? I saw, apparently, an implementation of merge-in-sorted-order not working. I had to merge the following lists of revisions, eliminating duplicates, in reverse chronological order (most recent revision first):

  • L1: 114223615 114213195 114177557 114172306 114154003 114153297 114149504 114146133
  • L2: 114213195 114177557 85922013 85906965
  • Result: 114213195 114223615 114213195 114177557 114172306 114154003 114153297 114149504 114146133 85922013 85906965

The revision 114213195 appeared twice in the result! This broke the analysis.

A mistake in my merge algorithm? No. It was just a few lines long; I had tested it many times, and stared at it even longer. No, the algorithm was fine.
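For concreteness, a merge of this kind can be sketched as follows. This is a minimal illustration, not the actual WikiTrust code; it assumes both input lists are sorted in decreasing order, most recent revision id first:

```ocaml
(* Merge two lists sorted in decreasing order, dropping duplicates.
   A hypothetical sketch, not the actual WikiTrust implementation. *)
let rec merge_desc l1 l2 =
  match (l1, l2) with
  | ([], l) | (l, []) -> l
  | (x :: xs, y :: ys) ->
      if x > y then x :: merge_desc xs l2
      else if y > x then y :: merge_desc l1 ys
      else x :: merge_desc xs ys  (* equal heads: keep one copy *)
```

Note that such a merge is correct only if its inputs really are sorted; if an element is out of place, a duplicate can slip through without ever being adjacent to its twin, which, as it turned out, is exactly what happened.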

A mistake in the data structure used to represent lists? I use my own data structure, called Vec, which is a functional implementation of extensible arrays that supports arbitrary element deletion and insertion. That data structure is used all over the place in my WikiTrust code, so it would be both serious and strange to find an error in it after it had been in use for so long. Still, everything is possible, so I instrumented the library in detail. No, Vec worked fine; the problem was somewhere else.

A problem caused by memory corruption? I ran the code on both 32-bit and 64-bit systems with different memory sizes, and it still broke in the same way, so this was unlikely.

Then I looked closely at what those revisions were, and why the merge behaved so strangely, and I discovered the bug.

Those lists of revisions were returned by the database, and they were ordered according to their timestamp, which is a string like 20070311025946 (year, month, day, hour, minute, second).
WikiTrust internally uses seconds since 1970 instead, as most Unix systems do, so that it can easily compute the amount of time elapsed between revisions. It turned out that the revisions returned by the database were not sorted according to their Unix time (at least, as I computed it)!
I had:

Revision 1: timestamp: 20070311025946 seconds: 1173610786
Revision 2: timestamp: 20070311035738 seconds: 1173610658

What was going on?

So, I dove into how times are converted, and I figured it out.
The problem is that Revision 1 was made between 2am and 3am of the morning when daylight savings time changed, and Revision 2 was made between 3am and 4am of that same morning.
The wrong conversion from timestamp to Unix time was due to a shortcoming of the OCaml time conversion functions. It turns out that the OCaml conversion function is based on local time, depends on daylight savings time, and provides no way to remove the dependence on daylight savings time. So the first timestamp was converted using the standard-time offset, and the second using the daylight-time offset, which is one hour smaller; that is how a later timestamp came to map to an earlier Unix time.

As a solution, I implemented my own time-conversion functions (at least I know what is in them), and I also added other safeguards to ensure that the order used in the database matches the order used in the computation.
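To give an idea, a conversion of this kind can be written with pure calendar arithmetic, never consulting the local timezone or DST rules. The sketch below (not my actual code; the function names are made up for illustration) parses a MediaWiki-style UTC timestamp and computes Unix time via the standard civil-calendar day-count formula:

```ocaml
(* Days since 1970-01-01 of a (year, month, day) date,
   via the standard civil-calendar formula. *)
let days_from_civil y m d =
  let y = if m <= 2 then y - 1 else y in
  let era = (if y >= 0 then y else y - 399) / 400 in
  let yoe = y - era * 400 in
  let doy = (153 * (if m > 2 then m - 3 else m + 9) + 2) / 5 + d - 1 in
  let doe = yoe * 365 + yoe / 4 - yoe / 100 + doy in
  era * 146097 + doe - 719468

(* Convert a UTC timestamp such as "20070311025946" to Unix time,
   with no dependence on the local timezone or on DST. *)
let time_of_timestamp ts =
  let f pos len = int_of_string (String.sub ts pos len) in
  let days = days_from_civil (f 0 4) (f 4 2) (f 6 2) in
  days * 86400 + (f 8 2) * 3600 + (f 10 2) * 60 + (f 12 2)
```

With a conversion like this, the two timestamps above map to increasing Unix times, as they should, regardless of where or when the code runs.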

Still, this was one of the hardest-to-find bugs I have seen lately. It occurred only when two edits, spaced less than one hour apart, straddled the boundary at which daylight savings time changed in Spring. Moreover, it caused a failure only in the rare case when the merge of the lists produced non-consecutive duplicates: consecutive duplicates were eliminated by the merge, and the algorithm worked just fine on merely out-of-order revisions (this is why it did not break once a year, every Spring, for many revisions).

The fact that developing the code requires detective skills makes it both intriguing and frustrating.

And I suspect that, like most scientists, I hate, and I do mean hate, daylight savings time. It is the cause of no end of trouble in anything that needs to examine time series over an extended period.

Saturday, February 27, 2010

The font of whitespace

Or: on the chasm between software developers and software users.

Once, I was editing a document in MS Word or OpenOffice, I don't recall which.
I had to add a few bullet points to a list. Try as I might, the bullets would not line up.
My bullets were invariably a little too far to the left, or to the right, and the whole vertical line of bullets was wavy. I was going crazy. Finally, I asked a friend for help. He immediately diagnosed the problem: see, your whitespace is in the wrong font!

That sentence has remained for me an epitome of the chasm between computer programmers and users. To a user, the notion of the font of whitespace is utterly meaningless; it is more or less like discussing whether the wine in an empty bottle is red or white -- there is no wine! But to a computer scientist focused on writing the details of the code, it makes perfect sense: you type a space, I show a space, and the space has a font-dependent length, to make words come out nicely when there is a single space between them.

This particular instance of the chasm is not so easily solved. Most low-end word-processing systems suffer from it; I am thankful to the genius of Don Knuth for making this a non-issue in TeX.

But if you look around, you will see countless instances of the problem of the font of whitespace: cases in which the design makes perfect sense from the point of view of the software developers, but is utterly mystifying from the point of view of a user with a simple task that needs getting done.