Saturday, March 6, 2010

I hate daylight savings time

I just found one of the most mistifying bugs I had to deal with recently.

Once in a veeery long while, in very rare cases, the analysis of a Wikipedia page would break for no clear reason, after going through a few thousands revisions.
One of these was the page on Captain America.

I looked at the logs, and what did I see? I saw -apparently- an implementation of merge-in-sorted-oder not working. I had to merge the following lists of revisions, eliminating duplicates, in chronological order:

  • L1: 114223615 114213195 114177557 114172306 114154003 114153297 114149504 114146133
  • L2: 114213195 114177557 85922013 85906965
  • Result: 114213195 114223615 114213195 114177557 114172306 114154003 114153297 114149504 114146133 85922013 85906965

The revision 114213195 appeared twice in the results! This broke the analysis.

A mistake in my merge algorithm? No. It was just a few lines, tested it many times, stared at it even longer. No, the algorithm is fine.

A mistake in the data structure used to represent lists? I use my own data structure, called Vec, which is a functional implementation of extensible arrays that supports arbitrary element deletion and insertion. That data structure is used all over the place in my WikiTrust code, so it would be both serious, and strange, to find an error in it after such a long time it has been in use. Still, everything is possible, so I instrumented the library in detail. No, Vec worked fine; the problem was somewhere else.

A problem caused by memory corruption? I ran the code on both 32-but and 64-bit systems with different memory sizes, and it still broke in the same way, so this was unlikely.

Then I looked closely at what those revisions are, and why the merge behaved in a strange way, and I discovered the bug.

Those lists of revisions were returned from the database, and they were ordered according to their timestamp, which is a string like 20070311025946 (year month day hour minute seconds).
WikiTrust internally uses instead seconds since 1970, as most Unix systems, so that it can easily compute the amount of time elapsed between revisions. It turns out that the revisions returned from the database were not sorted according to their Unix time (at least, as I computed it)!
I had:

Revision 1: timestamp: 20070311025946 seconds: 1173610786
Revision 2: timestamp: 20070311035738 seconds: 1173610658

What was going on?

So, I dove into how times are converted, and I figured it out.
The problem is that Revision 1 was done between 2am and 3am of the morning when daylight savings time changed, and Revision 2 was done between 3am and 4am of that same morning.
The wrong conversion from timestamp to Unix time was due to a shortcoming of the Ocaml time conversion functions. It turns out that the Ocaml conversion function is based on local time, depend on daylight savings time, and does not provide for a way to remove the daylight savings time dependence.

As a solution, I implemented my own time-conversion functions (at least I know what is in them), and I added also other safeguards to ensure that the order used in the database matches the order used in the computation.

Still, this was one of the hardest-to-find bugs I have seen lately. It occurred only when there were two edits spaced by less than one hour across the boundary when daylight savings time changed in Spring. Moreover, it would occur only in the rare case when the merge of the list produced non-consecutive duplicates: consecutive duplicates were eliminated by the merge, and the algorithm worked just fine on simply out-of-order revisions (this is why it did not break once a year every Spring for many revisions).

The fact that developing the code requires detective skills makes it both intriguing, and frustrating.

And I suspect like most scientists, I hate, and I say hate, daylight savings time. It is the cause of no end of trouble in anything that needs to examine time series over an extended period.