Saturday, October 03, 2009
DNA searching is basically a cross-correlation process, implemented with what amounts to a convolution algorithm if you use numbers to symbolize DNA nucleotides. The same is true for searching amino acid sequences. In fact, amino acids can be coded by chemical similarity, similar chemical properties being assigned nearby numbers.
The thing is, straight-up convolution is sloooow. If the sequence to be searched is N units long, and the sequence searched for is M units long, then it takes O(N*M) work to do a full convolution. If N=10e+9 and M=1e+3, typical values for genome searching, that's 1e+12 computations per query. Ouch.
It occurred to me that there is a much better way of doing this. The Fourier transform of a convolution turns out to be the point-wise product of the Fourier transform of the things being convolved:
F[ convolution(f, g) ] = F[f] * F[g]
For our digital data, the Fourier transform can be done with the Fast Fourier Transform, a clever algorithm with a computing workload of O(N * log N), loads better than O(N2) for naive convolution. The inverse FFT, F-1, is equally fast.
With a little rearrangement, our convolution becomes simply a pair of FFTs, a pointwise product, and an inverse FFT:
convolution(f, g) = F-1[ F[f] * F[g] ]
The Fourier transform of the sequence library you are searching can be precomputed and stored, saving computation. The Fourier transform of the sequence you are searching for can probably be computed with a simplified algorithm, since it contains so little data compared to the library sequence.
So you can do full point-by-point DNA alignment with reasonable computer workload. You just have to be willing to take the Fourier transform of a chemical structure, which is a bit unconventional. Although not really unconventional compared to the things quantum physicists take the Fourier transform of.
But it turns out it was already invented [ref 2] when I was 8 years old.
Dang. Why are all the good ideas already taken?