Friday, June 27, 2008

molecular fingerprints

I wrote a set of essays on molecular fingerprints:


  1. Molecular fingerprints, background

  2. Generating molecular fingerprints with OpenBabel

  3. Computing Tanimoto scores, quickly

  4. Block Tanimoto calculations

  5. Faster block Tanimoto calculations

  6. HAKMEM 169 and other popcount implementations

  7. Testing the bitslice algorithm for popcount




Feel free to leave comments about them here.

18 comments:

Toby Sargeant said...

It would be interesting to compare against the following canonical implementation of bit counting (MIT HACKMEM). Your 64k lookup table doesn't fit within the L1 data cache of a core 2 duo, and, in any case, will probably be partially evicted by surrounding code. Depending on the miss rate, and how well you can delay needing the result of the modulus (~17 cycles on core 2 duo, apparently), that this performs better. Because the division is by a constant, it may also be possible to transform it into a multiplication.

Have you run this code using vtune or oprofile?

There's also an SSE4 bit count instruction available for recent processor revisions (possibly currently only the AMD implementation of SSE4a).

uint32_t bitCount(uint32_t n) {
uint32_t tmp;

tmp = n - ((n >> 1) & 033333333333) - ((n >> 2) & 011111111111);

return ((tmp + (tmp >> 3)) & 030707070707) % 63;
}

Andrew Dalke said...

Hi Toby, and thanks for your comments.

On my machine (MacBook Pro, gcc 4.0.1) using a test harness where I compute the Tanimoto of hard-coded value (no file I/O, nor page swapping), I get 11 second using __builtin_popcount, 7 seconds with the bitCount you used, and 4 seconds using an 8 bit lookup table.

Early on when investigating the approaches to take, I used as a benchmark http://www.a1k0n.net/temp/popcnt.c.txt which reported

FreeBSD version 1 : 3789941 us; cnt=32511665
FreeBSD version 2 : 3790827 us; cnt=32511665
16-bit LUT : 2299615 us; cnt=32511665
8-bit LUT : 5266739 us; cnt=32511665
8x shift and add : 21415756 us; cnt=32511665
32x shift and add : 19378120 us; cnt=32511665

You can see that the 16-bit lookup table was the fastest.

Taking the FreeBSD version 2 algorithm in my test harness takes 5 seconds, which is faster than your bitCount algorithm.

The 8-bit LUT implementation is slow because it uses

cnt += lut[i&255];
cnt += lut[i>>8&255];
cnt += lut[i>>16&255];
cnt += lut[i>>24&255];

replacing it with

cnt += (lut[i&255] +
lut[i>>8&255] +
lut[i>>16&255] +
lut[i>>24]);

reduced the runtime to 3626584 microseconds.

Hmm, I thought the cache was larger. In any case, the fingerprint bits are sparse. Only about 10% of them are on, on average, which means there's large parts of the table which are rarely accessed.

I have a Mac, which I think means vtune nor oprofile work on it. I did last night learn how to use gcc with -fprofile-arcs and -fbranch-probabilities. Here's the effect on the popcnt.c benchmark (with my improved 8-bit implementation)

FreeBSD version 1 : 3758647 us; cnt=32511665
FreeBSD version 2 : 3422273 us; cnt=32511665
16-bit LUT : 2226455 us; cnt=32511665
8-bit LUT : 3865320 us; cnt=32511665
8x shift and add : 21359667 us; cnt=32511665
32x shift and add : 19419697 us; cnt=32511665

Not much of an effect.

Better might be to do the bit twiddling tricks on 64 bit long longs.

Yes, I would like to use the SSE4a (and for Intel SSE4.2) popcount instruction. I don't have a machine with that sort of chip on it.

Toby Sargeant said...

If you're using OSX, you can get Shark (part of the CHUD tools that you can get from developer.apple.com).

Andrew Dalke said...

I just added another essay in the series wherein I time a bunch of different algorithms using the same test harness so I (and you) could compare them directly.

Ernst-Georg Schmid said...

May I use the C code in pgchem::tigress?

http://pgfoundry.org/projects/pgchem/

With proper comment of course...

Andrew Dalke said...

Certainly. Though there's no need to ask as nothing I've done is new.

Anonymous said...

Very interesting, thanks a lot! Who knew a bit-twiddler could be faster than a look-up table...

PS: The last link at the top has wrong linking (there is localhost, instead of dalkescientific.com)...

Andrew Dalke said...

Fixed the localhost - thanks!

Anonymous said...

Maybe you know this, maybe you don't but the book "Beautiful code" has a chapter 10, entitled "The Quest for an Accelerated Population Count".
Hope it's useful..

Andrew Dalke said...

I didn't know about that chapter from "Beautiful Code." Thanks for the pointer! I'll see if I can borrow a copy from a friend.

I was able to preview pp147-150 online via Google Book Search. I also see someone published a few of the algorithms at http://www.crsr.net/Notes/PopulationCount.html which leads me to suspect that I've uncovered and tested more implementations than are described in the book.

This is a well-studied problem! :)

John said...

If you get an AMD Phenom PC or Barcelona quad core server it has a single cycle popcnt instruction accessible from the gnu compiler __builtin_popcount (32bits) and __builtin_popcountl
(64bits)

4.2 compilers that support this are are here -

http://developer.amd.com/CPU/GNU/Pages/default.aspx

Use -march=amdfam10

The amdfam10 support might also be in regular 4.3 release by now.

I think this is the fastest popcnt currently available on any x86 machine.

Andrew Dalke said...

I did mention that in passing but I see now I didn't say anything concrete about it.

I don't have or have access to a processor that supports that instruction so I couldn't test it out. I expect that in a couple of years all of this work to find fast popcount algorithms will no longer be needed, because it will be in hardware and, I guess, 10x faster.

Esc said...
This comment has been removed by the author.
Esc said...

By the way, i tested popcnt instruction on Phenom 8450, and it shows really nice results:
SIMD : 1406015 us;
FreeBSD version 1 : 3885512 us;
FreeBSD version 2 : 2819444 us;
16-bit LUT : 1807962 us;
8-bit LUT : 2861644 us;
8-bit LUT v2 : 2834989 us;
Wikipedia #3 : 1481475 us;
gcc popcount : 1882315 us;
gcc popcountll : 797657 us;
HAKMEM 169/X11 : 4175283 us;
Bitslice(7) : 1504259 us;
Bitslice(24) : 1409280 us;
naive : 29348288 us;
Wegner/Kernigan : 17737151 us;
Anderson : 8812272 us;
8x shift and add : 12098142 us;
32x shift and add : 10302451 us;

Esc said...

Previously posted code contains a bug so that's fixed version. I believe Bitslice SIMD realization could be even more fast:)
static int popcount_fbsdsimd(uint64 *buf, int n) {
int cnt=0;
int xx = 0;

__m128i* x = (__m128i*)(void*)buf;
x+=(n/=2);
do {

__m128i xmm = _mm_load_si128(&x[-n]);

const __m128i msk55 = _mm_set1_epi32(0x55555555);
const __m128i msk33 = _mm_set1_epi32(0x33333333);
const __m128i msk0F = _mm_set1_epi32(0x0F0F0F0F);
const __m128i mul01 = _mm_set1_epi32(0x01010101);

//xmm -= ((xmm >> 1) & 0x55555555);
__m128i tmp = _mm_and_si128(_mm_srli_epi32(xmm,1), msk55);
xmm = _mm_sub_epi32(xmm, tmp);
//xmm = (xmm & 0x33333333) + ((xmm >> 2) & 0x33333333);
tmp = _mm_and_si128(_mm_srli_epi32(xmm, 2), msk33);
xmm = _mm_add_epi32(_mm_and_si128(xmm, msk33),tmp);
//xmm = (xmm + (xmm >> 4)) & 0x0F0F0F0F;
tmp = _mm_srli_epi32(xmm,4);
xmm = _mm_and_si128(_mm_add_epi32(xmm,tmp),msk0F);
// .. mix up
tmp = _mm_shuffle_epi32(xmm, _MM_SHUFFLE(3,3,1,1));
xmm = _mm_add_epi32(tmp,xmm);
xmm = _mm_srli_epi64(_mm_mul_epu32(xmm,mul01), 24);
cnt+=((unsigned char*)&xmm)[0]+((unsigned char*)&xmm)[8];

} while (--n);
return cnt;
}

xfr said...

Is there a gcc popcount for 128-bit int on 64-bit Linux?
Since I couldn't find one, I can only use 2 64-bit popcnt to calculate it.
----------
typedef unsigned int uint128 __attribute__((__mode__(TI)));

static inline int popcount_gcc_TI(uint128 *buf, int n) {
int cnt=0;
while (n--) {
cnt += __builtin_popcountll(*(unsigned long long *)buf) + __builtin_popcountll(*(((unsigned long long *)buf)+1));
++buf;
}
return cnt;
}
----------

On Phenom 8450+x86_64 GNU/Linux...

FreeBSD version 1 : 5011208 us; cnt=32500610
FreeBSD version 2 : 2528886 us; cnt=32500610
16-bit LUT : 1642199 us; cnt=32500610
8-bit LUT : 2562202 us; cnt=32500610
8-bit LUT v2 : 2537370 us; cnt=32500610
Wikipedia #2 : 2125870 us; cnt=32500610
Wikipedia #3 : 1440718 us; cnt=32500610
gcc popcount : 861347 us; cnt=32500610
gcc popcountll : 799531 us; cnt=32500610
gcc popcountTI : 659075 us; cnt=32500610
HAKMEM 169/X11 : 4127204 us; cnt=32500610
Bitslice(7) : 1538298 us; cnt=32500610
Bitslice(24) : 1373552 us; cnt=32500610
naive : 22145508 us; cnt=32500610
Wegner/Kernigan : 15784335 us; cnt=32500610
Anderson : 8740204 us; cnt=32500610
8x shift and add : 10528122 us; cnt=32500610
32x shift and add : 9978773 us; cnt=32500610

Anonymous said...

Rather than optimizing your O(n) lookup, use a bk-tree with hamming distance to get O(log n) lookups.

Andrew Dalke said...

Hi Esc, xfr, and Anonymous, and thanks for your comments - including the ones I missed from earlier this year. It looks like I should update the popcount page now that it's been about 1.5 years.

Looking at the reported numbers, it seems that gcc's popcount should now be the default solution.

I hope to buy a new MacBook Pro soon, but I see now that even the latest machines don't support popcount on the chip.

Anonymous: I looked just now at the bk-tree solution. It might work for some of the use cases I deal with, so I'll have to try it out. I've implemented similar tree structures but found that the tree manipulation overhead is usually a lot worse than doing extremely fast brute search.

There will be problems if I want to support deletions and if I want to support non-metric comparisons, although I think that can be worked around for the case I have.

Thanks again everyone for the feedback!