- Molecular fingerprints, background
- Generating molecular fingerprints with OpenBabel
- Computing Tanimoto scores, quickly
- Block Tanimoto calculations
- Faster block Tanimoto calculations
- HAKMEM 169 and other popcount implementations
- Testing the bitslice algorithm for popcount
Feel free to leave comments about them here.
18 comments:
It would be interesting to compare against the following canonical bit-counting implementation (MIT HAKMEM). Your 64k lookup table doesn't fit within the L1 data cache of a Core 2 Duo and, in any case, will probably be partially evicted by the surrounding code. Depending on the miss rate, and on how well you can delay needing the result of the modulus (~17 cycles on a Core 2 Duo, apparently), it's possible that this performs better. Because the division is by a constant, it may also be possible to transform it into a multiplication.
Have you run this code using vtune or oprofile?
There's also an SSE4 bit-count instruction available on recent processor revisions (possibly currently only AMD's SSE4a implementation).
uint32_t bitCount(uint32_t n) {
    uint32_t tmp;
    tmp = n - ((n >> 1) & 033333333333) - ((n >> 2) & 011111111111);
    return ((tmp + (tmp >> 3)) & 030707070707) % 63;
}
Hi Toby, and thanks for your comments.
On my machine (MacBook Pro, gcc 4.0.1), using a test harness where I compute the Tanimoto of hard-coded values (no file I/O, nor page swapping), I get 11 seconds using __builtin_popcount, 7 seconds with the bitCount you posted, and 4 seconds using an 8-bit lookup table.
Early on, when investigating the approaches to take, I used http://www.a1k0n.net/temp/popcnt.c.txt as a benchmark, which reported:
FreeBSD version 1 : 3789941 us; cnt=32511665
FreeBSD version 2 : 3790827 us; cnt=32511665
16-bit LUT : 2299615 us; cnt=32511665
8-bit LUT : 5266739 us; cnt=32511665
8x shift and add : 21415756 us; cnt=32511665
32x shift and add : 19378120 us; cnt=32511665
You can see that the 16-bit lookup table was the fastest.
Running the FreeBSD version 2 algorithm in my test harness takes 5 seconds, which is faster than your bitCount algorithm.
The 8-bit LUT implementation is slow because it uses

    cnt += lut[i & 255];
    cnt += lut[i >> 8 & 255];
    cnt += lut[i >> 16 & 255];
    cnt += lut[i >> 24 & 255];

Replacing it with

    cnt += (lut[i & 255] +
            lut[i >> 8 & 255] +
            lut[i >> 16 & 255] +
            lut[i >> 24]);

reduced the runtime to 3626584 microseconds.
Hmm, I thought the cache was larger. In any case, the fingerprint bits are sparse. Only about 10% of them are on, on average, which means there are large parts of the table which are rarely accessed.
I have a Mac, which I think means neither VTune nor OProfile works on it. Last night I did learn how to use gcc with -fprofile-arcs and -fbranch-probabilities. Here's the effect on the popcnt.c benchmark (with my improved 8-bit implementation):
FreeBSD version 1 : 3758647 us; cnt=32511665
FreeBSD version 2 : 3422273 us; cnt=32511665
16-bit LUT : 2226455 us; cnt=32511665
8-bit LUT : 3865320 us; cnt=32511665
8x shift and add : 21359667 us; cnt=32511665
32x shift and add : 19419697 us; cnt=32511665
Not much of an effect.
Better might be to do the bit twiddling tricks on 64 bit long longs.
Yes, I would like to use the SSE4a (and, for Intel, SSE4.2) popcount instruction. I don't have a machine with that sort of chip in it.
If you're using OSX, you can get Shark (part of the CHUD tools that you can get from developer.apple.com).
I just added another essay in the series wherein I time a bunch of different algorithms using the same test harness so I (and you) could compare them directly.
May I use the C code in pgchem::tigress?
http://pgfoundry.org/projects/pgchem/
With proper attribution, of course...
Certainly. Though there's no need to ask as nothing I've done is new.
Very interesting, thanks a lot! Who knew a bit-twiddler could be faster than a look-up table...
PS: The last link at the top points to the wrong host (localhost instead of dalkescientific.com)...
Fixed the localhost - thanks!
Maybe you know this, maybe you don't, but the book "Beautiful Code" has a chapter 10, entitled "The Quest for an Accelerated Population Count".
Hope it's useful..
I didn't know about that chapter from "Beautiful Code." Thanks for the pointer! I'll see if I can borrow a copy from a friend.
I was able to preview pp147-150 online via Google Book Search. I also see someone published a few of the algorithms at http://www.crsr.net/Notes/PopulationCount.html which leads me to suspect that I've uncovered and tested more implementations than are described in the book.
This is a well-studied problem! :)
If you get an AMD Phenom PC or a Barcelona quad-core server, it has a single-cycle popcnt instruction, accessible from the GNU compiler via __builtin_popcount (32 bits) and __builtin_popcountl (64 bits).
The 4.2 compilers that support this are here:
http://developer.amd.com/CPU/GNU/Pages/default.aspx
Use -march=amdfam10
The amdfam10 support might also be in the regular 4.3 release by now.
I think this is the fastest popcnt currently available on any x86 machine.
I did mention that in passing but I see now I didn't say anything concrete about it.
I don't have, or have access to, a processor that supports that instruction, so I couldn't test it out. I expect that in a couple of years all of this work to find fast popcount algorithms will no longer be needed, because it will be in hardware and, I guess, 10x faster.
By the way, I tested the popcnt instruction on a Phenom 8450, and it shows really nice results:
SIMD : 1406015 us;
FreeBSD version 1 : 3885512 us;
FreeBSD version 2 : 2819444 us;
16-bit LUT : 1807962 us;
8-bit LUT : 2861644 us;
8-bit LUT v2 : 2834989 us;
Wikipedia #3 : 1481475 us;
gcc popcount : 1882315 us;
gcc popcountll : 797657 us;
HAKMEM 169/X11 : 4175283 us;
Bitslice(7) : 1504259 us;
Bitslice(24) : 1409280 us;
naive : 29348288 us;
Wegner/Kernigan : 17737151 us;
Anderson : 8812272 us;
8x shift and add : 12098142 us;
32x shift and add : 10302451 us;
The previously posted code contained a bug, so this is the fixed version. I believe a bitslice SIMD implementation could be even faster :)
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

static int popcount_fbsdsimd(uint64_t *buf, int n) {
    int cnt = 0;
    __m128i *x = (__m128i *)(void *)buf;
    x += (n /= 2);  /* n is now the number of 128-bit blocks */
    do {
        __m128i xmm = _mm_load_si128(&x[-n]);
        const __m128i msk55 = _mm_set1_epi32(0x55555555);
        const __m128i msk33 = _mm_set1_epi32(0x33333333);
        const __m128i msk0F = _mm_set1_epi32(0x0F0F0F0F);
        const __m128i mul01 = _mm_set1_epi32(0x01010101);
        /* xmm -= ((xmm >> 1) & 0x55555555); */
        __m128i tmp = _mm_and_si128(_mm_srli_epi32(xmm, 1), msk55);
        xmm = _mm_sub_epi32(xmm, tmp);
        /* xmm = (xmm & 0x33333333) + ((xmm >> 2) & 0x33333333); */
        tmp = _mm_and_si128(_mm_srli_epi32(xmm, 2), msk33);
        xmm = _mm_add_epi32(_mm_and_si128(xmm, msk33), tmp);
        /* xmm = (xmm + (xmm >> 4)) & 0x0F0F0F0F; */
        tmp = _mm_srli_epi32(xmm, 4);
        xmm = _mm_and_si128(_mm_add_epi32(xmm, tmp), msk0F);
        /* mix up: fold lanes 1 and 3 into lanes 0 and 2, then use the
           multiply to sum each half's byte counts into one byte */
        tmp = _mm_shuffle_epi32(xmm, _MM_SHUFFLE(3, 3, 1, 1));
        xmm = _mm_add_epi32(tmp, xmm);
        xmm = _mm_srli_epi64(_mm_mul_epu32(xmm, mul01), 24);
        cnt += ((unsigned char *)&xmm)[0] + ((unsigned char *)&xmm)[8];
    } while (--n);
    return cnt;
}
Is there a gcc popcount for 128-bit ints on 64-bit Linux?
Since I couldn't find one, I can only use two 64-bit popcnts to compute it.
----------
typedef unsigned int uint128 __attribute__((__mode__(TI)));

static inline int popcount_gcc_TI(uint128 *buf, int n) {
    int cnt = 0;
    while (n--) {
        cnt += __builtin_popcountll(*(unsigned long long *)buf)
             + __builtin_popcountll(*((unsigned long long *)buf + 1));
        ++buf;
    }
    return cnt;
}
----------
On Phenom 8450+x86_64 GNU/Linux...
FreeBSD version 1 : 5011208 us; cnt=32500610
FreeBSD version 2 : 2528886 us; cnt=32500610
16-bit LUT : 1642199 us; cnt=32500610
8-bit LUT : 2562202 us; cnt=32500610
8-bit LUT v2 : 2537370 us; cnt=32500610
Wikipedia #2 : 2125870 us; cnt=32500610
Wikipedia #3 : 1440718 us; cnt=32500610
gcc popcount : 861347 us; cnt=32500610
gcc popcountll : 799531 us; cnt=32500610
gcc popcountTI : 659075 us; cnt=32500610
HAKMEM 169/X11 : 4127204 us; cnt=32500610
Bitslice(7) : 1538298 us; cnt=32500610
Bitslice(24) : 1373552 us; cnt=32500610
naive : 22145508 us; cnt=32500610
Wegner/Kernigan : 15784335 us; cnt=32500610
Anderson : 8740204 us; cnt=32500610
8x shift and add : 10528122 us; cnt=32500610
32x shift and add : 9978773 us; cnt=32500610
Rather than optimizing your O(n) lookup, use a BK-tree with Hamming distance to get O(log n) lookups.
Hi Esc, xfr, and Anonymous, and thanks for your comments - including the ones I missed from earlier this year. It looks like I should update the popcount page now that it's been about 1.5 years.
Looking at the reported numbers, it seems that gcc's popcount should now be the default solution.
I hope to buy a new MacBook Pro soon, but I see now that even the latest machines don't support popcount on the chip.
Anonymous: I looked just now at the BK-tree solution. It might work for some of the use cases I deal with, so I'll have to try it out. I've implemented similar tree structures but found that the tree-manipulation overhead is usually a lot worse than doing an extremely fast brute-force search.
There will be problems if I want to support deletions and if I want to support non-metric comparisons, although I think that can be worked around for the case I have.
Thanks again everyone for the feedback!