Sunday, January 10, 2010

Fingerprint File Format

I proposed a fingerprint file format for small molecule chemistry. Here's the place to leave comments or suggestions.

5 comments:

Stewart Adcock said...

Hi Andrew,

Interesting topic, and you propose an interesting solution.

Last year we (MEDIT) approached a similar set of use cases. To summarize our solution, we encoded our fingerprints in ASCII text and stored them in the data block of SD files. We also encoded any essential fingerprint parameters (e.g., bit count, length of topological paths for "Daylight-style" fps, etc.) in the data key. It's not pretty, but it works.

I think from an efficiency perspective, your solution will be the clear winner.

Irrespective of any technical merits, however, I think the real killer benefit here would be a format that is supported across tools. I doubt we are alone in having internal tools that would work with arbitrary fingerprints, but currently we can only use our own internal fp generation software because that is what creates them in the required format.

I wholeheartedly encourage you to set up a public repository...

Stewart

Andrew Dalke said...

Hi Stewart and thanks for your feedback and support.

I'm curious to know a bit more about how you encode your information in SD files. Are the fingerprints encoded in hex or something else, and what is the bit order? If you haven't seen it, the CACTVS substructure key encoding uses base64 encoding to try to make that information more compact at 6 bits per byte instead of 4.
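To put rough numbers on the hex versus base64 trade-off, here is a minimal Python sketch. The 1024-bit fingerprint is made-up data, and this is not the CACTVS encoder or anyone's actual code, just the standard library's hex and base64 routines applied to the same bytes.

```python
import base64
import binascii

# Hypothetical 1024-bit fingerprint held as 128 raw bytes (made-up data).
fp_bytes = bytes(range(128))

# Hex carries 4 bits per character, so 2 characters per byte.
hex_text = binascii.hexlify(fp_bytes).decode("ascii")

# Base64 carries 6 bits per character, so 4 characters per 3 bytes (padded).
b64_text = base64.b64encode(fp_bytes).decode("ascii")

hex_len = len(hex_text)
b64_len = len(b64_text)
print(hex_len, b64_len)
```

For 128 bytes that works out to 256 hex characters against 172 base64 characters, padding included.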

Could you send me more information about the generation parameters you encode? I'm trying to get a feel for how to handle that and would like to see what others did.

My current thought is that some parameters are very specific to an implementation and should be encoded in some normalized way, so that other code can give a warning when two fingerprint data sets are not compatible.

But fingerprint folding, and perhaps other parameters, are more universal and might be stored differently.

I've started talking with some of the vendors - mostly the free toolkits but one commercial vendor as well - to see if there's interest. Once I hear back from them I'll likely set up a public repository.

Cheers!

Stewart Adcock said...

The encoding we use depends on the type of fingerprint. For dense bitstrings we also use Base64 encoding. These use little-endian ordering - but for no better reason than that is how we store the fp in memory.
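As an illustration only, one possible reading of "little-endian" for a dense bitstring is that bit i is stored in byte i // 8 at the (i % 8)-th least significant position. That convention, and the function names below, are assumptions made for this sketch rather than a description of the MEDIT code.

```python
import base64

def bits_to_base64(on_bits, num_bits):
    """Pack set-bit positions into bytes, then base64 encode them.

    Assumed convention: bit i lives in byte i // 8, at position i % 8
    counted from the least significant bit of that byte.
    """
    data = bytearray((num_bits + 7) // 8)
    for i in on_bits:
        data[i // 8] |= 1 << (i % 8)
    return base64.b64encode(bytes(data)).decode("ascii")

def base64_to_bits(text):
    """Invert bits_to_base64, returning the sorted set-bit positions."""
    data = base64.b64decode(text)
    return [i for i in range(len(data) * 8) if (data[i // 8] >> (i % 8)) & 1]

encoded = bits_to_base64([0, 9], 16)
decoded = base64_to_bits(encoded)
```

Two tools that disagree on this convention would decode the same text into different bit positions, which is why the bit order needs to be written down somewhere.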

For other fingerprint types, we simply use a comma-separated list of integers or floating-point values. This would be dreadful for storing long fingerprints, but ours are very short so there is no problem. To fit with your PNG-based idea, I can imagine that these could readily be represented as a sparse bitstring and stored using something like run-length encoding.
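To make the sparse idea concrete, here is a small Python sketch of one run-length style scheme: each set bit is stored as the number of zero bits that precede it since the previous set bit (gap encoding). It assumes the fingerprint is a list of distinct non-negative integer bit positions, and the function names are hypothetical.

```python
def run_length_encode(on_bits):
    """Encode distinct set-bit positions as zero-run lengths between them."""
    runs = []
    previous = -1
    for position in sorted(on_bits):
        runs.append(position - previous - 1)  # zeros since the last set bit
        previous = position
    return runs

def run_length_decode(runs):
    """Rebuild the sorted set-bit positions from the zero-run lengths."""
    positions = []
    current = -1
    for gap in runs:
        current += gap + 1
        positions.append(current)
    return positions

runs = run_length_encode([3, 4, 1000])
restored = run_length_decode(runs)
```

A fingerprint with only a few set bits spread over a large range then costs one small integer per set bit rather than one bit per possible position.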

I guess common generation parameters would include degree of folding or folded fingerprint length. As you say, these might be universal properties. Our internal fingerprints additionally have some parameters that we like to tune. The specifics aren't interesting, except to say that if the parameters aren't identical, the fingerprints aren't comparable. Recording these as optional and arbitrary key-value pairs would be sufficient for our purposes.
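A rough sketch of how arbitrary key-value parameters could drive a compatibility check: treat two fingerprint sets as comparable only if every recorded parameter matches. The parameter names and the helper function are invented for illustration and are not part of any proposed format.

```python
_MISSING = object()  # marks a key that one side did not record at all

def incompatible_parameters(params_a, params_b):
    """Return the sorted keys whose values differ or exist on only one side."""
    keys = sorted(set(params_a) | set(params_b))
    return [
        key for key in keys
        if params_a.get(key, _MISSING) != params_b.get(key, _MISSING)
    ]

# Hypothetical parameter sets for two fingerprint files.
set_a = {"num_bits": "1024", "max_path_length": "7"}
set_b = {"num_bits": "1024", "max_path_length": "5"}

mismatches = incompatible_parameters(set_a, set_b)
if mismatches:
    print("Warning: fingerprints are not comparable; differing parameters:",
          ", ".join(mismatches))
```

A reader that does not understand a given key can still compare it as an opaque string, which is enough to raise the warning.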

Even if it were supported by only one major vendor, there would probably be sufficient value for us to include it in one of our commercial tools as well. But this is the important point for me: the specification of the format doesn't matter as much as the prospect of interoperability.

Stewart

Andrew Dalke said...

Thanks for the additional info, Stewart. I don't have enough experience with sparse fingerprints over large values to be able to support that well, so for now I'm just working on dense fingerprints, like the ones you base64 encode.

I'll see what sort of response I get from others. Perhaps I'll do a poster on this for OpenEye's CUP, and for the Sheffield conference.

Andrew Dalke said...

Given the feedback I got from various people, I've set up a new Google Code project for this, at http://code.google.com/p/chem-fingerprints/ . I'll announce it more on my feed in the next couple of days.