Wednesday, August 3, 2016

Reading an ASCII file in Python 3.5 is 2-3x faster as bytes than as a string

I wrote an essay comparing read performance in Python 3.5 between bytes and Unicode text, including the text-mode options 'newline' and 'encoding'. I couldn't get the Unicode string performance to within a factor of 2 of the binary byte performance, so chemfp will work with bytes, not strings.
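To give a feel for the kind of comparison involved, here is a minimal sketch, not the benchmark from the essay. The file contents, the line count, and the helper names (`make_ascii_file`, `count_lines_bytes`, `count_lines_text`) are all made up for illustration, and the timings it prints will vary by machine and Python version.

```python
import os
import tempfile
import timeit

def make_ascii_file(num_lines=200000):
    """Write a throwaway ASCII file (hypothetical contents) and return its path."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "wb") as f:
        for i in range(num_lines):
            f.write(b"CCO record%d\n" % i)
    return path

def count_lines_bytes(path):
    """Iterate over the file in binary mode; each line is a bytes object."""
    with open(path, "rb") as f:
        return sum(1 for _ in f)

def count_lines_text(path, encoding="ascii", newline="\n"):
    """Iterate over the file in text mode; each line is decoded to a str."""
    with open(path, "r", encoding=encoding, newline=newline) as f:
        return sum(1 for _ in f)

if __name__ == "__main__":
    path = make_ascii_file()
    try:
        # Take the best of several runs to reduce timing noise.
        t_bytes = min(timeit.repeat(lambda: count_lines_bytes(path), number=1, repeat=5))
        t_text = min(timeit.repeat(lambda: count_lines_text(path), number=1, repeat=5))
        print("bytes: %.3f s" % t_bytes)
        print("text:  %.3f s" % t_text)
        print("ratio: %.2fx" % (t_text / t_bytes))
    finally:
        os.remove(path)
```

Changing the `encoding` and `newline` arguments passed to `count_lines_text` is one way to explore how much of the text-mode overhead those two options account for.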

I also checked how RDKit handles invalid Unicode, to see what another toolkit does with the same problem. I concluded that it uses bytes internally and exposes strings, which causes problems when those bytes cannot be converted to strings.
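The underlying issue can be shown in plain Python, without any toolkit. The sketch below does not use RDKit or reproduce its behavior; it only shows what happens when bytes that are not valid UTF-8 have to become a string. The sample bytes and the helper name `try_decode` are made up for illustration.

```python
# A record whose title contains a Latin-1 encoded "é" (byte 0xE9).
# On its own, followed by a space, that byte is not valid UTF-8.
raw = b"caf\xe9 CCO"

def try_decode(data, encoding="utf-8"):
    """Return the decoded string, or None if the bytes are invalid for the encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None

# Strict UTF-8 decoding fails, so there is no string to hand back.
assert try_decode(raw) is None

# Decoding as Latin-1 always succeeds, but only gives the right text
# if the bytes really were Latin-1.
print(raw.decode("latin-1"))

# "surrogateescape" smuggles the bad byte through as a lone surrogate,
# so the original bytes can be recovered on the way back out.
text = raw.decode("utf-8", errors="surrogateescape")
assert text.encode("utf-8", errors="surrogateescape") == raw
```

Working with bytes throughout avoids having to pick one of these strategies on the reader's behalf.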

This is the place to leave comments about that post.
