Wednesday, October 31, 2007

OpenSMILES, molecular formulas and ANTLR

This is the place to comment on the various articles I wrote on parsing molecular formulas and SMILES strings with ANTLR.


Anonymous said...

Maybe you've mentioned it already, but why not PyParsing?

Andrew Dalke said...

I haven't explained it. I should. I'm using ANTLR because of it's back-end support for multiple languages. I want to get a SMILES and SMARTS parser working which is relatively portable to other languages.

Staying inside of Python, I prefer ply over PyParsing. I tried to get PyParsing to handle an indentation-based language like Python and couldn't figure it out, while I could with ply. I've also had more experience with the style ply uses, because of my previous experience with SPARK.

Anonymous said...

I think those warnings are because "separator" is misspelled as "seperator". That's a pretty smart compiler!

Anonymous said...

This is a minor optimization but
the generator should consider the following (leading whitespace replaced with . for readability since this forum software seems not to support pre):

lafunc = self.input.LA
mrfunc = self.matchRange
cnt1 = 0
while True: #loop1
..alt1 = 2
..LA1_0 = lafunc(1)
..if ((u'0' <= LA1_0 <= u'9')):
....alt1 = 1
..if alt1 == 1:
....count = lafunc(1)
....mrfunc(u'0', u'9')
....if cnt1 >= 1:
......break #loop1
....eee = EarlyExitException(1, self.input)
....raise eee
..cnt1 += 1

Andrew Dalke said...

My experience has been that that sort of optimization for this case won't give much speedup. In this case there is usually only one or two digits, so it saves only a couple of attribute lookups. Given everything else going on, that's at best 1% speedup at the expense of slightly harder to read code.

Benjamin Niemann wrote that he hasn't yet focused on speedups. Perhaps there are bigger gains, but it will be very hard to get a factor of 4 out of it to be faster than PLY.

Anonymous said...

I pleasantly discovered the timboot's recursive descent program. It would be nice for those who want to study a handcrafted parser.

Andrew Dalke said...

Where is "timboot's"? A few basic Google searches didn't help.

Anonymous said...

I was referring to tim peters sorry for the misspelling. I appreciate the graphical parse tree in your blog article.

the url is (googling molecular weight Python).

Anonymous said...

a considerable speedup in pyparsing can be achieved, when using ParserElement.enablePackrat()

Andrew Dalke said...

Thanks for the pointer to the packrat parser support in pyparsing. It looks like that was in pyparsing when I first wrote the article, but I didn't notice. Reading the documentation now, I see it has some problems when there are callbacks which change global information, but that's not the case here.

Molecular formulas are so small that I wonder if the packrat parser overhead would be a problem. Well, that's for timing tests to find out. I doubt I'll get back to this any time soon though.

John Machin said...

Subject: regex for matching atom names

# (This is more complicated than needed; it's to show how
# this approach can scale to all 100+ known and named elements)
# Creates a pattern like: Cl|C|H|O|S

Doesn't the pattern [A-Z][a-z]? do the job?

Anonymous said...

I enjoyed the molecular weight example. It was a very nice bootstrap into ANTLR. Thanks!

ptmcg said...

To John Machin's comment re: "[A-Z][a-z]?" to parse chemical symbols:

This accepts all one or two character combinations of upper+optional-lowercase letters, many of which are not valid chemical symbols. Also, there are a handful of "Uux" elements, too.

Here is a regex for the chemical symbols from the pyparsing wiki (of all places to find a regular expression!):


(Although I don't think this will parse the latest UuX element just discovered last month.)

Andrew Dalke said...

Hi ptmcg!

The Uu* names are temporary IUPAC systematic element name - placeholders. For example, Uun is as of 2003 known as Ds. You mentioned ununbium (element 112) which was recently officially recognised by IUPAC as an element, after about 13 years of work. Within about a year it should have its own two-letter symbol.

Support for such exotic atoms will be domain specific. For example, in chemistry there are people who use "D" and "T" as atomic symbols for deuterium and tritium, rather than 2H and 3H. They will also be time-dependent, as new elements are named.

As you pointed out, the pattern will match many symbols which are not elements. I think this is an esthetic choice. The tokenizer/parser does not need to be the sole validation system. For example, additional code - a lookup table - could then verify that a given symbol is correct.

As to the presence of that regular expression in pyparsing - didn't you add it based on my PLY code? The logs for that code show that "ptmcg" added it, and my emails have a ptmcg who wrote me on 25 Nov 2007 saying:

"I found your SMILES parser that you had written in PLY, and wondered what a pyparsing version might look like."

I suspect that that regexp is based on a regexp I myself wrote. ;)

John Machin said...

@Andrew: "the pattern will match many symbols which are not elements. I think this is an esthetic choice. The tokenizer/parser does not need to be the sole validation system. For example, additional code - a lookup table - could then verify that a given symbol is correct."

Yeah -- a lookup table like the table of atomic weights that you already had, and derived the regex from! Listing out all the possible values is inefficient, because Python's re doesn't generate a DFA for such expressions, and a common element like S would be near the end of the list. IOW it DOESN'T scale well.

Anonymous said...

It's great to find a detailed post about Python and antlr. I've been trucking away on it, and even bought Parr's book, but ran into a few python-specific issues which you cover. Thanks for sharing your thoughts!

emir said...

Does Antlr creates multiple parse trees for ambigious sentences? Because I want to use it for a natural language.

Andrew Dalke said...

@John Machin: It's been a few months and I just now noticed your comment: "IOW it DOESN'T scale well."

It doesn't need to scale well. There's only 100 odd elements. On my machine it takes 0.624 usec to match S using my exact regexp match, while the pattern [A-Z][a-z]? takes 0.512 usec. (Matching "Ac" with the complex regexp takes 0.524 usec.)

That is 20% slower, but there are other considerations to make. For example, if someone puts in "Sd", which is an element which doesn't exist. Should it try to process an "S" and then give an error when processing the "d", or should it say that "Sd" just doesn't exist?

In any case, the lookup table takes another 0.25 us, so the difference will depend on what you want to do. If it's only validating that the SMILES string is correct then leaving out the lookup step needed for post-match validation is faster.

@Anonymous: Thank you for your kind words.

@emir: I don't know if ANTLR handles that case. I don't do natural language processing and haven't had to worry about that.

ahtokca said...

>warning(200): MWGrammar.g:10:9: Decision can match input such as "'C'"
using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input

It's just an ambiguity thing.
Just use the following option
(options {greedy=true;} : ...)
to resolve parser nondeterminism.