Wednesday, October 31, 2007

OpenSMILES, molecular formulas and ANTLR

This is the place to comment on the various articles I wrote on parsing molecular formulas and SMILES strings with ANTLR.

19 comments:

Anonymous said...

Maybe you've mentioned it already, but why not PyParsing?

Andrew Dalke said...

I haven't explained it. I should. I'm using ANTLR because of its back-end support for multiple languages. I want to get a SMILES and SMARTS parser working which is relatively portable to other languages.

Staying inside of Python, I prefer ply over PyParsing. I tried to get PyParsing to handle an indentation-based language like Python and couldn't figure it out, while I could with ply. I've also had more experience with the style ply uses, because of my previous experience with SPARK.

Anonymous said...

I think those warnings are because "separator" is misspelled as "seperator". That's a pretty smart compiler!
--dang

Anonymous said...

This is a minor optimization, but the generator should consider the following:

lafunc = self.input.LA
mrfunc = self.matchRange
cnt1 = 0
while True: #loop1
    alt1 = 2
    LA1_0 = lafunc(1)
    if ((u'0' <= LA1_0 <= u'9')):
        alt1 = 1
    if alt1 == 1:
        count = lafunc(1)
        mrfunc(u'0', u'9')
    else:
        if cnt1 >= 1:
            break #loop1
        eee = EarlyExitException(1, self.input)
        raise eee
    cnt1 += 1

Andrew Dalke said...

My experience has been that that sort of optimization won't give much speedup in this case. There is usually only one or two digits, so it saves only a couple of attribute lookups. Given everything else going on, that's at best a 1% speedup, at the expense of slightly harder-to-read code.

Benjamin Niemann wrote that he hasn't yet focused on speedups. Perhaps there are bigger gains, but it will be very hard to get a factor of 4 out of it to be faster than PLY.
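For anyone curious about the size of that effect, here is a rough micro-benchmark sketch. The Stream and Lexer classes are invented stand-ins for illustration, not ANTLR's actual classes; it just compares repeated self.input.LA attribute lookups against a bound method cached in a local variable:

```python
import timeit

class Stream:
    # stand-in for ANTLR's input stream; always reports a digit
    def LA(self, i):
        return u'5'

class Lexer:
    def __init__(self):
        self.input = Stream()

    def uncached(self):
        # two attribute lookups per call, as in the generated code
        for _ in range(5):
            self.input.LA(1)

    def cached(self):
        # hoist the bound method out of the loop
        lafunc = self.input.LA
        for _ in range(5):
            lafunc(1)

lexer = Lexer()
print("uncached:", timeit.timeit(lexer.uncached, number=100000))
print("cached:  ", timeit.timeit(lexer.cached, number=100000))
```

With only one or two digits per token the difference is small, which matches the point above.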

Anonymous said...

I was pleased to discover timboot's mol.py recursive descent program. It would be nice for those who want to study a handcrafted parser.

Andrew Dalke said...

Where is "timboot's mol.py"? A few basic Google searches didn't help.

Anonymous said...

I was referring to Tim Peters' molw.py; sorry for the misspelling. I appreciate the graphical parse tree in your blog article.

The URL is (found by googling "molecular weight Python"):
http://mail.python.org/pipermail/tutor/1999-March/000083.html

Anonymous said...

A considerable speedup in pyparsing can be achieved by using ParserElement.enablePackrat().

Andrew Dalke said...

Thanks for the pointer to the packrat parser support in pyparsing. It looks like that was in pyparsing when I first wrote the article, but I didn't notice. Reading the documentation now, I see it has some problems when there are callbacks which change global information, but that's not the case here.

Molecular formulas are so small that I wonder if the packrat parser overhead would be a problem. Well, that's for timing tests to find out. I doubt I'll get back to this any time soon though.

John Machin said...

Subject: regex for matching atom names

"""
# (This is more complicated than needed; it's to show how
# this approach can scale to all 100+ known and named elements)
[snip]
# Creates a pattern like: Cl|C|H|O|S
"""

Doesn't the pattern [A-Z][a-z]? do the job?

Anonymous said...

I enjoyed the molecular weight example. It was a very nice bootstrap into ANTLR. Thanks!

ptmcg said...

To John Machin's comment re: "[A-Z][a-z]?" to parse chemical symbols:

This accepts all one- or two-character combinations of an uppercase letter plus an optional lowercase letter, many of which are not valid chemical symbols. There are also a handful of "Uux" elements.

Here is a regex for the chemical symbols from the pyparsing wiki (of all places to find a regular expression!):

A[cglmrstu]|B[aehikr]?|C[adeflmorsu]?|
D[bsy]|E[rsu]|F[emr]?|G[ade]|H[efgos]?|
I[nr]?|Kr?|L[airu]|M[dgnot]|N[abdeiop]?|
Os?|P[abdmortu]?|R[abefghnu]|
S[bcegimnr]?|T[abcehilm]|Uu[bhopqst]|
U|V|W|Xe|Yb?|Z[nr]

(Although I don't think this will parse the latest UuX element just discovered last month.)
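A quick sketch of that pattern in use, with the alternation from the pyparsing wiki joined into a single raw string:

```python
import re

# the element-symbol alternation above, joined into one pattern
element = re.compile(
    r"A[cglmrstu]|B[aehikr]?|C[adeflmorsu]?|D[bsy]|E[rsu]|F[emr]?|"
    r"G[ade]|H[efgos]?|I[nr]?|Kr?|L[airu]|M[dgnot]|N[abdeiop]?|"
    r"Os?|P[abdmortu]?|R[abefghnu]|S[bcegimnr]?|T[abcehilm]|"
    r"Uu[bhopqst]|U|V|W|Xe|Yb?|Z[nr]")

print(element.findall("NaOH"))  # ['Na', 'O', 'H']
print(element.findall("CCl4"))  # ['C', 'Cl'] -- findall skips the digit
```

Unlike [A-Z][a-z]?, it rejects two-letter combinations that aren't element symbols.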

Andrew Dalke said...

Hi ptmcg!

The Uu* names are temporary IUPAC systematic element names, used as placeholders. For example, Uun has been known as Ds since 2003. You mentioned ununbium (element 112), which was recently officially recognised by IUPAC as an element, after about 13 years of work. Within about a year it should have its own two-letter symbol.

Support for such exotic atoms will be domain specific. For example, in chemistry there are people who use "D" and "T" as atomic symbols for deuterium and tritium, rather than 2H and 3H. They will also be time-dependent, as new elements are named.

As you pointed out, the pattern will match many symbols which are not elements. I think this is an esthetic choice. The tokenizer/parser does not need to be the sole validation system. For example, additional code - a lookup table - could then verify that a given symbol is correct.
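A minimal sketch of that two-stage idea: a permissive token pattern, then a lookup table for validation. The weight table here is a made-up fragment for illustration, not the full element list:

```python
import re

# made-up fragment of an atomic weight table, for illustration
ATOMIC_WEIGHTS = {"H": 1.008, "C": 12.011, "N": 14.007,
                  "O": 15.999, "S": 32.06, "Cl": 35.45}

# permissive tokenizer: any capital plus an optional lowercase letter
symbol = re.compile(r"[A-Z][a-z]?")

def validate(formula):
    # second stage: the lookup table decides what is a real element
    for match in symbol.finditer(formula):
        if match.group() not in ATOMIC_WEIGHTS:
            raise ValueError("unknown element: %r" % match.group())

validate("CCl4")        # fine
try:
    validate("Sd")      # tokenizes, but fails the lookup
except ValueError as err:
    print(err)
```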

As to the presence of that regular expression in pyparsing - didn't you add it based on my PLY code? The logs for that code show that "ptmcg" added it, and my emails have a ptmcg who wrote me on 25 Nov 2007 saying:

"I found your SMILES parser that you had written in PLY, and wondered what a pyparsing version might look like."

I suspect that that regexp is based on a regexp I myself wrote. ;)

John Machin said...

@Andrew: "the pattern will match many symbols which are not elements. I think this is an esthetic choice. The tokenizer/parser does not need to be the sole validation system. For example, additional code - a lookup table - could then verify that a given symbol is correct."

Yeah -- a lookup table like the table of atomic weights that you already had, and derived the regex from! Listing out all the possible values is inefficient, because Python's re doesn't generate a DFA for such expressions, and a common element like S would be near the end of the list. IOW it DOESN'T scale well.
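The ordering point is easy to demonstrate, since Python's re tries alternatives left to right rather than compiling them into a DFA:

```python
import re

# re tries alternatives left to right, so order matters:
print(re.match(r"C|Cl", "Cl").group())  # 'C'  -- stops at the first hit
print(re.match(r"Cl|C", "Cl").group())  # 'Cl' -- longer alternative must come first
```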

Anonymous said...

It's great to find a detailed post about Python and antlr. I've been trucking away on it, and even bought Parr's book, but ran into a few python-specific issues which you cover. Thanks for sharing your thoughts!

emir said...

Does ANTLR create multiple parse trees for ambiguous sentences? I want to use it for a natural language.

Andrew Dalke said...

@John Machin: It's been a few months and I just now noticed your comment: "IOW it DOESN'T scale well."

It doesn't need to scale well. There are only 100-odd elements. On my machine it takes 0.624 usec to match S using my exact regexp match, while the pattern [A-Z][a-z]? takes 0.512 usec. (Matching "Ac" with the complex regexp takes 0.524 usec.)

That is 20% slower, but there are other considerations. For example, suppose someone puts in "Sd", an element which doesn't exist. Should the parser process the "S" and then give an error on the "d", or should it say that "Sd" just doesn't exist?

In any case, the lookup table takes another 0.25 us, so the difference will depend on what you want to do. If it's only validating that the SMILES string is correct then leaving out the lookup step needed for post-match validation is faster.
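A rough version of that measurement can be made with timeit. The alternation below is abbreviated for space (the full pattern has branches for every element), and absolute numbers will differ by machine:

```python
import re
import timeit

simple = re.compile(r"[A-Z][a-z]?")
# abbreviated element alternation; the real pattern covers ~100 elements
exact = re.compile(r"A[cglmrstu]|B[aehikr]?|C[adeflmorsu]?|F[emr]?|"
                   r"H[efgos]?|N[abdeiop]?|Os?|S[bcegimnr]?")
weights = {"S": 32.06}  # stand-in lookup table

print(timeit.timeit(lambda: simple.match("S"), number=100000))
print(timeit.timeit(lambda: exact.match("S"), number=100000))
print(timeit.timeit(lambda: "S" in weights, number=100000))
```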

@Anonymous: Thank you for your kind words.

@emir: I don't know if ANTLR handles that case. I don't do natural language processing and haven't had to worry about that.

ahtokca said...

>warning(200): MWGrammar.g:10:9: Decision can match input such as "'C'"
using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input

It's just an ambiguity warning. Use the following option to resolve the parser nondeterminism:
(options {greedy=true;} : ...)