Monday, January 10, 2011

subgraph enumeration

I've started a series on subgraph enumeration. In Part 1 I present a brute force enumeration algorithm, modify it to remove duplicates, and post-process the results to generate canonical SMARTS for each subgraph. In Part 2 I develop a new algorithm which is about 3.5 times faster.

6 comments:

Dmitry Pavlov said...

Hello Andrew,

Thanks for the article, it is very interesting and reads easily.

Regarding your faster enumeration technique: To me, it looks similar to a more general technique called "reverse search" (Avis and Fukuda, 1993). Here is the brief description of the technique applied to the described problem:

1. Put the empty subgraph to the queue.
2. If the queue is empty, quit. Otherwise, take the subgraph from the queue.
3. If the last added edge on the subgraph is the biggest/smallest by number among all the hanging and ring edges on the subgraph, then generate the next subgraphs with all possible additions of one vertex/edge in the current subgraph and put them into the queue. Go to step 2.

In Indigo, we have an algorithm of this kind implemented as a part of our fingerprint builder which enumerates all the graph's subtrees (available online at GitHub).

A while ago, our fingerprint builder was based on enumeration of arbitrary subgraphs (not just subtrees), and we had the implementation of the subgraphs enumerator. We decided against the arbitrary subgraph-based fingerprints due to performance considerations, and withdrew the source code of subgraph enumerator from our repository. But now I see that it may have been a good idea to keep it (and provide Java/C#/Python wrappers), so that nobody would need to reinvent and rewrite the whole thing again :)

Best regards,

Dmitry

Dmitry Pavlov said...

Hello Andrew,

We have restored the edge subgraph enumeration facility in our Indigo toolkit. It uses the reverse search technique. It is different from the algorithm that you have invented, and for me it is hard to say which algorithm is faster.

The API for the enumeration of the edge subgraphs is provided in the most recent (beta7) version of Indigo. In Python, the enumeration will be as easy as:

from indigo import *
indigo = Indigo()

mol = indigo.loadMolecule("CC(C)CC")
mol.foldHydrogens() # if there were hydrogens in the structure
# here you specify minimum (1) and maximum (6) number of edges in the subraphs
for item in mol.iterateEdgeSubmolecules(1, 6):   # SMILES
  print item.index(), item.clone().smiles()
  # atom and bond indices in the original molecule
  print ' ', [atom.index() for atom in item.iterateAtoms()]
  print ' ', [bond.index() for bond in item.iterateBonds()]

The source code is available on GitHub.

Dmitry Pavlov said...

Hello Andrew,

We have restored the edge subgraph enumeration facility in our Indigo toolkit. It uses the reverse search technique. It is different from the algorithm that you have invented, and for me it is hard to say which algorithm is faster.

The API for the enumeration of the edge subgraphs is provided in the most recent beta version of Indigo. In Python, the enumeration will be as easy as:

from indigo import *
indigo = Indigo()

mol = indigo.loadMolecule("CC(C)CC")
mol.foldHydrogens() # if there were hydrogens in the structure
# here you specify minimum (1) and maximum (6) number of edges in the subraphs
for item in mol.iterateEdgeSubmolecules(1, 6):   # SMILES
  print item.index(), item.clone().smiles()
  # atom and bond indices in the original molecule
  print ' ', [atom.index() for atom in item.iterateAtoms()]
  print ' ', [bond.index() for bond in item.iterateBonds()]

The source code is available on GitHub.

Lorenz Blum said...

In the post "Subgraph enumeration" you wrote "People conjecture about how enumeration might suggest new drug-like candidates, but I've not heard of any success along those lines."

Well I do since I am working on it :-). There is the GDB project[1,2] which exhaustively enumerates all possible compounds up to a certain size and then filters them to synthetically "feasible" and "stable" compounds (which leaves only 0.1% back). There are first papers of sucessful virtual screening, synthesis and finding of actives[3,4]. Its a really fascinating project.

[1] Fink, T.; Reymond, J.-L. J Chem Inf Model 2007
[2]Blum, Reymond J. Am. Chem. Soc. 2009
[3] Nguyen, K. T.; Syed, S.; Urwyler, S.; Bertrand, S.; Bertrand, D.; Reymond, J.-L. ChemMedChem 2008

[4] Luethi, E.; Nguyen, K. T.; Buerzle, M.; Blum, L. C.; Suzuki, Y.; Hediger, M.; Reymond, J.-
L. J Med Chem 2010

Anonymous said...

Thank you very much for this! I'm attempting to put together my own chemistry engine in C# and I've been struggling to find guidance of substructure enumeration. This has been very helpful.

Andrew Dalke said...

Thanks Anonymous!