Monday, March 15, 2010

KNIME, Pipeline Pilot, and visual programming languages

I presented a talk at CUP 2010 in Santa Fe on KNIME, Pipeline Pilot and visual programming languages in general, with comparison to other visual programming languages and text-based toolkit approaches. I've broken the talk up into parts, which are published on my technical writings blog.

This is the place to leave comments on those writings. The relevant postings are:

  • KNIME and beginners, where I walk through my problems trying to get heavy atom counts working in KNIME

9 comments:

Egon Willighagen said...

Wrapping CDK functionality in workflow APIs is a boring and time-consuming business. It is very much demand-driven... who would have thought that someone would indeed use a heavy workflow system for just counting the number of heavy atoms?

That said, actually writing such a node is trivial and merely limited by someone sitting down to do it. Can you please point me to the feature request on the KNIME website, so that I can follow up on it?

Andrew Dalke said...

I thought the reason for a "heavy" system was to simplify common tasks. Let the computer do more so people can do less.

Who is the target audience for KNIME? Are they people who can write nodes on their own? What sort of cheminformatics are they interested in? Should they get the same sort of error messages and unexpected behaviors I got? Or am I being too picky?

I chose an example taken directly from the CTR, which in turn was based on things my clients have needed from me. It seems that KNIME (except perhaps with some of the commercial tools) can't handle that task.

I'll gladly be shown otherwise. Perhaps it's as easy as writing a new node using a bit of Jython code?

I've been looking for tutorials on how to do chemistry in KNIME but failed to find them. Perhaps you have pointers? I'll also take a pointer on how to write CDK-specific nodes, as the KNIME page only shows how to do generic nodes, and I have very little Java experience. For example, I would like to add a set of SMARTS matches, where the match count gets turned into a new field for the given molecule.

As for requests on the KNIME website, if it's important enough that you want to follow up on it, wouldn't it also be important enough that you can add a tracker item for it?

Perhaps I'm being obstinate, but there are several dozen properties my clients use during model building which aren't available in the CDK nodes. Simply going to a KNIME forum and asking for support for the number of heavy atoms is pointless without also having the Kier descriptors, vdW surface area terms, some Gaussian descriptors, the graph radius, and a lot more.

Some of these in turn require local resources, like docking targets which aren't public, so it's not like this can simply be pushed onto the CDK/KNIME developers. Instead, I would ask if there was a way to insert project-specific molecular property rules into the system, including dependency management so I don't have to worry about node ordering.
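
To make the request concrete, the following is a minimal sketch of project-specific property rules with automatic dependency resolution. It is illustrative only: `rule`, `PropertyTable`, and the toy molecule representation (a plain list of element symbols) are hypothetical names invented for this example and are not part of KNIME, the CDK, or any other toolkit.

```python
# Hypothetical registry: property name -> (dependency names, compute function).
_RULES = {}


def rule(name, depends=()):
    """Register a property rule; `depends` names the properties it needs."""
    def decorator(func):
        _RULES[name] = (tuple(depends), func)
        return func
    return decorator


class PropertyTable:
    """Computes properties on demand, resolving dependencies recursively."""

    def __init__(self, molecule):
        self.molecule = molecule
        self._cache = {}            # property name -> computed value
        self._in_progress = set()   # names being computed, for cycle detection

    def __getitem__(self, name):
        if name in self._cache:
            return self._cache[name]
        if name not in _RULES:
            raise KeyError(f"no rule for property {name!r}")
        if name in self._in_progress:
            raise ValueError(f"circular dependency involving {name!r}")
        self._in_progress.add(name)
        try:
            depends, func = _RULES[name]
            # Dependencies are looked up through the table itself, so the
            # order in which rules were registered does not matter.
            args = [self[dep] for dep in depends]
            value = func(self.molecule, *args)
        finally:
            self._in_progress.discard(name)
        self._cache[name] = value
        return value


# Toy assumption: a "molecule" is just a list of element symbols.
@rule("num_heavy_atoms", depends=("num_atoms", "num_hydrogens"))
def num_heavy_atoms(mol, n_atoms, n_hydrogens):
    return n_atoms - n_hydrogens


@rule("num_atoms")
def num_atoms(mol):
    return len(mol)


@rule("num_hydrogens")
def num_hydrogens(mol):
    return sum(1 for symbol in mol if symbol == "H")


ethanol = ["C", "C", "O"] + ["H"] * 6
props = PropertyTable(ethanol)
print(props["num_heavy_atoms"])  # 3
```

Note that `num_heavy_atoms` is registered before the rules it depends on. Because each value is resolved only when requested, the caller never has to think about ordering, which is the behavior asked for above. A rule that needs a local resource, such as a private docking target, could be registered in the same way without involving the toolkit developers.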

(More details in a future posting of mine.)

Egon Willighagen said...

"I thought the reason for a 'heavy' system was to simplify common tasks. Let the computer do more so people can do less."

That's indeed the point: there is no point in simplifying tasks that are already simple!

Have a look at the Taverna workflows at MyExperiment.org; these are no longer simple. They do not have just a single transformation (as in your test case), but do multiple transformations, mix data, split it again, and do more transformations.

I am currently not involved in KNIME development, and can very much recommend that you file a feature request in a suitable tracker and enumerate the exact properties you would like to see nodes for.

I have source code for ChemSpider nodes for KNIME available at:

http://github.com/egonw/knime-chemspider

I also noted:

http://nodes4knime.svn.sourceforge.net/viewvc/nodes4knime/

But I had trouble finding the KNIME source code just now...

Andrew Dalke said...

"there is no point in simplifying tasks that are already simple!"

Then we disagree on what simplicity means. I think it's too complicated in CDK, OpenBabel, RDKit and OpenEye to open an SD file and get basic properties.

The closest solution is Cinfony, on top of those toolkits, but even then it doesn't have a decent set of molecular properties, much less an easy way to add new property calculations.

Nor do I see an easy way for inexperienced programmers to add new property calculations to KNIME without a lot more work than that should entail.

Unknown said...

Andrew, are you going to follow up with a post on Pipeline Pilot? I'd be very interested to hear your impressions.

Re: KNIME - while I disagree that "visual languages are evil", I too was not able to get KNIME to do anything useful in the short period of time that I tried. I find Pipeline Pilot immensely useful though.

Andrew Dalke said...

Hi Igor! Yes, I'm going to post something about Pipeline Pilot, although it's going to take a while as both my clients want work from me now, and it's their money which pays for my rock&roll cheminformatics lifestyle.

The entire set of essays will be done by the end of next month, since I'll be presenting the updated version of the talk from CUP at EuroCUP.

Unknown said...

Hi all,

I think there is a bit of a misconception about KNIME here... KNIME does not claim to be a Pipeline Pilot replacement - the two tools are way too different in both their usage paradigm and their functionality. KNIME is an integration platform (for data, open source and legacy tools), and the fact that the (partial) CDK integration makes this feel like it should do all sorts of chemistry is probably misleading. Without nodes from some of the cheminf-vendors I would not see KNIME as a serious player in this area.

Of course, one could and should expose more of the CDK functionality within KNIME - but that's the second difference from PP: KNIME is open source and mainly a result of volunteer work. We just don't have the resources to work on KNIME, the R, BIRT and other integrations, and at the same time make sure all of the CDK functionality is exposed. But we always appreciate contributions from the outside. Some for-profit users do sponsor additional KNIME development, btw...

Lastly, KNIME, like many other graphical and presumably simple tools, has a bit of a learning curve as well - expecting it to solve your problems without looking at any of the manuals or the online documentation is expecting a bit too much. However, the beauty of these systems is that they will allow you to transfer this knowledge to other modules (encapsulating other tools, maybe?) and also communicate to others what you have done in an easy, intuitive way.

Evert said...

Andrew,
You can count heavy atoms using the Indigo nodes for KNIME, which you can install from the community contributions. I'm sure you can figure it out.

Cheers,
Evert