I wrote an essay on why I don't agree with the idea that good peer review requires open source. Why is the right to redistribute, on top of access to the source code, needed for peer review?
My followup question, once you've answered the first, is to explain how I can review source code and not be accused some time in the future of using that source code in my own code, in violation of copyright and license agreement. This can occur even with free or open source licenses, for various reasons.
Tuesday, November 11, 2008
22 comments:
Hi,
there are several issues here.
One is the quality of the results of someone's work. If you can't see the code AND the data, you can't say anything about the results of someone. At least that's my experience in astronomy. The solution is to have public data and the following rights with the code:
1.) the freedom to use the software for any purpose (most commercial licenses don't allow this, and $600-$xxxx is a lot if you have to spend this every time you want to check some results of some paper...)
2.) the freedom to change the software to suit your needs (if you want to run tests with bug-fixes or just want to see the effect of some changes, this is absolutely necessary!)
3.) the freedom to share the software with your colleagues (most scientific work is done in large working groups that span many institutes, so if you want to have a colleague cross-check your test, you need that.)
4.) the freedom to share the changes you make (your colleagues should not only be able to run the test with the original code, but also with your bug-fixes etc.)
Now guess where these 4 rights are coming from... (http://www.gnu.org/licenses/quick-guide-gplv3.html)
The other question is that of copyright, licenses and patents.
Neither commercial nor free or open source licenses will protect you here. This depends so much on where you live, what you do with the code and so on, that nothing will help you here except two things:
1.) you have some modified BSD / public domain code, then you can do whatever you want. Nobody will care.
2.) you have GPL code and redistribute under the same terms. Then all is good as well.
You should really note that copyright and licenses are country dependent.
In Germany, the writer of the code is automatically the copyright holder, and no license can change that. But the writer may not have the right to distribute the code, if he sold the code, for example.
But you have to copy whole parts of the code to violate the copyright. Your clean-room stuff does not change anything here. If your code is not the same, it doesn't matter how you got your knowledge. And if it really is identical to other code, nobody will believe you that you used a clean-room approach...
Note also, that many end-user license agreements have no legal effect in Germany if they are handled in the way similar to that of Windows (e.g. agree after you bought the code, after the installation, etc)
I have one more thing to say here:
The use of a free software license is not only useful for peer review, but is also in the spirit of universities and other publicly funded research institutions, because it helps to gain and spread knowledge. So even if you come to the conclusion that free software is not required for peer review, there are a lot of good reasons why you should only use free software, whenever you can.
The CHARMm example is not a very good one I think, because their license apparently *does* allow redistribution. Not of the whole, but at least of the bits you modified. That sounds pretty open source to me.
A truly closed source license would not allow redistribution of parts of the source code. The CHARMm license apparently does.
So, my counter question would be: how would you be able to show a flaw in the source code, if you would *not* be able to show the code which would need modification?
Now, regarding implementation of similar ideas. Some things can only be implemented in a limited number of ways, such as the center of mass; if the algorithm is very exact in its specification, you'll have a hard time finding a unique implementation. For example, reading an XYZ file.
So, when the implementation is the only algorithmic description you have, this might indeed be tricky. And this would suggest that the paper describing the algorithm is a bad one. This is actually a well-known issue in the opensource community, and if you are unsure, the common approach is to contact the author beforehand.
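To illustrate how little room such a computation leaves for variation, here is a minimal center-of-mass sketch in Python. The function name and argument layout are assumptions made for this example and are not taken from any particular package; two people writing it independently would likely end up with something very close to this.

```python
def center_of_mass(masses, coords):
    """Mass-weighted average position of a set of 3D points.

    masses: sequence of floats; coords: sequence of (x, y, z) tuples.
    Both names are hypothetical, chosen only for this illustration.
    """
    total = sum(masses)
    if total == 0:
        raise ValueError("total mass must be non-zero")
    # Sum m_i * r_i for each axis, then divide by the total mass.
    return tuple(
        sum(m * c[axis] for m, c in zip(masses, coords)) / total
        for axis in range(3)
    )


# Two equal masses on the x axis: the center lies halfway between them.
print(center_of_mass([1.0, 1.0], [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]))
```

Apart from naming and loop style, there is not much to vary here, which is the point: similarity between two implementations of something this tightly specified says little about copying.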
I am not aware of any conviction on copyright grounds because the reimplementation without reference to the original just happened to look much like the original.
Please go beyond publishing issues and fixing bugs.
The reason why I publish my source code under GPL is so that others can... you know... use the result of my research.
It is all nice to publish your algorithms, but we all know that it can take a lot of effort to implement an algorithm. People are not just going to try to reimplement your work because you are a nice person.
Admittedly, I know of few instances where people took my source code and produced another paper, however, I know several students who produced minor thesis using my source code. I also know of several industry engineers who used my source code.
Xubuntix: "If you can't see the code AND the data, you can't say anything about the results of someone."
In the abstract, that's not correct. If I say that I can factor numbers in Mersenne form from 2**243112609-1 through 2**500000000-1 then you can verify that because I can give you the factors. You don't need to see the algorithm to say if the results are correct or not.
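A minimal sketch of that kind of check, in Python. The function name and the small example number are assumptions chosen for illustration; the check only confirms that the claimed factors are nontrivial and multiply back to the original number, and it does not test whether each factor is itself prime.

```python
from math import prod


def verify_factorization(n, factors):
    """Return True if the claimed factors are nontrivial and multiply to n.

    The verifier never sees how the factors were found.
    """
    if not factors:
        return False
    # Reject trivial "factors" such as 1 or n itself.
    if any(f <= 1 or f >= n for f in factors):
        return False
    return prod(factors) == n


# Small illustrative case: 2**11 - 1 = 2047 = 23 * 89
print(verify_factorization(2**11 - 1, [23, 89]))
print(verify_factorization(2**11 - 1, [23, 88]))
```

Checking costs a handful of multiplications no matter how much work went into finding the factors, and none of the factoring code needs to be disclosed.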
Abstract examples are too hand-wavy for me. That's why my talk was all about real-world examples. CHARMm is a real world example. There are benefits to having a non-free license - it's easier for them to get financial support for future CHARMm development.
If people could apply for a hardship license, meaning that they assert that $600 is too much, then would that be acceptable?
Even if there isn't that special case, I again point out that the time needed to learn CHARMm (which is often done by being a student or postdoc in a CHARMm lab) plus the hardware to run CHARMm well, is more than the cost of the license, so the extra barrier to entry isn't that much.
My question is, why are the principles of the FSF a prerequisite to doing good science? I argue that they are not the same.
The details of copyright and licenses in different countries don't really apply to my second question. If I write software which happens to look similar to code I reviewed previously, how can I be sure that my code is safe from accusations of copyright violation? How can I reassure my clients that there will not be a future problem? This applies even if all the software is free, because the licenses can be incompatible.
And the issues of end-user licenses have nothing to do with this topic. My question is based on normal pure copyright law, the same laws which the GPL uses.
As regarding "the spirit of Universities or other publicly funded research institutions", I assert again that I'm not convinced that free software is a prerequisite for doing so, and I point out CHARMm as a counter-example.
".. whenever you can" is very wishy-washy. I like my Mac and it's not free. I could use something based only on a free system, but that would mean I can not do things like listen to multiple sound streams and have working wireless (reasons I gave up on a Linux-based system years ago).
Egon: The CHARMm example is not a very good one I think, because their license apparently *does* allow redistribution
I chose CHARMm because it is a great example. The academic license does not allow you to distribute the source code to people who don't have a CHARMm license, only to those with a CHARMm license.
It is not a free license. It is not an open source license. Not by anyone's definition.
Egon: Not of the whole, but at least of the bits you modified. That sounds pretty open source to me.
I don't follow you. Code you write using OEChem, a proprietary, closed-source toolkit, is written by you and you're free to distribute it any way you want.
Ahh, you think that if you change the CHARMm code then you can redistribute the modified module to anyone? No, I don't think so. Only that part of the code to which you own copyright.
And again, contribution of code covered by the CHARMm copyright can only be distributed to those with a CHARMm license.
Egon: how would you be able to show a flaw in the source code, if you would *not* be able to show the code which would need modification?
That really depends on the flaw, doesn't it? If it's a few lines then by US copyright law that falls squarely into fair use and you publish the few lines.
If it requires hundreds of lines to explain (which seems rather unlikely) then I suspect there are easier ways to show the problem. Plots of expected value over time, compared to computed, along with the CHARMm script for that case. An example input along with a description of the problem and a high-level description of the fault.
I've done a lot of code reviews. Very few problems require dozens of lines in order to explain.
Take a look at the description of the Morris worm by Eugene Spafford. It's a great read. It does a full dissection, including reporting bugs, without including more than a couple of lines of copyrightable code.
Egon: And this would suggest that the paper describing the algorithm is a bad one.
Interesting. So if the paper is good enough then there's no need to include the source, because someone should be able to reimplement it?
I also gave the example where I reviewed a paper for peer review, which was then rejected. Since that code was never "published" (or was it, by my peer review?), then apparently the paper was a bad one. But the same problem of non-clean room development still applies.
Egon: I am not aware of any conviction on copyright grounds because the reimplementation without reference to the original just happened to look much like the original.
That's the reason clean-room development exists. There can't be any conviction here.
The history of copyright suits is rich in detail and examples. A famous one is John Fogerty being sued for making a new song which sounded too much like an old one, where he had sold the copyright of the old one to someone else.
My old PI would mine old papers for text for new papers and grants. I don't think that's common. But since copyright to the old papers was transferred to the publisher, I think that's illegal. Of course, no sane science publisher is going to sue for that case.
Daniel Lemire: The reason why I publish my source code under GPL is so that others can... you know... use the result of my research.
There's a now long-established disagreement between those who favor the GPL for this and those who favor the BSD for this. I'm in the BSD camp. I prefer to contribute to BSD programs over GPL ones, and I'm not the only one. There are others who are the other way.
See http://en.wikipedia.org/wiki/Free_software_license#The_Permissive_versus_Copyleft_controversy for details, in the unlikely case you haven't come across this before. I'm not trying to change anyone's mind on that controversy.
Daniel: I know several students who produced minor thesis using my source code. I also know of several industry engineers who used my source code.
And part of my code is in the C implementation of Python. I'm not against open source. I'm saying that I want better arguments for why doing good science requires the right to redistribute the code and the other rights that free software requires.
It used to be the case and often still is that people distribute their software under an "academic license". This often boils down to "can't make a profit off of it."
How would having that sort of license - which is not free software - prevent good science?
Since I like real world examples, take a look at the nauty license by Brendan McKay. Nauty is a program for "computing automorphism groups of graphs and digraphs" and the algorithm therein is the basis for InChI.
The nauty license says: "Permission is hereby given for use and/or distribution with the exception of sale for profit or application with nontrivial military significance."
Does this prevent doing good science? How so? You've got access to the source, you can modify it, you can redistribute the changes. It's not free software, but does that hinder doing new science?
Well, except science which has strong military benefits.
The free software and open software web sites have strong positions against this sort of license. "No Discrimination Against Persons or Groups" in the Open Source Definition. But what impact does it have on doing good science?
Now I'll be hand-wavy - if he couldn't have those clauses on the code, would he have released the source code up and beyond publishing the algorithm? Would it have been a net detriment to science? That's just something to think about, but it carries no weight. The license that's there is the license that's there, and it isn't going to change.
Interestingly enough, nauty's license led to a clean room implementation of the algorithm found here: http://www.sagemath.org/doc/ref/module-sage.graphs.graph-isom.html
Interesting. So if the paper is good enough then there's no need to include the source, because someone should be able to reimplement it?
More detailed than an actual implementation you cannot get. But if you manage to explain a flaw in an algorithm without showing the full source, you think it would not be possible to describe the algorithm too, without the full source code?
I'm sure you can do peer review without OpenSource (please, do describe what you consider OpenSource...). OpenSource just simplifies things considerably. I've reviewed quite a few papers myself, seldom with any source code included.
I do peer review much of the Bioclipse and CDK commits. I can look at the calculated stuff the algorithms produce (calculated versus expected output), but that does not highlight pitfalls, which there certainly are: hidden assumptions which might not always be true. Surely, your suggested review of proprietary code allows finding those too, and you can even publish flaws in limited ways.
But that's not the point of OpenSource and peer review. And the peer review is also not the only point for OpenSource.
However, OpenSource does simplify peer review considerably. I can peer review, react on things, without having to wait for a license. And having abundant computing power around, $600 is an issue. It's 1/4th ACS conference visit, 1/4th OpenAccess paper. My funds are not so unlimited really, like that of many other PhDs and postdocs.
And in a world as complex as ours, shouldn't we try to keep simple things simple?
Mike: nauty's license led to a clean room implementation
Interesting indeed. I didn't know about that version. Thanks for pointing it out! I do know people who developed a BSD version of a library because they didn't like the implications of GPL. But that's part of the whole, well-worn BSD vs. GPL debate.
My question is, how do these specific non-free licenses hinder science? If you are philosophically inclined towards only free software then that's one thing. But if you argue that "free software is essential to doing good science" then that's another. And I don't believe the latter.
(In addition, the argument is likely that "GPL compatible licenses" are essential, since it's easy to come up with free-but-GPL-incompatible licenses, and I don't think that's what people actually want.)
Egon: please, do describe what you consider OpenSource...
As former abstracts reviewer for the Bioinformatics Open Source Conference I think I'll just do what I did then when people submitted "academic use only" talks, and point you towards the Open Source Definition. It's also what I did in an earlier comment on this post, about nauty.
Egon: But that's not the point of OpenSource and peer review. And the peer review is also not the only point for OpenSource.
And I never made those arguments. I'm saying that "free software is not an essential requirement to doing good software-based science." I like peer review. I like open source. I like access to the source code, and the ability to talk about the results of using that code (I'm looking at you Gaussian!). But there are also advantages to non-free software, even in research.
Free software says that it's morally wrong to have anything other than free software. That may be correct, though I disagree, but if that's the argument then this is a question of ethics in science, and not a question about doing good science.
Egon: And having abundant computing power around, $600 is an issue. It's 1/4th ACS conference visit, 1/4th OpenAccess paper. My funds are not so unlimited really, like that of many other PhDs and postdocs.
Funny enough, I don't submit peer-review publications in part because of the cost. Commercial rates are even higher than academic, and my company of just me makes less than your research group, which is able to afford good compute machines. I have a laptop.
But that's not my argument. I say that there can be advantages even in research software to having non-free software, as in the case of CHARMm where the license funding goes to pay for continued development and support of CHARMm.
Egon: And in a world as complex as ours, shouldn't we try to keep simple things simple?
I personally hate grant writing, and like knowing that I'll be able to afford a place to live. It would be simpler if I could convince people to give me money. ;)
Andrew: "free software is not an essential requirement to doing good software-based science."
Of course not. If you have enough eye balls, enough qualified scientists and developers, and good practices in your proprietary software development team, surely not.
Free software can just be more efficient here. Does not have to be. There are many opensource packages around which are not developed in a Bazaar fashion, and these do not benefit from those efforts.
Check my blog:
http://chem-bla-ics.blogspot.com/2008/11/opendatasourcestandards-is-not-enough.html
I describe there that OpenSource != Open Project. The latter is important for peer review, not the former. If you manage to set up a Bazaar with your proprietary company X, surely that works too. It's just expensive, which is why you need to ask $600 per seat.
Peer review is primarily about the Bazaar model; and a Bazaar model just favours OpenSource models.
Andrew: But there are also advantages to non-free software, even in research.
So, what are the advantages of closed-source software for peer review and/or science?
Andrew: "Free software says that it's morally wrong to have anything other than free software"
Ah, but this depends on the FSF definition of free software. Remember the discussion about free software versus OpenSource! They are not the same! Don't mix that up. BSD is open source and surely has no problem with closed source bits. Same for LGPL, no worries about using that in proprietary code either.
I felt these things to be mixed earlier, which is why I asked to define opensource.
Andrew: I say that there can be advantages even in research software to having non-free software, as in the case of CHARMm where the license funding goes to pay for continued development and support of CHARMm.
The fact that companies are currently not commonly paying for opensource chemoinformatics software (they can! really! Contact me offline to learn about the deals I offer!), does not mean that closed-source software is the only answer. Not sure how well PyMOL is doing, but it seems still alive!
Andrew: It would be simpler if I could convince people to give me money. ;)
I totally agree with that.
Back on peer review. I think it comes down to this: the peer-review argument for OpenSource (not just GPL) is that it makes the process simpler. And please do explain to me if you feel closed-source makes peer review easier. I have no clue how that would work.
Egon: If you have enough eye balls, enough qualified scientists and developers, and good practices in your proprietary software development team, surely not.
Part of my presentation at GCC was to point out that there aren't enough people on the open source projects. Linus's Law requires a big enough population. Money, which can come from licensing, can fund eyeballs.
Egon: Free software can just be more efficient here. Does not have to be.
Yes. My talk was about problems of open source development in computational chemistry software. There's no way I can address the entire world of software development. I have to finish the final version of that text ...
Egon: Check my blog
I did, but I'm not talking about open projects. I'm talking about open source as a necessary requirement for doing science. You said once: "As a scientist, I take the position that any implementation must be open source; that's mere consequence of the scientific requirements for peer review and reproducibility." I disagree, and these are my arguments.
BTW, and this is a tangent, do you consider InChI to be an Open Project? I don't, but it's open source. On the other hand, in the context of large organization standards development, it is open.
Egon: what are the advantages of closed-source software for peer review and/or science?
To turn around your question, please define "closed-source". My examples here have been programs where the source code is accessible, the source code is even redistributable (to at least those who have the license) but the code is not open source.
If you allow CHARMm and nauty as closed-source packages (when they are non-open-source packages) then I've already described advantages. One is independence from fickle funding sources in order to support and continue future development, and to integrate packages from contributors.
And if you want me to defend software where there's no access to the source, wait until my full essay where I explain one of the difficulties that OpenEye had developing OELib through an open source license. Short version: funding problems, lack of contributions, people making local changes that were incompatible to updated versions, and questions about the long-term ability to support people with family and mortgages.
Egon: but this depends on the FSF definition of free software.
Umm, yes? Who else has a definition of free software? I did not anywhere mix the ideas of open source and free software. I used those terms very carefully, knowing full well the nuances and history of each.
Egon: The fact that companies are currently not commonly paying for opensource chemoinformatics software ... does not mean that closed-source software is the only answer.
I'm not arguing for "closed-source" software, but in the context of funding open source software I point out in my essay the comment from Roger Sayle about RasMol: Glaxo admitted that it made large use of RasMol, but had no mechanism for paying for RasMol except giving me money to work on RasMol there.
Egon: Not sure how well PyMOL is doing, but it seems still alive!
I also mention Warren. He's making money, and managed to tweak things to fit into the purchasing model at pharmas. Source is open, but only through version control. You purchase pre-compiled binaries and access to documentation. He's the only one I know making an independent living selling open-source software in this field. He's also one person, and he works hard at it. Compare to OpenEye, which is able to fund 20+ people on their software.
How many additional contributors does Warren have? That is, how many eyeballs can an open source visualization program have? While at OpenEye there are a few who are paid full-time to work on it.
Egon: please do explain to me if you feel closed-source makes peer review easier
Repeating to emphasize: "closed source" is not the opposite of "open source". "Source accessible" and "code escrow" are other solutions which give many of the same benefits (access to the source code, access to the source code in the future) without the requirement for open source.
If you have access to the source but can't redistribute it, is it closed source? If you have access to the source, can redistribute it to anyone, but are prohibited from publishing performance numbers, is it closed source?
Clearly there are many confounding issues. You do not have to agree with me; I'm not trying to convince you of anything.
From your last comments I distill that the reason why you do not like the peer-review argument is practical: not enough people are doing open source development yet, resulting in too few eyeballs; funding open source does not work yet, again resulting in too few eyeballs.
Or, there are practical reasons why peer review does not come out as good as it could be. But I have not seen any theoretical basis why open source does not simplify peer review. There is a lot in between closed source and open source... it's not black and white, and semi-open helps peer review already. That's what you said, at least.
So, when I say OpenSource is needed for good science (not sound science, good science... ethical science, whatever; that is what I find good science, and IMHO there is enough bad cheminformatics), I mean (and I talk about cheminformatics, not computational tools): you need access to the code and the ability to redistribute your own ideas, to simplify scientific reproduction of results and speed up discovery of new knowledge. Sorry for the lack of these details in earlier quotes.
If you do not have rights to redistribute and modify source code, it is not Open Source; that's the definition, not an opinion.
As such, Open Source is not the requirement for peer review, and I never said that. Anything not allowing modification and redistribution is what people typically call proprietary, no?
Please don't pin me on the terminologies.
You mention benefit... open source benefits anyway. Not just those with a license, just anyone. And as such *simplifies* things (repeated emphasis).
Gratis software does not discriminate between companies who can and cannot pay for the license. Every company is equal. Simple.
It seems like your followup question provides an answer to your first question. If the code is released under an open-source license, there isn't really an issue with review and derived works so long as you're comfortable with the license it is under. For example, if the code is released under the BSD license, then you are free to release and distribute derived works after you review it.
Maybe I'm missing something?
(As a side note, as someone in mathematics, it seems crazy to me that people need to pay to have their articles published.)
Discussions on these issues are complex; and easily lead to heated threads.
I have nothing against closed source. If people like that, fine. Read my blog. I also strongly feel that science would significantly benefit if we all would take the opensource route. This does require change. Change in funding practices, change in trust, change in many other things.
I'm sure you can also pin me down on having said I really think we must have world peace. I strongly do; but I do know there are many practical reasons why we do not have world peace yet. But I do feel strongly that science would benefit from world peace.
BTW, very much looking forward to your complete essay!
Hi Andrew,
just a clarification on why I think that open data and open source are needed for good science.
Your example was a mathematical one: If I say that I can factor numbers in Mersenne form from 2**243112609-1 through 2**500000000-1 then you can verify that because I can give you the factors. You don't need to see the algorithm to say if the results are correct or not.
In astronomy, where I do research, publications are not that simple. They typically say that they used 8k seconds of observation with some satellite, with some filters. And then the interesting parts start: they compute the real image based on estimations about the satellite response, PSF and so on. Then they make estimations of the background noise, maybe from Monte Carlo computations of the detector, or from other observations which have some properties. Then they do source extractions (of course with some software) to remove the signal of these sources from the overall image, and they extract a spectrum from the remaining events. Then they fit an absorption model based on estimations of intergalactic H abundances, and in the end they fit absorption lines to the spectra based on models of the star's atmosphere and try to deduce the distance of the star from the redshift of the line.
Now if you think that someone can say something about that result WITHOUT having access to the data, models and sourcecode, then ... I don't know. Maybe then, it's just not worth reading this blogpost any longer...
Hello Andrew! Leaving aside the philosophical arguments (which I wouldn't normally do, since I think they are of critical importance), I would argue that the main practical argument for open source (or Free Software) licensing of software used to produce results for publication would be in the resulting ability of *anyone* to attempt to reproduce those results without arbitrary barriers to doing so, whether these would be financial, contractual, or anything else that would dissuade all but the most determined individuals or organisations.
Your argument seems to hinge on the motivation of people who would want to reproduce or verify results: why would someone outside the field, in the "vanilla" open source community, want to review or improve the code of a molecular dynamics program, especially if they have no special interest or education in that field? The implication is that everyone who would be interested would already be licensing the software, anyway, but as others have pointed out, this need not be the case: research employing the software may be of a certain level of interest to someone, but they may not be compelled to look at it, particularly if there is other work in a potentially long list of publications which can be more readily evaluated. And as disciplines overlap, there are lots of things that outsiders can offer if they are given the opportunity to participate - turning them away isn't likely to bring scrutiny and the corresponding fortuitous advances that would otherwise occur.
From the perspective of a producer of software, results and publications, any required software or data which is not open source (or open access in the case of data) is an inconvenience since it means that any recipient of the work cannot readily obtain all the components without needing to jump through several hoops - those arbitrary barriers - in order to assemble the parts of a working system. Using some component with a strict non-commercial licence? It can't be distributed. Using some data where you have to register to get access? It obviously can't be distributed, either. And some of these software or access licences could quite easily disqualify potential users. Outside the USA? "You can't do this, this or this." This isn't exactly going to encourage others to take the work and build on it, and the consequence can quite often be the perpetuation of a landscape of competing projects whose scope expands very slowly and the groundwork gets repeated over and over again.
Once all these restrictions pile up, you either end up with papers being the main method for communicating knowledge because it's just too awkward to follow the processes described in each paper, or you have to "go through" the people who wrote the paper and collaborate with them, with all the politics that this may well entail. That might have been efficient for certain disciplines at some point, but I think it's hard to argue the case for it now.
Xubuntix: Now if you think that someone can say something about that result WITHOUT having access to the data
My essay isn't about having access to the source code. It's asking why the right to redistribute the source code is essential to doing peer review.
The rest of this thread is tangential. I happen to disagree with your statement: "If you can't see the code AND the data, you can't say anything about the results of someone." My counter-example uses primes because factorization is one of a class of problems which are hard to solve but easy to test. If I say that some huge number can be decomposed into a set of smaller factors, you can easily verify that I'm correct, without seeing my source code.
You then gave an example drawn from astronomy. That some cases are better verified with source access is not the same as saying that you can't say anything about the results without access to the source.
Still, even your example didn't explain why the requirements of open source (right to redistribute, right to charge money) were essential to effective peer review.
(BTW, you do realize that your OS still uses non-free software, right? Or do you use a machine with a free BIOS? Have you done effective peer-review of your CPU? Could you if you wanted to?)
Egon: And having abundant computing power around, $600 is an issue. It's 1/4th ACS conference visit, 1/4th OpenAccess paper. My funds are not so unlimited really, like that of many other PhDs and postdocs.
I've been thinking about this some more and realized there was a conflict between comments like Egon's and the freedoms of free software. The FSF says "we encourage people who redistribute free software to charge as much as they wish or can."
Even if CHARMm were free software, they could still charge the $600 to get a copy of the most recent version. They could even charge $1,000,000 to pull one from the archive, which someone else might want to use to verify an old result.
If I publish something in a paper, and distribute the algorithm under an open source license, and I charge $100,000 for access to the source code (I only need to sell it once to make a profit!), would that be acceptable for the needs of good peer review?
I don't think so.
The requirements of free software are not those of peer review, and one is not a subset of the other.
BTW, thanks for the feedback on this! I'm going to update my essay from GCC to include some of the points I've worked on here, including this one.
Andrew, maybe I missed it, but have you posted your slide deck on open source somewhere on the public Web?