Thursday, February 24, 2011
Don't concatenate Strings when calling the LoggingTool
Logging in the CDK can be turned off, and in production use I often do. We can therefore speed things up by concatenating debug message Strings only when debugging is actually turned on. The CDK LoggingTool can take care of this, provided you pass the Strings as individual parameters to the logger call. Instead of:
logger.debug("the current atom is " + atom);
you should write:
logger.debug("the current atom is ", atom);
And this works for any number of Strings and objects, because the logging tool's debug method has this signature: public void debug(Object object, Object... objects). So, you can also just do:
logger.debug("The ", i, "th object is connected to the ", j, "th");
Actually, looking at the debug() method's signature, I am not sure why we still have the first Object separately... mmm.
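To see why the varargs form is cheaper, here is a minimal sketch of a logger that defers the concatenation. This is hypothetical illustration code, not the actual CDK LoggingTool source; the debug() method here returns the built message just so the effect is visible.

```java
// Minimal sketch (not the real CDK LoggingTool) of how varargs defer
// the cost of String concatenation until debugging is enabled.
public class DemoLogger {

    private final boolean debugEnabled;

    public DemoLogger(boolean debugEnabled) {
        this.debugEnabled = debugEnabled;
    }

    /**
     * Builds and "logs" the message; returns it for illustration,
     * or null when debugging is off (no concatenation happens then).
     */
    public String debug(Object object, Object... objects) {
        if (!debugEnabled) return null; // cheap early exit: nothing is built
        StringBuilder message = new StringBuilder().append(object);
        for (Object o : objects) message.append(o);
        System.out.println("DEBUG: " + message);
        return message.toString();
    }
}
```

With `logger.debug("the current atom is " + atom)` the concatenation happens in the caller, whether or not debugging is on; with the varargs form it only happens inside debug() when it is actually needed.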
Most of you have already developed a love/hate relationship with the CDKAtomTypeMatcher. This is the class that complains about atom types not being recognized. And let me stress this one more time: when you get such a message, it means that the CDK cannot add hydrogens, cannot calculate QSAR descriptors, etc, etc. I have had the comment that CDK 1.0 never complained about that. True. It just ignored the fact that it did not know what to do with that atom, and calculated (wrong) properties anyway.
Such a missing atom type warning can have two causes. First, the input data is wrong, for example because formal charges were lost at some point. A typical example is a neutral four-coordinated nitrogen. Second, the input data is correct and the CDK is wrong. This doesn't happen too often, but it does.
The latter is the topic of this post. The CDK can be 'wrong' in two ways: either it doesn't know the atom type (though CDK 1.2 and 1.4 have many more atom types than CDK 1.0 ever had), or the perception algorithm makes a mistake. The latter is not uncommon, particularly for some elements, like nitrogen. The cause here is the input data, or rather, the lack of it: missing bond orders, missing explicit hydrogens. Now, cheminformatics can make educated guesses, and so the perception algorithm does. But it depends on its education.
Giving the CDK the needed education basically looks like this commit. It adds:
- a new atom type to the ontology,
- code to the CDKAtomTypeMatcher for the perception, and
- unit tests to make sure it does the right thing.
The atom type definition in the ontology captures:
- the element (duh),
- the formal charge,
- the number of bound neighbors,
- the number of double bond equivalents (piBondCount),
- the number of lone pairs, and
- the hybridization state (sp3, sp1, etc).
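To illustrate how such properties drive perception, here is a toy matcher. This is hypothetical code, not the CDKAtomTypeMatcher itself; the type names merely resemble CDK's, and the rules are deliberately oversimplified.

```java
// Toy atom type matcher (not CDK code): perception keys on the same
// kinds of properties the ontology records for each atom type.
public class ToyAtomTypeMatcher {

    /**
     * Returns a type name, or null when the atom is not recognized.
     * Returning null instead of guessing is exactly why downstream
     * code (hydrogen adding, QSAR descriptors) refuses to continue.
     */
    public static String perceive(String element, int formalCharge,
                                  int neighborCount, int piBondCount) {
        if ("N".equals(element)) {
            if (formalCharge == 1 && neighborCount == 4 && piBondCount == 0)
                return "N.plus";  // ammonium-like nitrogen
            if (formalCharge == 0 && neighborCount == 3 && piBondCount == 0)
                return "N.sp3";
            if (formalCharge == 0 && neighborCount == 2 && piBondCount == 1)
                return "N.sp2";
        }
        return null; // e.g. a neutral four-coordinated nitrogen ends up here
    }
}
```

Note how the neutral four-coordinated nitrogen from the earlier example falls through to null: the matcher has not been educated about it, so it complains rather than calculating wrong properties.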
Monday, February 21, 2011
In fact, the Open Data definition as outlined in the Panton Principles does not allow for a non-commercial clause, which several LODD data sets are labeled with. Clear copyright and license statements are really important. This applies to source code, but most certainly to data too.
So, early March there will be a virtual, online hack session on making a start with clarifying the licensing and copyrights of the various LODD data sets. CKAN, the OKF's registry of data sets, sounds like a suitable place to organize things. And they support RDF, making it an even nicer match. Or...
Records, data sets, packages, lists, ...
Well, I am running into a number of problems that need to be solved. The first one is, in fact, rather fundamental. A record in the CKAN catalog is typed as a CatalogRecord. Nothing more, nothing less. However, looking at the database content each record is like a data set. The GUI confirms that with links like 'Add a dataset'. Then again, following the link, it is described as a 'data package' as well as a 'dataset'. This is confusing.
Yes, it is! Really! Just look at how people are using it.
Take for example the record for LODD (well, there is a second LODD record and I haven't found a way to delete records). But the LODD data is not a single data set; it is an aggregation (package?) of data sets. This is basically one of the standing issues with Linked Open Data: license incompatibilities, just like with mixing GPL v2 and v3 in source code. As such, these records, which are basically lists of data sets, typically do not have a license listed, as multiple licenses apply.
There seem to be two solutions: the use of groups and of tags. I opted for a new group for the W3C HCLSIG LODD task force.
As said, I could not find a way yet to remove datasets, so cleaning up the catalog is a bit difficult. You can add comments, but I am not sure if these get read. Earlier, I noted that the GNU FDL license was missing, but that is available from the list now, so I have updated the NMRShiftDB record. The combination of Attribution and Share-Alike for the Creative Commons license is missing from the list though, affecting the ChEMBL record.
Another issue that must be addressed in CKAN is how to deal with redistributions. For example, the above-cited ChEMBL data is also available as a SPARQL end point hosted by me, and this end point currently has a different record. Should those records be merged? I guess not, because there is only room for one maintainer. So, perhaps a CKAN catalog record is not even a data set, but a data set provider?
Well, this does make a really nice example of what can go wrong if terms are not well-defined, e.g. using an ontology. People do not know how to fill the database, leading to noisy content, limiting the usefulness of the data. I do hope we get to see definitions for what groups and datasets are in the catalog before our hack session in March.
Thursday, February 17, 2011
It comes down to the fact that the website has way too much information, and uses way too much space for things that do not matter. The interesting content only starts halfway down my screen. I tend to have given up finding the interesting bits by then.
But even within each thread there is room for improvement. There, too, there is abundant whitespace, though that might prove functional when the layout becomes more compact. Also, I would like the photos of the people whom I follow to be removed (something that can only be done with a userscript on FriendFeed too!). Instead, it should be clear where the thread comes from. (Regarding that, I am not sure if any RSS feed you can now list will actually show up in this activity stream yet.)
As a nice spin-off it gives you a star rating for how well you are into the social networks for science:
I got eleven stars, but that should not surprise you.
Monday, February 14, 2011
Last Friday a virus kicked in. I don't know whether it was flu or a cold, but it had already ruined my 5-6 February weekend too. I need a high fever to stay in bed; otherwise I just can't. Worse, when I have an elevated temperature I cannot think clearly anymore, and I haven't been able to think clearly since about last Wednesday, so I should have seen it coming. Nor can I think clearly today, leading to incomprehensible questions on BioStar, and a general lack of progress in writing and other important things, like planning.
Sunday, February 13, 2011
Well, and I got them wrong. A few things had to happen. First of all, I had to revert the JNI-InChI library to a plugin, as I have no experience with making OSGi bundles out of random jars (there is this bnd tool which exists for that task, but I have never used it before). But the upgrade brought in version 0.7, which should solve the InChI library loading issue, as well as bring InChI to more platforms. So, the JNI-InChI jar was pulled from the target platform, and the rebuild worked out of the box.
The second thing I did was upgrade the org.openscience.cdk bundles. This repository had three branches: one for pure CDK, now updated to CDK 1.3.8, a CDK-JChemPaint branch updated to version 17, and a bioclipse2.4 branch for Bioclipse 2.x (we now and then patch things to make it work, so that we do not have to wait for the patches to be applied upstream, in the CDK itself). I thought it would be good to create a separate bioclipse2.6 branch for Bioclipse master, but that was a mistake. The branch information is scattered over the .rmap and .cquery files, which are not all in one single git repository, and those that are, are not synchronized with the versions on Hudson either. That's normal in prototyping, but it wasn't helping me. After a couple of hours fiddling, I got this one compiling too.
Right now, I'm stuck with getting bioclipse.cheminformatics compiling again, and looking for ERRORs in an enormous log file. If only ERROR messages were a bit more descriptive, something like "Hej, you said to look for org.openscience.cdk.sinchi in file foo.bar, but I cannot find it anywhere.", then I'd have some clue where to look.
That makes me wonder if anyone actually ever wrote up best practices for error messages...
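The wished-for message above already follows what I would consider the key best practice: say what was being looked for, where it was expected, and what the consequence is. A trivial sketch (hypothetical helper, nothing Eclipse- or Buckminster-specific):

```java
// Sketch of a descriptive error message: name what was expected,
// where it was expected, and what the consequence is.
public class DescriptiveError {

    public static String missingDependency(String bundle, String file) {
        return "ERROR: '" + file + "' declares a dependency on '" + bundle
             + "', but that bundle was not found in the target platform.";
    }
}
```

A message like this turns an enormous, opaque build log into something you can act on, because it points directly at the file and the symbolic name to check.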
BTW, I also still have another, independent dependency issue to solve: the one for Google's Guava, so that everyone can play with the Bioclipse-Google interaction.
Wednesday, February 09, 2011
If we are in the field of property prediction, knowing what tautomer to calculate descriptors for is crucial. Not that we actually have easy access to experimental data showing what the important tautomer is for our end-point (predicted property), but at least we can track what tautomer we modeled with. Has anyone ever asked you to add units to experimental values? Like "the temperature was 279 degrees; Celsius or Kelvin??" Well, this is the exact same thing. If your QSAR model training report does not include that information, you are doing it wrong. (/me ducks)
So, why does it in fact matter? It matters simply because calculated properties are different. Backing up to the ChemSpider example in my question about InChIs with the fixed-hydrogen layer, I noted that (as in many other databases) the synonyms seem to include IUPAC names for at least two tautomers. However, while the ChemSpider record is, in fact, for the tautomer-independent structure (using the InChI mobile hydrogen layer; and keep in mind that the InChI uses only a limited set of heuristic rules for identifying tautomers, so it does not detect all 40 tautomers of warfarin), the 2D diagram, the 3D model, and the calculated properties reflect only one tautomer.
And calculated properties are exactly the input of QSAR's statistical modeling. It is interesting to realize that the differences in calculated molecular descriptors can vary minimally or not at all, but also drastically. Very drastically, in fact. The recent paper by Porter (doi:10.1007/s10822-010-9335-7) shows the 40 warfarin tautomers, and discusses a few properties, such as the pKa. The experimental pKa of warfarin is around 5. Now, the paper reports calculated pKa values for a variety of software products (AMBIT is unfortunately missing). First of all, it shows that the various tools differ, which is to be expected. But that variance is negligible compared to the effect of picking the wrong tautomer. I was impressed by the range of predicted values for the various tautomers: it ranged from about 5 to 12, across all tools. That means warfarin is predicted to be anywhere from mildly acidic (some tools predict pKa's down to 2.5) to very basic! No way your statistical modeling will understand that!
And this is why Open Data is so important in chemistry. So, the next time Joe (Organic) Chemist bitches about computers and cheminformatics, tell him it is his own fault: he should have released his data out in the Open.
Anyway. Tautomerism was a curation issue in the first(!!!) entry I was curating. The sixth had a more well-known problem, I think. I may be blind, but I would say this drug has a stereocenter:
But none of the databases I checked so far (including ChemSpider) defines the stereochemistry! I thought we settled that some decades ago? Stereochemistry of drugs matter. What is going on here? I guess I have to browse some primary literature and access some experimental data today then. If I can afford it.
Porter, W. (2010). Warfarin: history, tautomerism and activity. Journal of Computer-Aided Molecular Design, 24(6-7), 553-573. DOI: 10.1007/s10822-010-9335-7
Sunday, February 06, 2011
Some projects are never finished. Neither is this one, but it is never too late to change how things work. So, taking advantage of publishing-on-demand, here I introduce the release-soon, release-often equivalent for cheminformatics books, my Groovy Cheminformatics with the Chemistry Development Kit book:
With a serious discount for just being the first edition (1.3.8-0), but already counting 72 pages with 75 code examples, this edition marks a personal milestone (and probably not much more than that). There remains much to do, but I promised a release by tomorrow, so here it is. Next releases will contain more code examples, more functionality descriptions, and more reviewing of literature where such code is used in science. The plan is to make new editions with each new CDK release, as well as when I add a new chapter, section, or even just a paragraph. But there will not be a nightly build service anytime soon.
The current table of contents is as follows:
Now, the book content is not open content. However, it contains nothing that is not available by other means. It's just the compilation that makes this book interesting, as well as the effort I put into ensuring the code examples keep working. For that, I ask a minor financial contribution.
Wednesday, February 02, 2011
About 1.5 years ago I started a FriendFeed room, Open Chemical Data, where RSS feeds of new chemical data are aggregated. It started with the RSS feed from the NMRShiftDB, and later ChemPedia Substances was added (which is expected to go EOL). I was informed, however, about Dryad. It is a website that allows deposition of data published in journals, as CC0. That latter bit is amazing! And when they announced a new Twitter feed, they also noted they had an RSS feed of new data sets. I asked about feeds specific for chemistry, which prompted Ryan to set up this Yahoo Pipe.
So, Dryad is now part of the Open Chemical Data room (second item):
book.bib:
	@echo "" > book.bib
	@wget -O - http://www.citeulike.org/bibtex/group/14384 >> book.bib
BTW, does anyone have experience with the new biblatex?
Tuesday, February 01, 2011
Arvid has done an excellent job in setting up Hudson for Bioclipse, which is not as trivial as for Maven projects. With the R functionality I blogged about earlier today moved to the bioclipse.statistics repository, things are automatically built when new patches hit the repository. And Hudson doesn't just compile the code, it also exports binaries. The main Bioclipse product builds can be downloaded here (Windows, Linux, OS/X; if you need more, please let us know). And all the additional features get automatically uploaded to individual update sites, very much like Maven projects getting pushed to a Maven repository.
So, here's a Bioclipse-R install guide for the adventurous (i.e. Christian, Rajarshi and Rebecca, perhaps Juuso :). I will describe it from a Debian GNU/Linux perspective.
First, make sure you have R and rJava installed. I strongly recommend the versions from sid/unstable (2.12 and 0.8-8), which are more stable here than those from testing:
$ apt-get install r-base-core/sid r-cran-rjava/sid

Then you download the nightly build of Bioclipse 2.5.x from Hudson, e.g. for my 32bit Linux:

$ cd /tmp
$ wget http://pele.farmbio.uu.se/hudson/job/Bioclipse.product/lastSuccessfulBuild/artifact/tmp/bioclipse/net.bioclipse.platform_feature_2.4.0-eclipse.feature/Bioclipse.2.5.0.linux.gtk.x86.zip
$ unzip Bioclipse.2.5.0.linux.gtk.x86.zip
$ cd Bioclipse.2.5.0.linux.gtk.x86/

But before we boot Bioclipse, we first need to tell it where to find the R instance, and we need to set two variables for that: R_HOME and java.library.path. The latter is set via the bioclipse.ini file, by adding this line to the end:

-Djava.library.path=/usr/lib/R/site-library/rJava/jri

The other is an environment variable and can be set with:

$ export R_HOME=/usr/lib/R

Then Bioclipse can be started:

$ ./bioclipse

The first thing to do is install the R Feature. To do this open the Install New Software... dialog from the Help menu, and add (with the Add button) the Bioclipse Statistics update site on Hudson at this Location (copy/paste the Location URL into the below dialog):
After the install is completed, which will require a reboot of Bioclipse (why?!?!), you can open the R console by opening a dialog from the Window -> Show View -> Other... menu:
The R Console will then show up as view, as depicted in my earlier post.
Loading of libraries, like running library("rcdk") or library("pls"), crashes the whole thing if you use R and rJava from Debian testing, and silently fails with those from unstable. It's on my agenda.
One rarely known feature of Bioclipse is that you can use it to run R code. This functionality has been around for a while in experimental form, and was recently updated by Carl to use rJava. In the next months we will improve passing data around between Bioclipse and the R session, providing a richer chemometrics experience. And, with a bit of luck, it will be part of the upcoming 2.6 release.