iPhylo: March 2008

Roderic D. M. Page

Thursday, March 20, 2008

Phylowidget

Greg Jordan and Bill Piel have released PhyloWidget, a Java applet for viewing phylogenetic trees. It's very slick, with some nice visual effects courtesy of Processing.
PhyloWidget is open source, with code hosted by Google Code. I'm a C++ luddite, so it took me a few moments to figure out how to build the applet, but it's simple enough: just type

ant PhyloWidget

at the command prompt. I got a couple of warnings about missing .keystore files (something to do with signing the applet), but otherwise things seemed to work.
The applet has a URL API, which makes it easy to view trees. For example, try this link to view the Frost et al. amphibian tree (doi:10.1206/0003-0090(2006)297[0001:TATOL]2.0.CO;2).

Systematics as Cyberscience

Vince Smith alerted me to "Systematics as Cyberscience", by Christine Hine, whose work I've mentioned earlier. Looks like an interesting read. From the publisher's blurb:

The use of information and communication technology in scientific research has been hailed as the means to a new larger-scale, more efficient, and cost-effective science. But although scientists increasingly use computers in their work and institutions have made massive investments in technology, we still have little idea how computing affects the way scientists work and the kind of knowledge they produce. In Systematics as Cyberscience, Christine Hine explores these questions by examining the developing use of information and communication technology in one discipline, systematics (which focuses on the classification and naming of organisms and exploration of evolutionary relationships). Her sociological study of the ways that biologists working in this field have engaged with new technology is an account of how one of the oldest branches of science transformed itself into one of the newest and became a cyberscience.

Monday, March 10, 2008

Google's Social Graph API

Google's Social Graph API was released earlier this year.

The motivation:

With so many websites to join, users must decide where to invest significant time in adding their same connections over and over. For developers, this means it is difficult to build successful web applications that hinge upon a critical mass of users for content and interaction. With the Social Graph API, developers can now utilize public connections their users have already created in other web services. It makes information about public connections between people easily available and useful.

Apart from the obvious application to scientific databases (for example, utilising connections such as co-authorship), imagine the same idea applied to data.

Sunday, March 09, 2008

CrossRef adds more information to OpenURL resolver

Tom Pasley recently drew my attention to CrossRef's addition of an XML format parameter to their OpenURL resolver. Adding &format=xml to the OpenURL request retrieves bibliographic metadata in "unixref" format (for those who like this sort of thing, the XML schema is here). The biggest change is that the metadata now lists more than one author for multi-author papers.

I tend to use JSON for my work now, so a common task is to convert XML data streams into JSON. I've modified my bioGUID OpenURL resolver to make use of the unixref format, which meant I had to write an XSLT file to convert unixref to JSON. If you're interested, you can grab a copy here. It's not pretty, but it seems to work OK.
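For readers who don't use XSLT, here is a rough Python sketch of the same kind of conversion. It is not the XSLT file mentioned above. The sample record is a made-up placeholder, and the element names (journal_article, person_name, given_name, surname) are my assumption about what a unixref-style record looks like. The JSON keys are likewise arbitrary, and real responses may use XML namespaces that this sketch ignores.

```python
import json
import xml.etree.ElementTree as ET

# Placeholder record: the structure is assumed, the values are invented.
SAMPLE = """<doi_records>
  <doi_record>
    <crossref>
      <journal>
        <journal_article>
          <titles><title>Placeholder title</title></titles>
          <contributors>
            <person_name sequence="first" contributor_role="author">
              <given_name>Ann</given_name><surname>Example</surname>
            </person_name>
            <person_name sequence="additional" contributor_role="author">
              <given_name>Bob</given_name><surname>Sample</surname>
            </person_name>
          </contributors>
        </journal_article>
      </journal>
    </crossref>
  </doi_record>
</doi_records>"""


def unixref_to_dict(xml_text):
    """Pull the title and the full author list out of a unixref-like record."""
    root = ET.fromstring(xml_text)
    article = root.find(".//journal_article")
    if article is None:
        return {}
    authors = [
        {
            "forename": person.findtext("given_name", default=""),
            "lastname": person.findtext("surname", default=""),
        }
        for person in article.iter("person_name")
        if person.get("contributor_role") == "author"
    ]
    return {
        "title": article.findtext("titles/title", default=""),
        "authors": authors,
    }


record = unixref_to_dict(SAMPLE)
print(json.dumps(record, indent=2))
```

The point of interest is the author list: every person_name element is kept, not only the first one.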

For some years now I've relied on Marc Liyanage's excellent tool TestXSLT to develop XSLT files. If you have a Mac and work with XSLT, then do yourself a favour and grab a copy of this free tool.

Thursday, March 06, 2008

PageRank for biodiversity

This will probably tempt fate, but I've an invited manuscript in review for Briefings in Bioinformatics on the topic of identifiers in biodiversity informatics. Readers of this blog will find much of it familiar (DOIs, LSIDs, etc.). For fun I constructed a graph for three ant specimens of Probolomyrmex tani, and the images, DNA sequences, and publications that link to these specimens.

Based on this graph I computed the PageRank of each specimen. The motivation for this exercise is that AntWeb lists 43 specimens of this species, in alphabetical order. This is arbitrary. What if we could order them by their "importance"? One way to do this is based on how many times the specimens have been sequenced, photographed, or cited in scientific papers. This gives us a metric for ordering lists of specimens, as well as demonstrating the "value" of a collection (based on people actually using it in their work). I think there is considerable scope for applying PageRank-like ideas to questions in biodiversity informatics. Robert Huber has an intriguing post on TaxonRank that explores this idea further.
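To make the idea concrete, here is a minimal sketch of PageRank by power iteration on a toy graph. The node names and links are hypothetical stand-ins, not the actual Probolomyrmex tani graph. The damping factor of 0.85 and the handling of nodes without outgoing links are conventional choices, not necessarily what I used for the manuscript.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict of node -> list of nodes it links to."""
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    nodes = sorted(nodes)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing links is shared evenly among all nodes.
        dangling = sum(rank[node] for node in nodes if not links.get(node))
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n for node in nodes
        }
        for source, targets in links.items():
            if targets:
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical graph: images, sequences, and a paper link to specimens.
links = {
    "image_1": ["specimen_A"],
    "image_2": ["specimen_A"],
    "sequence_1": ["specimen_A"],
    "sequence_2": ["specimen_B"],
    "paper_1": ["specimen_A", "specimen_B", "sequence_1"],
    "specimen_A": [],
    "specimen_B": [],
    "specimen_C": [],
}

ranks = pagerank(links)
specimens = ["specimen_C", "specimen_B", "specimen_A"]
ordered = sorted(specimens, key=lambda s: ranks[s], reverse=True)
print(ordered)
```

In this toy graph the specimen with the most incoming links from images, sequences, and papers sorts first, and the specimen nobody links to sorts last, which is the ordering we would want in place of an alphabetical list.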

Word for the day - "transclusion"

Stumbled across Project Xanadu, Ted Nelson's vision of the way the web should be (e.g., BACK TO THE FUTURE: Hypertext the Way It Used To Be). Nelson coined the term "transclusion", including one document inside another by reference. The screen shot of Xanadu Space may help illustrate the idea:

Nelson envisages a web where instead of just one-way links, documents include parts of other documents, and one can view a document side-by-side with the source documents. Modern web browsers transclude images (the image file is not "physically" in the document, rather it exists elsewhere), but mostly they link to other documents via hyperlinks.
Ted Nelson's writings are a fascinating read, partly because they remind you just how much of the web we take for granted, and how things could be different (better?). One thing he objects to is that much of the web simulates paper:

Much of the field has imitated paper: in word processing (Microsoft Word and Adobe Acrobat) and the World Wide Web, whose rectangular page layouts become a focal issue. It should be noted that these systems imitate paper under glass, since you can't annotate it.

Nelson also advocates every element of a document having its own unique address, not just at book or article level. This resonates with what is happening with digital libraries. Gregory Crane in his article "What Do You Do with a Million Books?" (doi:10.1045/march2006-crane) notes that:

Most digital libraries still mimic their print predecessors, treating individual objects – commonly chunks of PDF, RTF/Word, or HTML with no standard internal structure – as its constituent units. As digital libraries mature and become better able to extract information (e.g., personal and place names), each word and automatically identifiable chunk of words becomes a discrete object. In a sample 300 volume, 55 million word collection of nineteenth-century American English, automatic named entity identification has added 12,000,000 tags. While this collection focuses on name rich historical materials and includes several reference works, this system already discovers thousands of references to named entities in most book length documents. We thus move from single catalogue entries with a few hundred words to thousands of tagged objects – an increase of at least one order of magnitude with named entities and of at least two orders of magnitude when we consider each individual word as an object.

I discovered Crane's paper via Chris Freeland's post On Name Finding in the BHL. Chris summarises BHL's work on scanning biodiversity literature and extracting taxonomic names. BHL's output is at the level of pages, rather than articles. Existing GUIDs for literature (such as DOIs and SICIs) typically identify articles rather than pages (or page elements), so there's a need to extend these to pages.

Chris also raises the issue of ranking and relevance -- "What do you do with 19,000 pages containing Hymenoptera?". One possibility is to explore Robert Huber's TaxonRank idea (inspired by Google's PageRank). This would require text mining to build synonymy lists from scanned papers, which is challenging but not impossible. But I suspect that the network of citations is what will help build a sensible way to rank the 19,000 pages.

A while ago people were speculating what Google could do to help biodiversity informatics. I found much of this discussion to be vague, with no clear notion of what Google could actually do. What I think Google is exceptionally good at are two things we need to tackle -- text mining, and extracting information from links. I think this is where BHL and, by extension, EOL, should be devoting much of their resources.