Please read “It’s time to reboot bioinformatics education”

Author: nsaunders

I guess I’ve been around bioinformatics for the best part of 15 years. In that time, I’ve seen almost no improvement in the way biologists handle and use data. If anything I’ve seen a decline, perhaps because the data have become larger and more complex with no improvement in the skills base.

It strikes me when I read questions at Biostars that the problem faced by many students and researchers is deeper than "not knowing what to do." It's having no idea how to figure out what they need to know in order to do what they want to do. In essence, this is about how to get people into a problem-solving mindset, so that they're aware, for example, that:

  • it’s extremely unlikely that you are the first person to encounter this problem
  • it’s likely that the solution is documented somewhere
  • effective search will lead you to a solution even if you don’t fully understand it at first
  • the tool(s) that you know are not necessarily the right ones for the job (and Excel is never the right tool for the job)
  • implementing the solution may require that you (shudder) learn new skills
  • time spent on those skills now is almost certainly time saved later because…
  • …with a very little self-education in programming, tasks that took hours or days can be automated and take seconds or minutes

It's good (and bad) to know that these issues are not confined to Australian researchers: here is "It's time to reboot bioinformatics education" by Todd Harris. It is excellent and you should go and read it as soon as possible.

Filed under: bioblogs, bioinformatics, blogroll, education

PubMed retraction reporting update

Author: nsaunders

Just a quick update to the previous post. At the helpful suggestion of Steve Royle, I’ve added a new section to the report which attempts to normalise retractions by journal. So for example, J. Biol. Chem. has (as of now) 94 retracted articles and in total 170 842 publications indexed in PubMed. That becomes (100 000 / 170 842) * 94 = 55.022 retractions per 100 000 articles.
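
As a quick check of that arithmetic in R, using the figures quoted above:

# retractions per 100 000 indexed publications for one journal,
# using the J. Biol. Chem. numbers above
retracted <- 94
total     <- 170842
(100000 / total) * retracted
# [1] 55.0216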

Top 20 journals, retracted articles per 100 000 publications

This leads to some startling changes to the journals' "top 20" list. If you're wondering what's going on in the world of anaesthesiology, look no further (thanks again to Steve for the reminder).
Filed under: R, statistics Tagged: pmretract, pubmed, retraction, rmarkdown

PMRetract: PubMed retraction reporting rewritten as an interactive RMarkdown document

Author: nsaunders

Back in 2010, I wrote a web application called PMRetract to monitor retraction notices in the PubMed database. It was written primarily as a way for me to explore some technologies: the Ruby web framework Sinatra, MongoDB (hosted at MongoHQ, now Compose) and Heroku, where the app was hosted.

I automated the update process using Rake and the whole thing ran pretty smoothly, in a “set and forget” kind of way for four years or so. However, the first era of PMRetract is over. Heroku have shut down git pushes to their “Bamboo Stack” – which runs applications using Ruby version 1.8.7 – and will shut down the stack on June 16 2015. Currently, I don’t have the time either to update my code for a newer Ruby version or to figure out the (frankly, near-unintelligible) instructions for migration to the newer Cedar stack.

So I figured now was a good time to learn some new skills, deal with a few issues and relaunch PMRetract as something easier to maintain and more portable. Here it is. As all the code is "out there" for viewing, I'll just add a few notes here regarding this latest incarnation.

  1. Writing in RMarkdown has several advantages:
    • There are the usual advantages of literate documents – seeing the code together with the results, reproducibility.
    • Parsing PubMed XML files directly using R is an easier, more “lightweight” process than storage, retrieval and visualisation via a dedicated database.
    • The output is a single HTML file which is easy to distribute or host: for example here at Github and here, published to Rpubs using RStudio. Grab it yourself, use it however you like.
  2. There are a couple of slow procedures (several minutes) that are better run from separate R scripts than from the RMarkdown document, for debugging purposes. These are (a) downloading PubMed XML and (b) retrieving total articles per year across five decades. Those scripts are here at Github. The RMarkdown document then reads their output.
  3. This project allowed me to explore the rCharts package. I had long wondered why, given the excellent plotting capabilities of R, anyone would want to provide a wrapper to javascript plotting libraries. The answer of course is that with tools such as RMarkdown, we can generate documents in HTML format where interactive javascript shines (there's a short sketch of this after the list).
  4. Highcharts is still my library of choice. I know the cool kids use D3 but (a) I know Highcharts better and (b) I find the transformation between data and its graphical representation most intuitive in Highcharts. That’s just how my brain works, not a reflection of the other libraries.
  5. The publishing procedure is not quite so fully-automated as it was using Rake; this shell script is my best attempt so far. However, it’s easy enough to compile and publish the document using RStudio whenever the notification feed updates.
  6. A couple of enhancements:
    • The clunky, confusing zoomable timeline showing retractions on specific dates has been replaced by a non-zoomable version showing retraction counts per year.
    • There’s always been some confusion as to whether we’re looking at data for retracted articles or their associated retraction notices – so now both types of data are shown, in separate clearly-labelled and coloured plots.
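
As a small taste of points 3 and 4, here is a minimal rCharts/Highcharts sketch of the kind of chunk used in the document. It is not the PMRetract code itself, the data frame is invented for illustration, and it assumes the usual rCharts conventions for embedding a chart in knitted HTML.

# an interactive Highcharts column chart from an RMarkdown chunk, via rCharts
# (run in a chunk with results='asis'; the data are illustrative only)
library(rCharts)

counts <- data.frame(year  = 2010:2014,
                     total = c(180, 250, 310, 340, 400))

h <- hPlot(x = "year", y = "total", data = counts, type = "column")
h$title(text = "Retraction notices per year (illustrative data)")
h$show("inline", include_assets = TRUE, cdn = TRUE)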

That’s it, more or less. Enjoy and let me know what you think.

Filed under: programming, R, statistics, web resources Tagged: pmretract, pubmed, retraction, rmarkdown, rpubs

Just how many retracted articles are there in PubMed anyway?

Author: nsaunders

I am forever returning to PubMed data, downloaded as XML, trying to extract information from it and becoming deeply confused in the process.

Take the seemingly-simple question “how many retracted articles are there in PubMed?”

Well, one way is to search for records with the publication type "Retracted Publication". As of right now, that returns a count of 3550.

library(rentrez)

# how many records have publication type "Retracted Publication"?
retracted <- entrez_search("pubmed", '"Retracted Publication"[PTYP]')
retracted$count
[1] "3550"

Another starting point is retraction notices – the publications which announce retractions. We search for those using the type “Retraction of Publication”.

# and how many retraction notices ("Retraction of Publication")?
retractions <- entrez_search("pubmed", '"Retraction of Publication"[PTYP]')
retractions$count
[1] "3769"

So there are more retraction notices than retracted articles. Furthermore, a single retraction notice can refer to more than one retracted article. If we download all retraction notices as PubMed XML (file retractionOf.xml), we see that the retracted articles referred to by a retraction notice are stored under the node named CommentsCorrectionsList:

        <CommentsCorrectionsList>
            <CommentsCorrections RefType="RetractionOf">
                <RefSource>Ochalski ME, Shuttleworth JJ, Chu T, Orwig KE. Fertil Steril. 2011 Feb;95(2):819-22</RefSource>
                <PMID Version="1">20889152</PMID>
            </CommentsCorrections>
        </CommentsCorrectionsList>
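
(The download step for retractionOf.xml is not shown in this post; one way to do it with rentrez might look like the sketch below, where the retmax value and the single-request fetch are assumptions about the size of the result set.)

# fetch all retraction notices as one PubMed XML document and save to file
notices <- entrez_search("pubmed", '"Retraction of Publication"[PTYP]', retmax = 5000)
xml.retOf <- entrez_fetch(db = "pubmed", id = notices$ids, rettype = "xml")
writeLines(xml.retOf, "retractionOf.xml")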

There are retraction notices without a CommentsCorrectionsList. Where it is present, there are CommentsCorrections without PMID but always (I think) with RefSource. So we can count up the retracted articles referred to by retraction notices like this:

library(XML)

doc.retOf <- xmlTreeParse("retractionOf.xml", useInternalNodes = TRUE)
ns.retOf <- getNodeSet(doc.retOf, "//MedlineCitation")
sources.retOf <- lapply(ns.retOf, function(x) xpathSApply(x, ".//CommentsCorrections[@RefType='RetractionOf']/RefSource", xmlValue))

# count RefSource per retraction notice - first 10
head(sapply(sources.retOf, length), 10)
# [1] 0 1 1 1 1 1 1 1 1 1

# total RefSource
sum(sapply(sources.retOf, length))
# [1] 3898

It appears then that retraction notices refer to 3 898 articles, but only 3 550 of type "Retracted Publication" are currently indexed in PubMed. Next question: of the PMIDs for retracted articles linked to from retraction notices, how many match up to the PMID list found in the downloaded PubMed XML file for all records of type "Retracted Publication" (retracted.xml)?

# "retracted publication"
doc.retd <- xmlTreeParse("retracted.xml", useInternalNodes = TRUE)
pmid.retd <- xpathSApply(doc.retd, "//MedlineCitation/PMID", xmlValue)
# "retraction of publication"
pmid.retOf <- lapply(ns.retOf, function(x) xpathSApply(x, ".//CommentsCorrections[@RefType='RetractionOf']/PMID", xmlValue))

# count PMIDs linked to from retraction notice
sum(sapply(pmid.retOf, length))
# [1] 3524

# and how many correspond with "retracted article"
length(which(unlist(pmid.retOf) %in% pmid.retd))
# [1] 3524

So there are, apparently, 26 (3550 – 3524) retracted articles that have a PMID, but that PMID is not referred to in a retraction notice.
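
Using the objects defined above, that difference can be checked directly:

# PMIDs of type "Retracted Publication" never referenced by a retraction notice
length(setdiff(pmid.retd, unlist(pmid.retOf)))
# should give 26, provided the 3 524 matched PMIDs are all distinct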

In summary
It's like the old "how long is a piece of string", isn't it? To summarise, as of this moment:

  • PubMed contains 3 769 retraction notices
  • Those notices reference 3 898 sources, of which 3 524 have PMIDs
  • A further 26 retracted articles have a PMID not referenced by a retraction notice

What do we make of the (3898 – 3550) = 348 articles referenced by a retraction notice, but not indexed by PubMed? Could they be in journals that were not indexed when the article was published, but indexing began prior to publication of the retraction notice?

You can see from all this that linking retraction notices with the associated retracted articles is not easy. And if you want to do interesting analyses such as time to retraction – well, don’t even get me started on PubMed dates…

Filed under: bioinformatics, programming, publications, R, statistics Tagged: ncbi, pubmed, retraction

From PMID to BibTeX via BioRuby

Author: nsaunders

Chris writes:

Nothing like searching for an answer (PMIDs->Bibtex) and finding someone else pointing back to your own solution! http://t.co/ZOm0cK6o0d

— Chris Miller (@chrisamiller) March 17, 2015

The blog post in question concerns conversion of PubMed PMIDs to BibTeX citations. However, a few things have changed since 2010.

Here’s what currently works.

Filed under: bioinformatics, programming, ruby Tagged: bioruby, eutils, pubmed

Note to journals: “methodologically sound” applies to figures too

Author: nsaunders

PeerJ, like PLoS ONE, aims to publish work on the basis of "soundness" (scientific and methodological) as opposed to subjective notions of impact, interest or significance. I'd argue that effective, appropriate data visualisation is a good measure of methodological soundness. I'd also argue that, on that basis, "Evolution of a research field – a micro (RNA) example" fails the soundness test.

Figure 1: miRNA publications per year

Let's start with Figure 1. The divisions on the x-axis are equally spaced, but the years they represent are not – 1993, 1996, 1997, for example. Even worse is the attempt to illustrate a rapid increase after 2004 using broken bars and a second y-axis. It's confusing and messy.

@neilfws @thePeerJ someone needs to learn about log scales…

— Chris Cole (@drchriscole) March 17, 2015
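
For what it's worth, below is a minimal sketch of the kind of chart being suggested: a single continuous x-axis and a log-scaled y-axis, rather than broken bars and a second y-axis. The counts are invented for illustration, not taken from the paper.

# publications per year on a log scale, using ggplot2 (illustrative data only)
library(ggplot2)

pubs <- data.frame(year  = c(1993, 1996, 1997, 2000, 2004, 2008, 2012),
                   count = c(1, 2, 3, 10, 150, 2500, 12000))

ggplot(pubs, aes(x = year, y = count)) +
  geom_line() +
  geom_point() +
  scale_y_log10() +
  labs(x = "year", y = "publications (log scale)")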

Figure 2: language of publication

Some of these crimes are repeated in Figure 2, which also introduces an ugly shading scheme to distinguish languages. When you look at it, do you think "black…aha, black = English"? No, you do not. There's no need for different shading or colour here (it's not even visible for two of the bars); the categories are already identified by the x-axis labels.

Figure 3 repeats the shading crime and Table 1 is somewhat superfluous, as it contains much of the same data. Several more tables follow, containing data which might be better presented as charts.

Figure 4: all the previous horrors

Figure 4 combines all the previous horrors into 3 panels. We could go on, but let's not. You can see the rest for yourself; the article is open access.

Publication on the basis of “soundness” need not mean sacrifices in quality. Ideally, someone at some stage in the process – a mentor before submission, a reviewer, an editor – should notice when figures are not produced to an appropriate standard and suggest improvements. I see a lot of failures like this one in the literature and the causes run right through the science career timeline. It starts with poor student training and ends with reviewers and editors who don’t know how to assess the quality of data analysis/visualisation.

It’s easy to blame “peer review lite”, but there are deeper, systemic issues of grave concern here.

Filed under: publications, statistics Tagged: peer review, peerj, quality, visualisation

Some brief thoughts on the end of FriendFeed

Author: nsaunders

There was a time, around 2009 or so, when almost every post at this blog was tagged “friendfeed”. So with the announcement (which frankly I expected 5 years ago) that it is to be shut down, I guess a few words are in order.

I’m thankful to FriendFeed for facilitating many of my current online friendships. It was uniquely successful in creating communities composed of people with an interest in how to do science online, not just talk about (i.e. communicate) science online. It was justly famous for bringing together research scientists with other communities: librarians in particular, people from the “tech world”, patient advocates, educators – all under the umbrella of a common interest in “open science”. We even got a publication or two out of it.

To this day I am not sure why it worked so well. One key feature was that it allowed people to coalesce around pieces of information. In contrast to other networks it was the information, presented via a sparse, functional interface, that initially brought people together, as opposed to the user profile. There was probably also a strong element of “right people in the right place at the right time.”

It’s touching that people are name-checking me on Twitter regarding the news of the shutdown, given that no trace of my FriendFeed activity remains online. Realising that my activity was getting more and more difficult to retrieve for archiving and that bugs were never going to be fixed, I opted several years ago to delete my account. The loss of my content pains me to this day, but inaccurate public representation of my activities due to poor technical implementation pains me more.

I’ve seen a few reactions along the lines of “what is all the fuss about.” How short is our collective memory. To those people: look at Facebook, Yammer or even Twitter and ask yourself where the idea of a stream of items with associated discussion came from.

Farewell then FriendFeed, pioneer tool of the online open science community. We never did find a tool quite as good as you.

Filed under: networking, open science Tagged: friendfeed

Make prettier documents by reusing chunks in RMarkdown

Author: nsaunders

No revelations here, just a little R tip for generating more readable documents.

Original with lots of code at the top

There are times when I want to show code in a document, but I don’t want it to be the first thing that people see. What I want to see first is the output from that code. In this silly example, I want the reader to focus their attention on the result of myFunction(), which is 49.

---
title: "Testing chunk reuse"
author: "Neil Saunders"
date: "24/02/2015"
output: html_document
---

## Introduction
Here is my very interesting document. But first, let me show you my long and ugly R function.

```{r chunk1}
# it's not really long and ugly
# it just squares the input
# but imagine that it is long and ugly

myFunction <- function(x) {
  print(x ^ 2)
}

myFunction(7)
```

Function use before definition = error

I could define myFunction() later in the document but of course that leads to an error when the function is called before it has been defined.

---
title: "Testing chunk reuse"
author: "Neil Saunders"
date: "24/02/2015"
output: html_document
---

## Introduction
Here is my very interesting document.

```{r chunk1}
myFunction(7)
```

## This is chunk 2
My long and ugly R function is now down here.

```{r chunk2}
# it's not really long and ugly
# it just squares the input
# but imagine that it is long and ugly

myFunction <- function(x) {
  print(x ^ 2)
}
```

Solution: use the chunk option ref.label so that chunk 1 runs the code from chunk 2, and add echo=FALSE so that code is hidden at the top of the document. The reader still sees the function definition where chunk 2 is written, and the output where the function is called.

---
title: "Testing chunk reuse"
author: "Neil Saunders"
date: "24/02/2015"
output: html_document
---

## Introduction
Here is my very interesting document.

Chunk 1 is calling chunk 2 here, but you can't see it.
```{r chunk1, ref.label="chunk2", echo=FALSE}
```

## This chunk is unnamed but can now use code from chunk 2
```{r}
myFunction(7)
```

## This is chunk 2
My long and ugly R function is now down here.

```{r chunk2}
# it's not really long and ugly
# it just squares the input
# but imagine that it is long and ugly

myFunction <- function(x) {
  print(x ^ 2)
}
```

The result of calling chunk2 from chunk1

And here’s the result.
Filed under: programming, R, statistics Tagged: how to, knitr, rmarkdown, rstats

Academic Karma: a case study in how not to use open data

Author: nsaunders

Update: in response to my feedback, auto-generated profiles without accounts are no longer displayed at Academic Karma. Well done and thanks to them for the rapid response.

A news story in Nature last year caused considerable mirth and consternation in my social networks by claiming that ResearchGate, a “Facebook for scientists”, is widely-used and visited by scientists. Since this is true of nobody that we know, we can only assume that there is a whole “other” sub-network of scientists defined by both usage of ResearchGate and willingness to take Nature surveys seriously.

You might be forgiven, however, for assuming that I have a profile at ResearchGate because here it is. Except: it is not. That page was generated automatically by ResearchGate, using what they could glean about me from bits of public data on the Web. Since they have only discovered about one-third of my professional publications, it’s a gross misrepresentation of my achievements and activity. I could claim the profile, log in and improve the data, but I don’t want to expose myself and everyone I know to marketing spam until the end of time.

One issue with providing open data about yourself online is that you can’t predict how it might be used. Which brings me to Academic Karma.

Academic Karma came to my attention on Twitter via Chris Gunter.

Tipped off to an AcademicKarma profile I did not set up for myself. Looks like years of reviewing/editing mean zilch. http://t.co/45DrnoDvlZ

— Chris Gunter (@girlscientist) February 11, 2015

To which they replied:

@girlscientist Everyone with an @ORCID has an Academic Karma profile, think of it as an Academic directory.

— Academic Karma (@AcademicKarma) February 12, 2015

Everyone with an ORCID? I have one of those. Sure enough, appending my ORCID ID to their URL reveals that I have a profile.

You’ll note that my profile states “no review information shared” and that the data are sourced from ORCID. These are recent changes, brought about by one of my less polite tweets.

and if I don’t want one? More ResearchGate-style bullshit. MT @AcademicKarma Everyone with an ORCID has an Academic Karma profile

— Neil Saunders (@neilfws) February 18, 2015

Karma, apparently, according to someone

Previously, profiles looked like the one shown in the image above. In my case, as I have not included any reviewing or editorial activity in my ORCID profile, this resulted in a large, prominent "NA" for so-called "karma earnt". This gave the misleading impression that I am a bad "corporate citizen".

To their credit, the people behind Academic Karma made changes to profile views very quickly, based on my feedback. That said, they seemed genuinely bemused by my criticism at times.

@neilfws @AcademicKarma Wow! Genuinely trying to improve peer review here Neil. Value your feedback though on what we could do differently.

— Lachlan Coin (@lachlancoin) February 18, 2015

@neilfws @AcademicKarma @ORCID_Org designed for data re-use, what are we misrepresenting?

— Lachlan Coin (@lachlancoin) February 18, 2015

So let me try to spell it out as best I can.

  1. I object to the automated generation of public profiles, without my knowledge or consent, which could be construed as having been generated by me
  2. I especially object when those profiles convey an impression of my character, such as “someone who publishes but does not review”, based on incomplete and misleading data

I’m sure that the Academic Karma team mean well and believe that what they’re doing can improve the research process. However, it seems to me that this is a classic case of enthusiasm for technological solutions without due consideration of the human and social aspects.

Filed under: networking Tagged: academic karma, orcid, researchgate, social networking

Presentations online for Bioinformatics FOAM 2015

Author: nsaunders

Off to Melbourne tomorrow for perhaps my favourite annual work event: the Bioinformatics FOAM (Focus on Analytical Methods) meeting, organised by CSIRO.

Unfortunately, though for good reasons, it's an internal event this year, but I'm putting my presentations online. I'll be speaking twice; the first, on Thursday, is called "Online bioinformatics forums: why do we keep asking the same questions?" It's an informal, subjective survey of the questions that come up again and again at bioinformatics Q&A forums such as Biostars, and my attempt to understand why this is the case. Of course, one simple answer might be selection bias – we don't observe the users who came, found that their question already had an answer and so did not ask it again. I'll also try to articulate my concern that many people view bioinformatics as a collection of recipe-style solutions to specific tasks, rather than a philosophy of how to do biological data analysis.

My second talk on Friday is called “Should I be dead? a very personal genomics.” It’s a more practical talk, outlining how I converted my own 23andMe raw data to VCF format, for use with the Ensembl Variant Effect Predictor. The question for the end – which I’ve left open – is this: as personal genomics becomes commonplace, we’re going to need simple but effective reporting tools that patients and their clinicians can use. What are those tools going to look like?

Looking forward to spending some time in Melbourne and hopefully catching up with this awesome lady.

Filed under: australia, bioinformatics, meetings Tagged: csiro, foam, presentations