Tuesday, 11 December 2012

EBI as a data refinery


In describing what the EBI does, it is sometimes hard to provide a feel for the complexity and detail of our processes. Recently I have been using an analogy of the EBI as a "data refinery": it takes in raw data streams ("feedstocks"), combines and refines them, and transforms them into multi-use outputs ("products"). I see it as a pristine, state-of-the-art refinery with large, imposing central tanks (perhaps with steam venting here and there for effect) from which massive pipes emerge, covered in reflective cladding and connected in refreshingly sensible ways. In the foreground are arrayed a series of smaller tanks and systems, interconnected in more complex ways. Surrounding the whole are workshops, and if you look closely enough you can see a group of workers decommissioning one part of the system whilst another builds a new one.

(Image: oil refinery on the north end of March Point; Mount Baker in the background)
I find this analogy useful for a number of reasons. First, a "product" is often itself a feedstock, which is why the EBI has so many complex cycles of information. For example, InterPro member database models and patterns are feedstocks for the InterPro entries; during refinement they become associated with one another, with documentation and with gene ontology (GO) assignments. InterPro takes in UniProt (UniParc) protein sequences and combines them with models to provide domain boundaries on proteins; these in turn allow the 'InterPro2GO' GO assignment process to occur. This automatic GO annotation is then applied to the UniProtKB entries along with experimentally defined GO annotations, which come from GO curators worldwide and include many entries about model organisms. The InterPro entries additionally provide raw information (feedstock) for UniRule automatic annotation, where an InterPro match is the mainstay condition of a particular rule, which the UniProt curator combines with other conditions such as taxonomic restrictions and sequence properties, ensuring the most accurate application of annotation extracted from experimentally proven UniProtKB entries to proteins of unknown function.
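To make the UniRule step a little more concrete, here is a minimal, hypothetical sketch of this kind of rule logic in Python. The class names, conditions and the InterPro accession are all invented for illustration; this is not UniProt's actual rule language, just the shape of the idea: an InterPro match as the mainstay condition, combined with taxonomic and sequence-property checks, before curated annotation is propagated to proteins of unknown function.

```python
# A minimal, hypothetical sketch of a UniRule-style rule. All names,
# thresholds and the InterPro accession are invented for illustration;
# this is not UniProt's actual rule language.
from dataclasses import dataclass, field

@dataclass
class Protein:
    accession: str
    sequence: str
    taxonomy: set          # lineage names, e.g. {"Bacteria"}
    interpro_matches: set  # InterPro entry IDs matched by member-database models
    annotations: dict = field(default_factory=dict)

@dataclass
class Rule:
    required_interpro: str  # the mainstay condition: an InterPro match
    required_taxon: str     # a taxonomic restriction
    max_length: int         # a simple sequence-property condition
    annotation: dict        # annotation lifted from reviewed entries

    def applies_to(self, protein):
        return (self.required_interpro in protein.interpro_matches
                and self.required_taxon in protein.taxonomy
                and len(protein.sequence) <= self.max_length)

rule = Rule(required_interpro="IPR999999",  # hypothetical accession
            required_taxon="Bacteria",
            max_length=500,
            annotation={"function": "Putative example enzyme"})

protein = Protein("P99999", "M" + "A" * 300, {"Bacteria"}, {"IPR999999"})
if rule.applies_to(protein):
    protein.annotations.update(rule.annotation)
print(protein.annotations)  # {'function': 'Putative example enzyme'}
```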

This is a complex network of inputs and outputs (just writing it down and trying to keep it all straight is exhausting unless you are part of it – I went through a couple of rounds with Claire O'Donovan and Sarah Hunter to get the above flow absolutely straight), but the main input – bare protein sequences (coming from internal feedstocks including ENA and Ensembl) – is being converted into the main output: annotated protein entries, with human-readable annotation and a careful audit trail of their 'refinement'. This is what the user sees as the output of the refinery, and understandably they do not want to spend too much time worrying about the details of pipe connectivity inside the refinery.

Another reason I find the refinery analogy useful is that volume can be deceptive. The biggest, most impressive tanks in this refinery are filled with DNA sequence data, but for the refinery to work as a whole it needs many "specialist" chemicals, in lower volumes, to serve as critical catalytic components. It might be necessary for the refinery to make and store some components in order to streamline a more complex flow of information. The EBI works with key "catalyst" streams of information that have a disproportionate impact relative to their volume (e.g. the assignment of experimentally defined annotation described above).

A deceptive view of this refinery would focus exclusively on the final outputs and the most recent refinement process, without taking in the intricate web of components behind them. People might use Reactome or IntAct to understand a particular functional dataset, but the protein information in these resources depends on UniProt to track and annotate these sequences. The protein information in UniProtKB is dependent on the ENA database smoothly accepting submissions with annotated CDS proteins present. In this way, asking to visualise, say, phosphoprotein results on a pathway diagram is not as simple as it might seem. It implicitly draws on many of the tanks in the EBI refinery. This larger network actually goes beyond the EBI's borders to its worldwide collaborators (e.g. wwPDB, or the INSDC’s GenBank/ENA/DDBJ).

The final "product" that the user sees often has a local manufacturer (i.e., bioinformatician/computational biologist) who pulls in information from the large tanks and combines it with local data to provide an overall picture and give context. Often, the research group querying EBI data does not worry too much about the details of how the refinery works, or about the complex inter-dependencies of the refinery; they just want easy access to a product they can rely on. It is the job of the EBI, and in future will be the job of ELIXIR, to satisfy this desire.

A refinery does not stay still. In each process, engineers (in our case bioinformaticians and software engineers) work to improve minor, everyday things and to carry out major retooling. New types of experimental information might require a new tank and pipelines, or become cheap enough to replace older feedstocks, in both cases opening up potential for new, useful products. New discoveries might change the way processes or transformations are handled, perhaps by adding a certain catalyst at a particular stage to improve the products.

Clearly the EBI is not the only refinery. Our European partners, such as SIB and Sanger, collaborate so closely with us on key projects that it's hard to work out where one refinery stops and the other begins. We exchange data and expertise regularly with large refineries in the US and Japan, such as NCBI, UCSC, NIG and RCSB. It is exciting to see all of the proto-refineries in Europe, which offer different core competencies and are coalescing into a single, robust refinery: ELIXIR.

Like all analogies, this is not perfect. The concept of free data sharing, which is at the heart of molecular biology, does not fit well with this analogy. Although the complex process of providing the necessary CPU, disk and network has some resonance with the internal “plant” infrastructure, the fact that it is so generic and tradable does not. The EBI's products are also directly used via the web, often without much intermediation (no need for a network of gas stations, etc.). Nevertheless, the picture of a complex interplay of inputs being progressively refined is helpful when trying to disentangle some of our trickier problems.

I welcome feedback on this analogy, and to what extent it helps one understand the EBI.

Thursday, 6 December 2012

West meets East


I've just come back from around 10 days in China, visiting Nanjing, Shanghai and Hong Kong, and have a whole new perspective on this part of the world. I was not able to work Beijing into my trip this time, which was frustrating because I know there is a lot of good science happening there.

What was really different about this trip was that I came away feeling much more of a connection to China. It was great to meet new people and to renew more longstanding scientific contacts – but I also had more time (and, perhaps more importantly, more confidence) to travel between cities, have breakfast in local cafes rather than hotels, and generally get to know each place a little better. Previous trips (this was my fourth) required such a packed schedule that jetlag and the whole novelty of China completely dominated my experience.

Now that I'm sitting down to write about the experience, the first thing I'm inclined to do is draw some analogies with western countries. But analogies only go so far - even when they fit relatively well, they break down in the face of China's distinct character. I do feel more knowledgeable than I have after previous visits to China, but I fully expect that future visits will reveal further dimensions and facets of this immense and complex country.

On some level, China reminds me of the US: it's a huge country with vast distances to travel between locations, and has a tremendously strong sense of a single nation. Everyone I met considered themselves "Chinese", and there is a strong sense of a binding history and cultural underpinning. Also, similar to the US, China (and the Chinese...) is aware of its size and economic power, and is conscious of having a strong voice on the world stage. Hong Kong, Shanghai and Beijing are cosmopolitan cities, with a sometimes exuberant celebration of the past 20 years' economic growth. I won't stray into geopolitics – it's not my field of expertise at all – but a country of this size with sophisticated metropolitan areas will almost certainly make a big impact on science over the next couple of decades.

China shares some features with Europe – notably a diversity of language and culture across many provinces. Chinese provinces are often larger than European countries, and often have similar overall GDP. The many Chinese "dialects" are better described as different spoken languages, but importantly they share a set of written characters (with some modifications).  The implications of having a universally comprehensible written language for such a range of linguistic groups are profound.

My initial impression was that China had two major languages – Cantonese (used around Guangdong and Hong Kong) and Mandarin – with various dialects, but this trip really impressed upon me just how diverse the linguistic landscape of mainland China is. For example, Shanghainese is a dialect of Wu, a language family predominant in the eastern central area. When I was out for dinner in Shanghai with a Mandarin speaker, the waiter spoke to us in this lilting tone (Shanghainese, as it turned out) and I turned to my companion for translation; she smiled, shrugged her shoulders and shifted the conversation to Mandarin. It was like dining with an Italian colleague in Finland and assuming she would know Finnish.

I’m much more aware now of the distinctive character and cultures of China’s provinces, which, along with the importance of personal networks, resonates with Europe.

While it’s fun to draw familiar parallels, China is clearly nothing like a mixture of the US and Europe. It is hard enough to completely understand the historical perspectives and cultures of one’s neighbours – it is going to be a long time before I will completely grasp the fundamental complexities of China. What I can say now is that its diversity is more and more fascinating to me, and something to be celebrated.


I wrote some time ago about scientific collaboration with China (see East meets West), focusing on the positive aspects of openness and collaboration in engaging with this and other emerging economies (e.g. Brazil, India, Russia and Vietnam). As scientists, we have the good fortune of being expected to share scientific advances, discuss collaborations and discover new things jointly, because these are the right things to do – socially and strategically.


China already has some leading scientists and excellent scientific institutions, and I am sure this will only grow in the future. But communication is an essential component of community, and social media has been highly beneficial in keeping information flowing in much of the global scientific community. It's frustrating that news platforms like Twitter are blocked in China. The EBI has set up a Weibo account (www.weibo.com/emblebi) where we will be posting (in English!) news items from the EBI. Hopefully this will help keep scientists in China up to date with developments at the EBI – so please do pass it on to your Chinese colleagues.


On a more personal note, I've discovered that my first name (Ewan) is pronounced (in some dialects) almost identically to Yuan (a Chinese word for money). On Wikipedia, one of the pronunciation descriptions of Yuan is written identically to one of Ewan (what more proof do you need!), but I am not clear (a) whether this is a variation in pronouncing Yuan in Mandarin or a dialect shift, and (b) what tonal form it has. I'd be delighted to get some sort of linguistic survey of Yuan forms geo-tagged across China. People who have read my name sometimes get confused because they have a pre-formed idea of how to pronounce it (often "Evan" or "Ee-Wan" – one to save for my next Star Wars role). So it's useful to know that I can say, "Ewan, like money, Yuan," and this will provide some relief to my new acquaintance, who can file the name alongside a well-known phrase. (And before you say it, I know that I am just as bad when it comes to pronouncing some names – Chinese or not – in other languages!)

So - I'm "Money" Birney. I can't quite work out whether I should be proud or a bit worried about this moniker.

Many, many thanks to my hosts and the new people I met on this journey: Ying, Philipp, Jing, Jun, Huaming, Hong, Laurie, Scott and many others. I look forward to seeing you again, and learning more on my next trips to China.

Thursday, 1 November 2012

Literature services blossoming in Europe


Yesterday saw the switch of "UKPMC" (UK PubMed Central) to "Europe PMC" (Europe PubMed Central). This is mainly cosmetic at the moment (branding, look and feel, etc.), but it is the start of the real blossoming of literature services in Europe and at the EBI; get ready for lots more changes in 2013 and 2014.

Just a quick bit of history for people who are not totally au fait with all things literature: abstracts of published articles have been collected by a number of groups for some time, in particular the National Library of Medicine (NLM) in the US, who distributed them in the pre-internet age on a variety of media (I remember a great CD loader system, with a sort of clunky miniature robot thing, from when I was a young kid just getting into bioinformatics. You wanted to do corpus-wide searches just to see the robot in action). This was called "MEDLINE", and the internet-accessible MEDLINE was called "PubMed", run by the NCBI (which is part of the NLM). At the end of the 90s and the start of this century there was a concerted push by a number of people for full-text open access publishing, and the NCBI set up "PubMed Central", abbreviated PMC. This has been running now for about 12 years, with a healthy number of journals either open access or offering per-article open access. A key part was an NIH mandate to deposit articles in PMC, either via the journal or via author submission. There are complications in this (for example, there is a distinction between 'free to read' access and 'free to read and reuse' open access), but compared to a decade ago there has been a remarkable change, and I am sure this will continue.

Around 5 years ago, as the EBI matured, we decided to put more emphasis into literature services, and we were lucky to recruit Jo McEntyre. There was also a similar move in a number of (at the time) UK funding agencies to mandate article submission to PubMed Central. As part of this, the funding agencies knew that they also had to invest in the infrastructure, resulting in the development of UKPMC, a joint venture between the British Library, the University of Manchester and the EBI, with the project being led by the EBI since 2011. UKPMC is synchronised with PMC (i.e. all PMC articles are visible in Europe PMC, and vice versa, like the DNA archives).

With the joining of the ERC as the third non-UK European funder requiring submission of full text, calling this UKPMC was increasingly out of date. Europe has always had a large group of text miners and people involved in the publishing world (BioMed Central/Springer for example, as well as the big publishing houses such as Elsevier and Macmillan – and of course there are mixed views about publishing models, which I won't go into on this blog), so there is a lot of expertise and knowledge to leverage across Europe on literature – as well as continuing to work with colleagues in the US and further afield, such as China and Japan.

Europe PMC has sensibly decided not to make an artificial distinction between abstracts (old-style MEDLINE) and full text, and so the previous EBI abstract service (CiteXplore) has been merged with full text into one web portal – www.europepmc.org – so a single web site allows you to search all the literature that Europe PMC has access to (including PubMed abstracts). We've also taken advantage of Dietrich Rebholz-Schuhmann's great "Whatizit" framework for text recognition of genes/proteins, chemicals, diseases, accession numbers, organisms and GO terms, allowing better linkage with the "factual" databases at the EBI. Europe PMC already has a very effective web service that allows for on-demand expansion of literature information, which a number of other groups are using.
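For the programmatically minded, here is a minimal sketch of querying Europe PMC from a script. The endpoint URL and parameter names are assumptions based on the public Europe PMC REST search service as I understand it; check the Europe PMC documentation for the authoritative interface before relying on them.

```python
# A minimal sketch of searching Europe PMC programmatically. The endpoint
# and parameter names are assumptions based on the public REST search
# service; consult the Europe PMC documentation for the authoritative API.
import json
import urllib.parse
import urllib.request

BASE = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"

def search_europepmc(query, page_size=5):
    params = urllib.parse.urlencode({
        "query": query,
        "format": "json",
        "pageSize": page_size,
    })
    with urllib.request.urlopen(f"{BASE}?{params}") as response:
        return json.load(response)

# Example: open-access articles mentioning a gene name.
results = search_europepmc("BRCA1 AND OPEN_ACCESS:y")
for hit in results["resultList"]["result"]:
    print(hit.get("pmid", "-"), hit.get("title", ""))
```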

But it won't stop here – Jo works with other collaborators in Europe (such as the Swiss Institute of Bioinformatics, the European Patent Office and OpenAIRE Plus); there is likely to be even more blurring of the lines between "traditional" publishing, full-text archiving and databases (indeed, we held a very successful workshop on campus on this topic just recently). I am looking forward to a steady increase of useful features coming into Europe PMC, both on the website and in terms of other services.



You can follow Europe PMC progress on its blog (http://blog.europepmc.org/) and on Twitter (@EuropePMC_news).


And congratulations to Jo and the team on a great looking, useful web site. Bookmark it now!



Monday, 17 September 2012

Human genetics: a tale of allele frequencies, effect sizes and phenotypes



A long time ago, in the late 1990s, I was on the edge of the debate about SNPs: whether there should be an investment in first discovering, then characterising and then using many, many biallelic markers to track down diseases. As is now obvious, this decision was taken (first the SNP Consortium, then the HapMap project and its successor, 1000 Genomes, and then many genome-wide association studies). I was quite young at the time (in my mid to late twenties; I even had an earring at the start of this, as I was a young, rebellious man) and came from a background of sequence analysis – so, I remember, it was quite confusing getting my head around all the different terminology and subtleties of the argument. I think it was Lon Cardon who patiently explained the concepts to me yet again, and he finished by saying that the real headache was that there were just so many low-frequency alleles that were going to be hidden, and that was going to be a missed opportunity. I nodded at the time, adding yet one more confused concept in my head to discussions about r2 behaviours, population structure, effect size and recombination hotspots, none of which sat totally comfortably in my head at the time.


That debate is worth someone trying to reconstruct and write up (I wonder if those meetings are recorded somewhere) because, as in many scientific debates, everyone was right at some level. For the proponents of the SNP approach, it definitely "worked" – statistically strong, reproducible loci were found for many diseases. Although these days people complain about the lack of predictions from GWAS, at the time the concern was not whether there would be some missing heritability issue, but (as I remember) whether it would work at all. It did, and in spades – just open an issue of Nature Genetics. However, the people who were cautioning that there would be a lot more complexity to disease – allelic heterogeneity, complex relationships between SNPs (both locally and globally) and then this curse of allele frequencies, let alone anything more complex, such as gene/environment interactions, parent-of-origin effects or even epigenetic trans-generational inheritance (I list these in the rough order of my own assessment of impact; feel free to reorder to taste) – have also definitely been proved right by our current scenario.


Remembering that young man in his mid-twenties, confused by all the terms spinning around, each of these pieces of complexity deserves unpicking. Allelic heterogeneity is when a locus is involved in a disease (for example, the LDL receptor – the LDLR gene – in familial hypercholesterolaemia), but there are many different alleles (different mutations), often with different effects, involved in the disease. This means the disease is definitely genetic, and that a particular gene is definitely involved, but no particular SNP is found at a high level of association with the disease, as there are hundreds or so of different (probably) causative alleles. The complex relationship between SNPs, epistasis, occurs both at a local ("haplotype") level, where there might be a particular combination that is critical, and globally. A good example of this local complexity is the study by Daniel MacArthur and colleagues, where they found that a number of apparent frameshift mutations, predicted to be null alleles, were "corrected" back into frame, making (in effect) a series of protein substitution changes, presumably with a far milder effect, if any. If you try to model each variant alone here, you make very different inferences from modelling the haplotype; in theory one should try to model the whole, global genotype.
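A toy example makes the haplotype point concrete. In the sketch below (an invented sequence and invented variants, not the actual data from the MacArthur study), each of two nearby indels looks like a frame-shifting null allele when modelled on its own, but on the same haplotype the second restores the frame broken by the first:

```python
# Toy illustration: two indels that each break the reading frame alone,
# but restore it in combination. Sequence and variants are invented.
def apply_variants(seq, variants):
    """Apply (position, ref, alt) edits right-to-left so that earlier
    edits do not shift the coordinates of later ones."""
    for pos, ref, alt in sorted(variants, reverse=True):
        assert seq[pos:pos + len(ref)] == ref, "ref allele mismatch"
        seq = seq[:pos] + alt + seq[pos + len(ref):]
    return seq

def frame_intact(seq):
    return len(seq) % 3 == 0

cds = "ATGGCTGAAACCCTTGGAGAATAG"   # invented 24 bp coding sequence
deletion = (6, "G", "")            # 1 bp deletion: frameshift on its own
insertion = (12, "C", "CA")        # 1 bp insertion: frameshift on its own

for label, variants in [("deletion alone", [deletion]),
                        ("insertion alone", [insertion]),
                        ("both on one haplotype", [deletion, insertion])]:
    mutated = apply_variants(cds, variants)
    print(f"{label:>22}: frame intact? {frame_intact(mutated)}")
```

Only the codons between the two indels are altered when both are present; modelled one variant at a time, both would be called null alleles.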

And only recently have I really come to appreciate the headaches that Lon was trying to explain to me around allele frequency. One of the early and robust predictions of population genetics, which is pretty obvious when you think about it, is that one expects a steep decay in the number of alleles as a function of their frequency in the population – i.e. lots, lots more rare alleles than common alleles. This is because when a mutation happens, it must start at a ratio of 1 to "the whole size of the population" and can only grow bigger generation by generation. If the allele doesn't affect anything, you can model this process very elegantly as a random walk. For starters, this random walk tends to stay at pretty low frequencies just because it is random, and in fact the most likely outcome is that the allele randomly disappears from the population. Now if the allele has a deleterious effect – which is basically what we expect for disease-associated alleles – then it is even more likely to stay at a low frequency. I visualise this as the genome having a series of little bubbles (variants) coming off it, with these bubbles nearly always popping straight away (the variant going to zero); only rarely does a bubble get big (grow in frequency in the population). A disease effect is always pushing those bubbles associated with disease to be smaller. And – often – you can't even see the small bubbles. At the limit, every locus will have complex allelic heterogeneity; the only question is how big – in both frequency and effect – some of the alleles are.
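The "bubbles" picture is easy to simulate. Below is a minimal Wright-Fisher sketch (the population size, selection coefficient and replicate count are illustrative choices, not values from any study): each new mutation starts as a single copy, drifts by binomial resampling each generation, and a deleterious effect makes early extinction even more likely.

```python
# A minimal Wright-Fisher sketch of the "bubbles" picture. All parameter
# values (population size, selection coefficient, replicates) are
# illustrative choices.
import numpy as np

rng = np.random.default_rng(42)

def peak_frequency(pop_size=1000, s=0.0, max_gens=20_000):
    """Track one new allele (one copy among 2N chromosomes) until it is
    lost or fixed; return the highest frequency it ever reaches."""
    n_chrom = 2 * pop_size
    freq = 1 / n_chrom
    peak = freq
    for _ in range(max_gens):
        # selection (cost s to the allele) shifts the expected frequency,
        # then binomial resampling supplies the drift
        expected = freq * (1 - s) / (freq * (1 - s) + (1 - freq))
        freq = rng.binomial(n_chrom, expected) / n_chrom
        peak = max(peak, freq)
        if freq in (0.0, 1.0):
            break
    return peak

for s in (0.0, 0.01):
    peaks = [peak_frequency(s=s) for _ in range(2000)]
    never_common = np.mean([p < 0.01 for p in peaks])
    print(f"s={s}: fraction of new alleles never exceeding 1% frequency "
          f"= {never_common:.3f}")
```

Even the neutral case shows the large majority of new "bubbles" popping almost immediately; adding a modest selective cost pushes that fraction higher still.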


Having appreciated this at a far deeper level now (partly from looking at a lot more data myself), I am even more impressed that GWAS works. For GWAS one not only needs a variant tagging the disease variant, but that variant has to be at a reasonable enough frequency to detect something statistically – one or two individuals will not cut it. This is one of the big drivers for the large sample sizes in genome-wide association studies – large sample sizes are needed just to capture enough of the minor allele of rare variants – and remember that the majority of variation is in this "rare" category.
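A back-of-the-envelope calculation shows why: the expected number of copies of the minor allele observed in a sample of N diploid individuals is simply 2·N·p, so for rare variants only very large cohorts even see the allele often enough to test it. The sample sizes and frequencies below are illustrative:

```python
# Why rare alleles demand large cohorts: the expected number of observed
# minor-allele copies in N diploid samples is 2 * N * p. Sample sizes
# and frequencies below are illustrative.
for n in (1_000, 10_000, 100_000):
    for maf in (0.05, 0.005, 0.0005):
        print(f"N={n:>7}, MAF={maf:.4f}: "
              f"expected minor-allele copies = {2 * n * maf:7.1f}")
```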


But the other place where we can improve our ability to understand things was illustrated by a talk by Samuli Ripatti, working with other colleagues worldwide (including my new collaborator, Nicole Soranzo) on lipid measurements. They took a far larger set of lipid measurements than is normally done in a clinical setting, with an alphabet soup of HDL and LDL subtypes, along with all sorts of amino acids. From this they not only recapitulated all the existing HDL and LDL associations, but very often the specific subtypes of LDL or HDL showed far stronger effects than the composite measurements. In some sense this is no surprise – the closer one gets to measuring a biological end point of genes, the bigger the effect you will see from variants, whereas more composite measurements must have more sources of variation by their very nature. And this is where all the molecular measurement techniques of ChIP-seq, RNA-seq, etc. (exploited and explored in projects like ENCODE and others) are going to be very interesting, though we won't be able to do everything on every cell type.

 
So - the moral of this story is twofold. Firstly, we will need large sample sizes to understand the full set of genetic effects – despite many people telling me this over the last three or four years, it only really "clicked" in my head in the last 6 months. Secondly, we need to raise our (collective) game in phenotyping – and not just molecular phenotyping, or cellular, or endo-, or disease phenotyping, but all types of phenotyping – as the closer we can get to the genotype from the phenotype end, the better powered we are.


And many, many groups worldwide are getting stuck into this, which tells me that we have at least another decade's worth of discovery coming from relatively "straightforward" (in concept, though not in practice, logistics, sequencing or analysis!) human genetics.

Sunday, 9 September 2012

Response on ENCODE reaction


The publication of the ENCODE data raised substantial discussion. Clear, open, rational debate with access to data is the cornerstone of science. On the scientific details, the ENCODE papers are totally open, and we have aimed for a high level of transparency, e.g. a virtual machine providing complete access to data and code.

There is an important discussion – which no doubt will continue throughout this decade – about the correspondence between reproducible biochemical events on the genome, their downstream cellular and organismal functions, their selection patterns in evolution and their roles in disease. ENCODE provides a substantial new dataset for this discussion, not some definitive answer, and is part of a longer arc of science in this general area. I touch on this on my blog.

There are also "meta" questions concerning the balance of "big" and "small" science, and how "big" science projects should be conducted. The Nature commentary I wrote focuses on this.

ENCODE also had the chance to make our results comprehensible to the general public: those who fund the work (the taxpayers) and those who may benefit from these discoveries in the future. To do this we needed to reach out to journalists and help them create engaging stories for their readers and viewers, not for the readers of Nature or Science. For me, the driving concern was to avoid over-hyping the medical applications, and to emphasize that ENCODE is providing a foundational resource akin to the human genome.

With hindsight, we could have used different terminology to convey the concepts, consequences and massive extent of the genomic events we observed. (Note to self: one can be precise about definitions in a paper or a scientific talk to scientists, but it's far harder via the medium of the everyday press, even to the same audience.) I do think we got our point across to the general public: that there is a staggering amount of activity in the genome, and that this opens up a lot of sophisticated and highly relevant scientific questions. There was a considerable amount of positive mainstream press, sometimes quite nuanced. Hindsight is a cruel and wonderful thing, and probably we could have achieved the same thing without generating this unneeded, confusing discussion of what we meant and how we said it.

I am tremendously proud of the way that the consortium worked together and created the resources that it did. The real measure of a foundational resource such as ENCODE is not the press reaction, nor the papers, but the use of its data by many scientists in the future.