Friday, 21 November 2014

Another step into Quantitative Genetics...


Today sees the publication of the paper by Zhihao Ding, YunYun Ni, Sander Timmer and colleagues (including myself) on local sequence effects and different modes of X-chromosome association as revealed by the quantitative genetics of CTCF binding. This paper represents the joint work of three main groups: Richard Durbin's at the Sanger Institute, Vishy Iyer's at U. Texas, Austin and my own at EMBL-EBI. I'm delighted that this work from Zhihao, YunYun and Sander (the three co-first authors) has finally come out, and I want to share some aspects of the work that were particularly interesting to me.


Stepping back into quantitative genetics

This is the second time I've shifted research direction. Even though it's still broadly in the area of bioinformatics, quantitative genetics is a discipline I hadn't explored very deeply before embarking on this paper. Quantitative genetics is a very old field - arguably one that lit the spark that set off molecular biology as a science - and the birthplace of statistical methods for life science. 
(Image: RNA FISH for X-chromosome inactivation)

Legends of frequentist statistics - Pearson (of Pearson's correlation) and Fisher (of Fisher's exact test) - were motivated by biological problems around variation and genetics from the 1890s through the 1930s/1940s. This was also the time of the rediscovery of Mendel, the fusing of Mendel's laws with evolution, and a whole host of fundamental discoveries: for example, Morgan's discovery that chromosomes carry the units of hereditary information. My personal favourite is one on X-Y crossovers, discovered in my favourite fish, Medaka, in 1921, and highlighted by a great editor's comment. Reading some of these papers can be spine-tingling. These scientists did not know about DNA, DNA polymorphisms as we understand them, or genes as RNA transcription units. Still, they could figure out many features of how genetics affected traits.

Quantitative genetics fell out of fashion in the 1970s and 1980s as the new, wonderful world of molecular biology sprang up. The new capability to isolate, manipulate and use nucleic acids seemed to bypass the need to infer idealised things statistically from whole organismal systems. There was also a push to use forward genetics (i.e. large genetic screens) in model organisms - in fact, some of the best work was based on the same organism (Drosophila) on which the original genetics was worked out. 

In humans, the most systematic discovery of single locus genetic diseases - Mendelian disorders - led to epic, systematic "mapping" trips for these (CFTR being perhaps the most iconic). Only the plant and animal breeders kept the flame of quantitative genetics alive during this time.


Human genome + Dolly the sheep + transparent pigs = great time to be in genetics

The process of mapping human disease-causing genes was a big part of the motivation for the human genome project. At some point it became clear that rather than being done as a collection of rather individual laboratory efforts, this mapping would be far better off taking a systematic approach. John Sulston and Bob Waterston surprised everyone by showing that sequencing a large complex genome (in this case, C. elegans) could be done by scaling up the process of sequencing in a rather factory-like way - and this was just in time for the larger-scale human genome project.

(At the time all this was getting underway, I was an undergraduate transitioning to a PhD student at the Sanger Institute. I came into it during a rather exciting three-year period when the public project grappled with the entry of a privately funded competitor, Celera. It was a fascinating time but I should perhaps write about it in another blog post. Back to quantitative genetics!)

I was introduced properly to quantitative genetics when I became part of the governance board of the Roslin Institute. (I still don't know who suggested me as a Board member - Alan Archibald? Dave Burt?) Scientifically and politically, it was a real eye-opener. I learned (rather too much) about how BBSRC Institutes are run and constituted, and started to appreciate the tremendous power of genetics. 

The Roslin was in the news for the successful nuclear transfer from a somatic cell to make a new embryo, Dolly the sheep, and this was great work. (They also made transgenic chickens and pigs - a glow-in-the-dark pig is quite a thing to see, and surprisingly useful). Together with the Genetics department in Edinburgh, the Roslin was one of the few places that still took quantitative genetics seriously. From talks with Chris Haley and Alan Archibald (and beers/scotch) I started to realise the awesome end-run that quantitative genetics does around problems. 

The great thing about genetics is that if you can measure something reliably, and thus find the variance between individuals that is due to genetics, you can identify genetic components with the usual triad of good sample size, good study design and good analysis. How well you can do this requires understanding the architecture of the trait (e.g. number of contributing loci, distribution of effect sizes) - itself something non-trivial to work out - but in theory all measurable things with some genetic variance are discoverable, given enough samples.


No guesswork required

Let me stress: you don't need to guess where to look in the genome to associate genetic variants with something you measure, so long as you test the entire genome sensibly. So you needn't have a profound mechanistic understanding of what you want to measure - you just need the confidence that there is some genetic component there, and the ability to measure it well. Step back for a moment and think about all the things you want to know about an organism on the genetic level that you could measure. It could be something simple to measure, like height (which hides a lot of complexity), or something more complex, like metabolites (which are often driven by far simpler genetics), or something really complex, like mathematical ability in humans.

Chris, Alan and others really knew that quantitative genetics worked. For their own part, they were often paid by animal breeders (companies breeding chickens, pigs and cows) to improve rather complex traits such as milk yield or growth rate. These companies may have appreciated that good science was nice to have, but they were pragmatic and ultimately driven by the bottom line. If something did work, they would come back to Chris and Alan. Sure enough, Roslin had a lot of repeat business.

During this time I started to realise that a lot of quantitative genetics methods developed at the end of the 20th century were aimed at circumventing two facts: genomes are big, and determining markers in many individuals is expensive. As next-generation sequencing came online, I could see we were going to have a completely inverted situation: genotyping was going to get stupid cheap and dense, and accordingly any set of phenotypes (things you can measure) would become amenable to these techniques (again, with the caveat of appropriate sample size). And I really mean any phenotypes.


What to measure?

At some level, this is ridiculous. There are so many things you want to know about, on many scales - molecular, cellular, organism - and to have one technique (measure and associate) seems just a bit too conceptually simple. Of course, it's not so simple - there must be some genetic variance to get a handle on the problem (sometimes there are just no variants that have an effect in a particular gene), and the genetic architecture might be very complex, but these assumptions are reasonable for many, many interesting things - in particular the more "basic" things about living organisms. The big problem is choosing what to do. So, about five years ago I started to think more about what might be the most interesting things to measure, and in which systems.

For measurements it seems crazy to have a single variable captured each time. Why measure just one thing? You want a high-density phenotyping method so you can look at lots of variables at the same time. This leads you to two technologies for readout: next-gen sequencing (all gene expression, or all binding sites) for molecular events, or imaging for cellular and organismal things.

I knew I wanted to stay away from human disease as a system. This is a hotly contested field with a few big beasts controlling the drive towards different endpoints. I like working with these big beasts (they are usually very friendly and clever) but know it would be foolish to try and break into their area (far better to collaborate!). Talking to Chris and Alan (and later with epidemiologists George Davey-Smith and John Danesh), I came to realise that disease brings in all sorts of complications -- such as confounders of diagnosis and treatment. 

I also knew I wanted to work on things where you could close the loop from initial discovery to specific mechanism/result, and I wanted to work with experimental systems we could manipulate many times over...


Back to the paper...

And this, finally, leads us back to the new paper. We worked on a molecular phenotype: CTCF binding. CTCF is an interesting transcription factor with a great ChIP antibody, which was great for the feasibility of our experiments. CTCF can be measured in a high-content manner (i.e. ChIP-seq) in an experimentally reproducible system (i.e. human LCL cells; 50 of the 1000 Genomes cell lines, so we had near-complete genotypes). It fits two criteria - it's a high-dimensional phenotype (there are around 50,000 good CTCF sites in the genome) and we did it in an experimental system that we could go back to - LCLs. It is, arguably, my lab's first "proper" quantitative genetics paper. As ever when you switch fields, the project brought with it a multitude of details to sort out and ponder.

For example, we had to learn to love QQ-plots more than box-plots. QQ (quantile-quantile) plots make a big difference in judging whether your associations make sense, given the multiple testing problem. Because the genome is a big place, you are going to test a lot of things; you are guaranteed to find "visually interesting" associations (the boxplot looks good, the P-value looks good) and you will not have any idea whether they are interesting enough given the number of tests. And, rather more subtly, you need to know whether your test is well behaved. A QQ plot summarises both of these neatly.
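The numbers behind a QQ plot can be sketched in a few lines of Python (a minimal illustration, not our actual pipeline; `qq_points` is a made-up helper name). Feed the two returned arrays to any plotting library: points hugging the diagonal mean a well-behaved test, while a lift only in the extreme tail suggests genuine associations.

```python
import numpy as np

def qq_points(pvalues):
    """Return (expected, observed) -log10 P-value quantiles for a QQ plot."""
    p = np.sort(np.asarray(pvalues, dtype=float))  # ascending P-values
    n = len(p)
    # The i-th smallest of n uniform P-values is expected at ~ (i - 0.5) / n
    expected = -np.log10((np.arange(1, n + 1) - 0.5) / n)
    observed = -np.log10(p)
    return expected, observed

# Under the null (uniform P-values), observed tracks expected across the range:
expected, observed = qq_points(np.random.uniform(size=10_000))
```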

There are deeper gotchas - notably population stratification - and getting a good feel for linkage disequilibrium does take a while. There are, of course, all sorts of measurement/technical issues, and in this case my experience with ENCODE gave me a good grounding in the practicalities of ChIP-seq.

Much of what we discovered was to be expected. Indeed, variants do affect CTCF binding, particularly when they are in the CTCF binding motif (I can see a number of CTCF/transcription factor molecular biologists rolling their eyes at the obviousness of this). But there are a sizeable number of variants that are not in the motif and, presumably, affect binding via some other indirect mechanism (LD presents some complications here, but we're in a good position to assess this). We also saw a large number of allele-specific signals. Interestingly, we can show that when there is a between-individual genetic difference in binding, it is related linearly to allele-specific levels in heterozygous individuals - but not the other way around: there are allele-specific sites that do not show between-individual differences. This is perhaps not that surprising, but it was good to see, and good to quantify the level of it.


Serendipity

And then biology just does something unexpected. As my student Sander Timmer was looking to find individual CTCF-to-CTCF "clean" correlations, he came upon a stubborn set of sites that formed a big hairball of correlations. This is not uncommon for this sort of data (there are all sorts of reasons why you get correlations: antibody efficiency, growth conditions, weird unplaced bits of genome present in one set but not another...). We were digging through all these options (blacklisting, weird samples, weird bits of genome) and this hairball just was not going away. I was almost at the stage of just acknowledging and accepting the hairball, assuming it was some artefact and moving on to the more individual site correlations, when Sander came in showing that nearly all the sites were on the X chromosome.


Aha.

It's easier to describe this from the perspective of what we believe is going on than to talk about feeling around the data for three months or so, trying to work it all out. The X chromosome is a sex chromosome, and males have one copy whereas females have two copies. This causes a headache for mammals in that if the X chromosome behaved "as normal", females would consistently show twice the expression of all X chromosome genes than males. 

In mammals this is "solved" by X-chromosome inactivation, where one X chromosome in females is switched off by quite an extreme molecular process called (unsurprisingly) X-inactivation. (Biology, in its weird and wonderful way, does this completely differently in other animal clades, e.g. C. elegans or Drosophila. Go figure). This leads to a visibly compressed X chromosome in females (called the Barr body after the first person to characterise it), and this random inactivation underlies some classic phenotypes: for example, tortoiseshell cat (or calico cat, in US-speak) coat colouring is due to this. Multi-coloured eyes (sometimes within the same iris) in women are due to the random choice of which X chromosome to inactivate.

When you look at RNA levels in female cell lines, the vast majority of X chromosome genes have a similar expression level as males. There are exceptions (called "X-chromosome inactivation escape") and one famous, female-specific RNA - Xist - is a key molecule in X inactivation (see my previous post about the wonders of RNA).

What is CTCF doing?


All this is well established, but what did we expect to see for CTCF? CTCF is very much a structural chromatin protein involved in chromatin loops. Sometimes these loops are very important for gene regulation, giving rise to CTCF's role in insulators. Sometimes there is some other looping mechanism. And there are CTCF binding sites everywhere - the X chromosome included. So we thought there were three basic options for each CTCF site (there are ~1,000 CTCF sites on X):

  1. CTCF site is rather like RNA - mainly suppressed, present only on the active X. If so, we'd expect males and females to have similar levels of CTCF.
  2. CTCF site is involved in X-inactivation (perhaps a bit, perhaps a lot) - If so, we'd expect there to be female-specific CTCF sites.
  3. CTCF site is oblivious to X-inactivation. In this case we'd expect a ratio of 1:2 male:female.
We found sites in all three cases when we processed the CTCF data over the 51 samples (pretty evenly distributed male/female). We saw a lot of case 3 (CTCF is oblivious), quite a bit of case 1 (CTCF being like RNA), and then a tiny number of case 2 (CTCF involved in X-inactivation).
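As a toy illustration of the three options (the function name and cutoffs are invented for this sketch, not the paper's actual method), classifying a site from its mean normalised ChIP signal in males versus females might look like:

```python
import numpy as np

def classify_x_site(male_signal, female_signal):
    """Toy call for an X-chromosome CTCF site from per-sex mean normalised
    ChIP signal. Thresholds are illustrative only."""
    m, f = np.mean(male_signal), np.mean(female_signal)
    if m == 0 and f > 0:
        return "female-specific"  # option 2: involved in X-inactivation
    ratio = f / m
    # option 1 (RNA-like, only the active X bound) gives ~1:1 male:female;
    # option 3 (oblivious to inactivation) gives ~1:2 male:female
    return "single active" if ratio < 1.5 else "both active"
```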

Before I go any further, it's worth being sure of this result, because a lot of things can drive signal in these ChIP-seq - or any genomics - experiments. The clincher for me was when we looked at individuals who were heterozygous for alleles at option 1 vs option 3 sites. For sites classified as "single active" (option 1) we see one predominant allele - consistent with just one chromosome being bound. For sites classified as "both active" (option 3) we see mainly a 50/50 split of alleles - consistent with both chromosomes being bound.
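The heterozygote check boils down to asking whether reads over a heterozygous SNP split roughly 50/50 between alleles. Here is a self-contained sketch using an exact two-sided binomial test (function names and the significance cutoff are mine, for illustration, not the paper's pipeline):

```python
from math import comb

def binom_two_sided(k, n, p=0.5):
    """Exact two-sided binomial P-value: sum the probabilities of all
    outcomes no more likely than observing k successes out of n."""
    probs = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    return sum(q for q in probs if q <= probs[k] + 1e-12)

def allele_call(ref_reads, alt_reads, alpha=0.01):
    """'single active' if the allele split deviates strongly from 50/50
    (one chromosome bound), else 'both active' (both chromosomes bound)."""
    pval = binom_two_sided(ref_reads, ref_reads + alt_reads)
    return "single active" if pval < alpha else "both active"
```

So a site with a 48:2 read split would be called "single active", while a 26:24 split is consistent with both chromosomes being bound.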

All well and good, but we've now stumbled further into something that has been known for a long time: CTCF is a multi-functional protein involved in all sorts of things. 

Here, the process of X-chromosome inactivation allows us to pull apart at least two classes of CTCF site (probably more things are happening within these classes as well), each with a large number of sites. And indeed, they look very different. One (single active) is alive with activating histone modifications and the other (both active) is pretty silent (don't know your histone modifications? Here's a cheat's guide). But the "both active" set has just as strong an overlap with conserved elements as the "single active" set and, if anything, shows stronger nucleosome phasing around it (so it's definitely there).

There were also cases of female-specific CTCF sites. These are not (as you might have thought) around the Xist locus, but instead around two other non-coding RNA loci. The Chadwick group had already flagged these RNA loci on X as interesting for different reasons, and we recapitulate their data showing that these RNAs are only expressed from the active X in females. Perhaps these female-specific CTCF sites are specifically wrapping up the RNA on the inactive X. Perhaps they are functional - and important for X inactivation.


More questions than answers

We didn't take on the study to find out whether female-specific CTCF sites are wrapping up RNA on the inactive X, but in this respect our work raises more questions than it answers. Can we parse CTCF sites further with bioinformatics? (There is a great story about CTCF site deposition by repeats - is this linked?) How conserved is CTCF site status between close mammals and, given that we know many of these sites are in different locations in mouse, is there something common to them we should be looking at? Does anyone want to... knock some of these sites out? Do we know how to phenotype them? Our work showed that CTCF is not the molecule involved in Barr body compaction, so... what is?

I am, more than just a bit, tempted to go down these rabbit holes. Someone should... But I also want to stay on the path of using quantitative genetics for basic biology. Oh, the trials of having too many interesting questions.


So - this was a great paper to start my group in quantitative genetics. In many ways, the HipSci project (which I am part of) is the really well-powered (many more samples) and better-constructed (iPSCs rather than LCLs) follow-up - this paper is sort of a training ground for what to expect from genetic effects on chromatin, and it joins a long line of RNA-seq QTLs (eQTLs) and DNaseI or methylation QTLs from other groups.






Tuesday, 28 October 2014

RNA is now a first-class bioinformatics molecule.

RNA research is expanding very quickly, and a public resource for these extremely valuable datasets has been long overdue.

Some 30 years ago, scientists realised that RNA was not just an intermediary between DNA and protein (with a couple of functions on the side), but a polymer that could fold into complex shapes and catalyse countless reactions. The importance of RNA was cemented when the structure of the ribosome was determined (something that Venki Ramakrishnan, Ada E. Yonath and Tom Steitz won the Nobel Prize for - e.g. here is Venki's Nobel lecture) and it was confirmed that the core function of ribosomes – making a peptide bond between two amino acids – is catalysed by ribosomal RNA and not by proteins. It’s also likely that RNA – not protein, not DNA – was the first active biomolecule in the primordial soup that gave rise to Life. Indeed, one could easily see DNA as an efficient storage scheme for RNA information, and proteins as an extension of single-stranded RNA’s catalytic capabilities, enabled by that monstrous enzyme, ribosomal RNA.

Even beyond RNA’s established role as the cell’s information carrier - the textbook mRNA - RNA-based interactions are widely recognised as being important. A real insight was the discovery of microRNA (miRNA): small RNAs whose actions lead to the down-regulation of transcripts by suppressing translation efficiency and cleaving mRNAs. MicroRNA has brought to life a whole new world of other small RNAs, many of which are involved in suppressing “genome parasites” – repeat sequences that every organism needs to manage.

And then there are the long RNAs in mammalian genomes that do not encode proteins (i.e. long non-coding RNAs - lincRNAs). These have long been recognised as having some significance – but what do they do? Some are clearly important, like the non-coding RNA poster child Xist, which inactivates one of the X chromosomes in female mammals to ensure the correct dosage of gene products. Others are involved in imprinting/epigenetic processes, for example the curiously named HOTAIR, which influences transcription on a different chromosome.

RNA: something missing

Discoveries in RNA biology have expanded the molecular biologist’s toolkit considerably in recent years. For instance, the cleavage systems from small RNAs can be used (in siRNA and shRNA ways) to knock down genes at a transcriptional level. The current “wow” technology, CRISPR/Cas9, is a bacterial phage defence system that uses an RNA-based component to adapt to new phages easily. This system has been repurposed for gene editing in (seemingly) all species – every genetics grant written these days probably has a CRISPR/Cas9 component.

And yet in terms of bioinformatics, RNA data was – until this past September – rather uncoordinated. There wasn’t a good way to talk consistently about well-known RNAs across all types, although this was sometimes coordinated in sub-fields such as Sam Griffiths-Jones’ excellent miRBase for miRNAs, or Todd Lowe’s gtRNAdb resource for tRNAs. But because RNA data was mostly handled in one-off schemes, researchers working in this area were hindered. Computational research couldn’t progress to the next stages, for example capturing molecular function and process terms with GO or collecting protein–RNA interactions in a consistent way.

RNAcentral in the bioinformatics toolkit

So I’m delighted to see the RNAcentral project emerge (http://rnacentral.org/). RNAcentral is coordinating the excellent individual developments emerging in different RNA subdisciplines: miRNAs, piRNAs, lincRNAs, rRNAs, tRNAs and many more besides. It provides a common way to talk about RNA, which in turn allows other resources – such as the Gene Ontology or drug interactions databases – to slot in, usually precisely in the same “place” as the protein identifier.
Alex Bateman, who leads the RNAcentral project, has been exploring a more federated approach, quite deliberately gathering the hard-earned, community-driven expertise of member databases in specific, specialised areas of RNA biology.


RNAs were, potentially, the first things on our planet that could be considered “alive”. They are critical components in biology, not just volatile intermediaries. In terms of bioinformatics, giving RNA the same love, care and attention as proteins is long overdue, and I look forward to seeing RNAcentral provide the cohesion and stability this area of science so richly deserves.