Thursday, 7 May 2015

Human as a model organism

Model organisms have provided the foundation for building our understanding of life, including human disease. Homo sapiens has joined this select group, adding knowledge we can apply to our myriad companion species. But to resolve even one small part of the moving, shifting puzzle of life, we need them all.

Biology is incredibly complex. Even the simplest bacteria make intricate decisions and balance different demands, all via chemical reactions happening simultaneously in what seems like just a bag of molecules, called a cell. Larger organisms all start as a single cell and eventually become living creatures that can fly, or slither, or think – sometimes living for just a day and sometimes for centuries.

Whatever the process, whatever the outcome, it all begins with information, recorded in a tiny set of molecules (DNA) in the very first cell. How that information made it this far, and how it is now composed, comes down to the twin processes of random change (mutations) and competition between individuals, giving rise to evolution. Evolution has, quite amazingly, given rise to everything from uranium-feeding bacteria to massive sequoias and tax-filing, road-building, finger-painting humans.


Modelling life


Unpicking this complexity is hard, in part because so many things are happening all at once. We’ve been working on it for centuries, building layer upon layer of knowledge collectively, in many labs throughout the world, usually relying on specific organisms in which we accumulate large amounts of knowledge about the processes of life. These ‘model’ organisms, for example the gut bacterium E. coli, are selected for their ease of husbandry and other features of their biology. Interestingly, most of them have been our companions or domesticated in some way throughout our explosive growth as a species.

To create models of animal life processes at the simplest level, we use organisms like the European and African yeasts (used for both baking bread and making beer), which have a nucleus (like all animals, they are eukaryotes). We use the humble slime mould, which spends most of its time as a single cell but, in extremis, will band together to form a proto-organism that has given us insights into signalling. Taking it up a notch, we are helped by pests that have lived off our rubbish since our earliest days in Africa: fruit flies, mice and rats provide profound insights into animal life. Even the model worm C. elegans, which helps us understand development, could be considered ‘semi-domesticated’ (though no one really knows where ‘wild’ C. elegans might live).

Each of these models has its strengths and weaknesses: the time it takes to breed generations, the effort involved in handling them, the availability of automated phenotyping systems, the flexibility (or lack thereof) of their cellular lineage, and more exotic features, such as balancer chromosomes, RNAi by ingestion and chromosomal engineering. But they all share one distinct quality: they are not human.


Using ourselves?


Using Homo sapiens as a model species to understand biology has many advantages, and some important drawbacks. Leaving aside for a moment the interaction with research into human disease, what are the benefits of using ourselves as an organism on which to model basic, fundamental life processes?

· Humans are large, so we can acquire substantial amounts of material from consented individuals, either from living persons (e.g. blood) or via autopsy;

· The extremely large population can be accessed relatively easily, with no on-going husbandry costs;

· Wild observational studies (i.e. epidemiology) are feasible to deploy at scale, though at considerable cost;

· The population has good genetic properties: it is outbred, and mating is fairly random with respect to genotype, usually with only geographic stratification;

· Many phenotyping systems are designed explicitly for this organism, in some cases with a high level of automation;

· An on-going, proactive screening process for rare, interesting events (i.e. ‘human clinical genetics’) is available in many parts of this population, at the scale of millions of screening events each year;

· Cells from this organism can be cultured routinely using iPSC techniques, and these cellular systems can be genetically modified and made into functional tissue-scale organoids;

· Limited intervention studies are feasible (if expensive);

· Research on this organism is well funded, thanks to widespread interest in human disease.

The drawbacks:

· There are no inbred lines for Homo sapiens;

· The large size and tissue complexity of this species, in particular the brain, present significant challenges to understanding cellular and tissue behaviour;

· The organism cannot be kept in a strictly defined environment (though an increasing number of aspects can be monitored in observational studies);

· Explicit genetic crosses cannot be done (though the large number of individuals makes it possible to observe many genetic scenarios in the population);

· Genetically modified cells cannot be used to make an entire organism;

· Intervention studies are quite limited by both safety and expense;

· Ethical issues, which are important when studying any species, are more involved for Human – even for basic research.


An old story


Using Homo sapiens as a model species is not a new idea – it has been around since the dawn of genetics and molecular biology. Studies of human height motivated the early theory of quantitative genetics. Quite a bit of mammalian (and general eukaryotic) biochemistry and genetics was originally uncovered through discoveries of inborn errors of human metabolism in the 1960s and 1970s, and was confirmed by biochemistry studies in cow and pigeon tissue. And robust cancer-derived cell lines – most famously HeLa cells – have been used in molecular biology for decades.

But the downsides to using humans as a model species are far fewer in number now than they were two decades ago, when the human genome was considered to be so large that a major, global consortium was required to generate it. Today the human genome is dwarfed in size and complexity by the genomes of bread wheat and pine, which are being untangled as a matter of course. The cost of human genetics studies has plummeted, so large populations are more accessible and easily leveraged (a genotyping array now costs under €50 and sequencing under €1,000) – a major benefit for statistically robust studies. The result has been a resurgence of both common- and rare-variant genetic approaches. The drop in sequencing cost has also enabled more scalable assays, such as RNA-seq and ChIP-seq, which let us work routinely at the scale of a whole human genome.

A decade ago, the gap between what was experimentally feasible in Drosophila, C. elegans or the yeasts and what was feasible in human beings was far wider. The landscape has changed.


Human disease


The global economy is a human concept (though it affects all species) and a big chunk of it (10% to 17% in industrialised economies) is spent on healthcare. That is a huge amount of money. A considerable amount is already spent on clinical research, but the advent of inexpensive techniques to measure DNA, RNA, proteins and metabolites presents massive, new opportunities. It is now possible to blend scientific approaches that have traditionally been separate – experimental medicine and genomics, or epidemiology and bioinformatics – to exploit these measurement techniques alongside traditional clinical approaches.

The primary motivation for all this activity and expense is to understand and control human disease. But health and disease are constantly in flux, in humans as in all species, and often the process of understanding human disease is really just the same as understanding human biology – and that’s not so different from understanding biology as a whole. Fitting all the pieces together requires taking the best from all fields, and that is in itself a huge challenge.


Traditional models, rebooted


There is justifiable excitement around the new opportunities to study humans as a model organism, but it is simply not the case that the established model organisms will become less and less relevant. Placing too strong an emphasis on Human studies could inadvertently hinder research on other organisms, which would be counterproductive.

‘Model’ organisms help us create ‘models’ of life processes – they do not serve simply as ‘models’ of the human organism. Our understanding of molecular biology is still quite basic: we have a firm grasp of the function of only just over a third of human protein-coding genes, and this fraction is not much higher in simple, well-studied organisms such as yeast. Even where we have ‘established the function’ of a set of genes and can tie them to a specific process, we still have huge gaps in our comprehension of how these particular molecules create such exquisitely balanced, precise processes.

Leveraging the unique properties of different model organisms provides opportunities to innovate. For example, one remarkable paper demonstrates how a worm ‘thinks’ in real time, monitoring the firing of each individual (specifically known) neuron in the animal as different cues are passed over its nose. The growing set of known enhancers in Drosophila allows for the genetic ablation of many cells, and the incredible precision of mouse genetic engineering allows defined molecular components to be triggered at will. None of these experiments would be even remotely feasible in Human.
We would be very foolish to take a laser-like focus on this rather eccentric bipedal primate, however obsessed we might be with keeping it healthy, happy and long-lived.

Our understanding of development in organisms, of homeostasis within organs, tissues and cells, and of the intricacies of behaviour is only just beginning to take shape. Metaphorically speaking, we have lit a match in a vast, dark hall – the task of illuminating the processes that drive these molecules to create full systems (that go on to type blog posts) is daunting, to say the least.


Hedging our bets


There are many hard miles of molecular and cellular biology ahead, with leads to follow in many different models (including human!) using many different approaches. The deeper understanding of biology that results will directly shape our understanding of human disease in the future. We need to spread our bets across this space.

Clinical researchers might have a harder time managing this, as the necessary focus on Human to understand human disease makes it all too easy to dismiss the future impact of other organisms on understanding human biology. Yet the majority of the molecular knowledge they currently deploy in their research was built on studies of a very diverse set of organisms. Useful and surprising insights and technologies can be gleaned from any organism.

Basic researchers, on the other hand, might dismiss the rise of human biology as placing inappropriate emphasis on applied research into the specifics of human disease. But not all human studies are translational, and in any case the interweaving of understanding biology and understanding disease makes it impossible to really separate the two concerns.


To Human and back again


Over the next decade, the integration of molecular measurements with healthcare will deepen. This will almost certainly have a beneficial impact on the lives and health of many people worldwide. It also provides huge opportunities for the research community – obviously for applied research but also for curiosity-driven enquiry, as this massive part of our economies will generate and manage information on ourselves.


We should exploit this to its fullest so that we can understand life, on every scale, in every part of the world we inhabit.

Monday, 19 January 2015

Untangling Big Data

"Big Data" is a trendy, catch-all phrase for handling large datasets in all sorts of domains: finance, advertising, food distribution, physics, astronomy and molecular biology - notably genomics. It means different things to different people, and has inspired any number of conferences, meetings and new companies. Amidst the general hype, some outstanding examples shine forth and today sees an exceptional Big Data analysis paper by a trio of EMBL-EBI research labs - Oliver Stegle, John Marioni and Sarah Teichmann - that shows why all this attention is more than just hype.

The paper is about analysing single-cell transcriptomics. The ability to measure all the RNA levels in a single cell simultaneously - and to do so in many cells at the same time - is one of the most powerful new technologies of this decade. Looking at gene regulation cell by cell brings genomics and transcriptomics closer to the world of cellular imaging. Unsurprisingly, many of the things we've had to treat as homogeneous samples in the past - just because of the limitations of biochemical assays - break apart into different components at the single-cell level. The most obvious examples are tissues, but even quite "homogeneous" samples separate into different constituents.

These data pose analysis challenges, the most immediate of which are technical. Single-cell transcriptomics requires quite aggressive PCR, which can easily be variable (for all sorts of reasons). The Marioni group created a model that both measures and accounts for this technical noise. But in addition to technical noise there are other large sources of variability, first and foremost of which is the cell cycle. 
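
The published model is more sophisticated than anything I can show here, but the general idea can be sketched simply: spike-in controls are added at the same true abundance to every cell, so any variation you see in them is technical. Fit how that technical variability scales with expression level, and you have a baseline against which to judge the genes. Below is a minimal, hypothetical sketch in Python (the function name, array shapes and the fitted mean-variance form are my assumptions, not the paper's):

```python
import numpy as np

def technical_noise_excess(spike_counts, gene_counts):
    """spike_counts, gene_counts: (n_features, n_cells) arrays of normalised
    counts; the spike-ins carry technical noise but no biological variation."""
    mu_s = spike_counts.mean(axis=1)
    cv2_s = spike_counts.var(axis=1) / mu_s ** 2
    # Fit CV^2 ~ a1/mu + a0 to the spike-ins: a simple mean-variance model
    # for the technical component.
    A = np.column_stack([1.0 / mu_s, np.ones_like(mu_s)])
    a1, a0 = np.linalg.lstsq(A, cv2_s, rcond=None)[0]
    # Genes whose observed CV^2 exceeds the technical expectation at their
    # expression level are candidates for genuine cell-to-cell variability.
    mu_g = gene_counts.mean(axis=1)
    cv2_g = gene_counts.var(axis=1) / mu_g ** 2
    return cv2_g - (a1 / mu_g + a0)
```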


Cell cycle redux

For the non-biologists reading this, cells are nearly always dividing, and when they're not they are usually paused in a specific state. Cell division is a complex dance: not only does the genome have to be duplicated, but much of the internal structure also has to be split - the nucleus has to disassemble and reassemble each time (that's just for eukaryotic cells, not bacteria). This dance has been pieced together thanks to elegant research conducted over the past 30 years in yeast (two different types), frog cells, human cells and many others. But much remains to be understood. Because cells divide multiple times, the fundamental cycle (the cell cycle) has very tightly defined stages in which specific processes must happen. Much of the cell cycle is controlled by both protein regulation and gene regulation. Indeed, the whole process of the nucleus "dissolving", sister chromosomes being pulled to either side, and the nucleus reassembling has a big impact on RNA levels.

When you are measuring cells in bulk (i.e. 10,000 or more at the same time), the results will be weighted by the different 'lengths of stay' in the different stages of the cell cycle. (You can sometimes synchronise the cells, which is useful for research into the cell cycle itself, but it's hard to do routinely on any sample of interest.) Now that we have single-cell measurements, which presumably tell us something about cell-by-cell variation, we also have an elephant in the room: massive variation due to the cells being at different stages of the cell cycle. Bugger.
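
To make that 'weighting' point concrete, here is a toy calculation with made-up numbers: the bulk signal is just a dwell-time-weighted average over cell-cycle stages, so two genes with very different stage-specific profiles can end up looking unremarkable in bulk.

```python
import numpy as np

# Hypothetical dwell-time fractions for G1, S and G2/M, and invented
# stage-specific expression levels for two genes.
stage_fraction = np.array([0.5, 0.3, 0.2])
stage_expression = np.array([[10.0, 12.0, 30.0],   # gene A: peaks in G2/M
                             [ 5.0, 20.0,  5.0]])  # gene B: peaks in S
bulk = stage_expression @ stage_fraction            # what a bulk assay reports
print(bulk)   # [14.6  9.5] - the stage-specific structure is averaged away
```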

One approach is to focus on cell populations that have paused (presumably in a consistent manner) in the cell cycle, like dendritic cells. But this is limiting, and many of the more interesting processes happen during cell proliferation; for example, Sarah Teichmann's favourite process of T-cell differentiation nearly always occurs in the context of proliferating cells. If we want to see things clearly, we need to somehow factor out the cell-cycle variation so we can look at other features.


Latent variables to the rescue

Taking a step back, our task is to untangle many different sources of variation - technical noise, the cell cycle and other factors - understand them, and set them to one side. Once we do that, the interesting biology will begin to come out. This is generally how Oliver Stegle approaches most problems, in particular using Bayesian techniques to coax unknown, often complex factors (also called 'latent variables') out of the data. For these techniques to work you need a lot of data (i.e. Big Data) to allow for variance decomposition, which shows how much each factor contributes to the overall variation.
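
As a caricature of what 'variance decomposition' means in practice (the real work uses Bayesian latent-variable models, not the toy linear fit below), you can think of attributing a slice of each gene's variance to each factor and leaving the rest unexplained. This sketch assumes the factors are already estimated and roughly uncorrelated; every name in it is a placeholder:

```python
import numpy as np

def variance_shares(y, factors):
    """y: one gene's expression across cells.
    factors: dict mapping a name (e.g. 'cell_cycle', 'technical') to a
    per-cell covariate. Returns the fraction of variance attributed to each
    factor, assuming the factors are roughly uncorrelated with one another."""
    y = y - y.mean()
    total = y.var()
    shares = {}
    for name, f in factors.items():
        f = f - f.mean()
        beta = (f @ y) / (f @ f)           # univariate least-squares slope
        shares[name] = (beta * f).var() / total
    shares["unexplained"] = max(0.0, 1.0 - sum(shares.values()))
    return shares
```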

But even the best algorithm needs good targeting. Rather than trying to learn everything at once, Oli, John and Sarah set up the method to learn the identity of cell-cycling genes from a synchronised dataset - learning both well-established genes and some anonymous ones. They then brought that gene list into the context of single-cell experiments to learn the behaviour of these genes in a particular cell population, paying careful attention to technical noise. Et voilà: one can split the variation between cells into 'cell-cycle components' (in effect, assigning each cell to its cell-cycle stage), 'technical noise' and 'other variation'.
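
A much-simplified sketch of that workflow (this is not the authors' code, and it skips the explicit technical-noise model): summarise the annotated cell-cycle genes with one latent factor per cell, regress that factor out of every gene, and keep the residuals as the 'other variation'. Function names and shapes are placeholders:

```python
import numpy as np

def factor_out_cell_cycle(log_expr, cc_gene_idx):
    """log_expr: cells x genes matrix (log scale).
    cc_gene_idx: column indices of cell-cycle genes learned elsewhere,
    e.g. from a synchronised dataset."""
    X = log_expr - log_expr.mean(axis=0)                 # centre each gene
    # One latent factor per cell: the first principal component of the
    # annotated cell-cycle genes, roughly a position in the cycle.
    U, S, _ = np.linalg.svd(X[:, cc_gene_idx], full_matrices=False)
    factor = U[:, 0] * S[0]
    # Regress the factor out of every gene; the residuals are the 'other
    # variation', and the explained fraction is the cell-cycle component.
    beta = (factor @ X) / (factor @ factor)
    residuals = X - np.outer(factor, beta)
    explained = 1.0 - residuals.var(axis=0) / X.var(axis=0)
    return residuals, explained
```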

This really changes the result. Before applying the method, the cells looked like one large, variable population. After factoring out the cell cycle, two subpopulations emerged that had been hidden by the overlay of variable cell-cycle position, cell by cell, and those subpopulations correlated with aspects of T-cell biology. Taking it from there, they could start to model other aspects, such as T-cell differentiation, as specific latent variables.


You say confounder, I say signal

We are going to see variations on this method again and again (in my research group, we are heavy users of Oliver's latent-variable work). This variance decomposition is about splitting the different components apart so that each can be seen more clearly. If you are interested in the cell cycle, or in how its components differ between cell populations, it will be incredibly useful. If you are interested in differentiation, you can now "factor out" the cell cycle. Conversely, you might only be interested in the cell cycle and prefer to drop the other biological sources of variation. Even the technical variation is interesting if you are optimising the PCR or machine conditions. 'Noise' is a pejorative term here - it's all variation, with different sources and explanations.

These techniques are not just about the cell cycle or single-cell genomics. Taken together, they represent a general mindset of isolating, understanding and ultimately modelling sources of variation in all datasets, whether they come from cells, tissues, organs, whole organisms or populations. It is perhaps counter-intuitive, but if you have enough samples with enough homogeneous dimensions (e.g. gene expression, metabolites or other features), you can cope with data that is otherwise quite variable by splitting out the different components.

This will be a mainstay for biological studies over this century. In many ways, we are just walking down the same road that the founders of statistics (Fisher, Pearson and others) laid down a century ago in their discussions on variance. But we are carrying on with far, far more data points and previously unimaginable abilities to compute. Big Data is allowing us to really get a grip on these complex datasets, using statistical tools, and thus to see the processes of life more clearly.