Friday, 14 October 2016

GA4GH: What? Why? How?

In August 2016, I was offered the position of Chair of the Global Alliance for Genomics and Health (GA4GH), which I was delighted to accept. To give a little more background to the announcement going out today, and to answer some of the questions people have put to me personally, I’ve written a bit of Q&A.

In this post I answer the three questions I am most frequently asked about the GA4GH, namely: What on Earth is it? Why am I becoming Chair? And how on Earth do I find the time for these things?

What is the Global Alliance for Genomics and Health?


The GA4GH is producing solutions for sharing genomic and clinical data responsibly. Low-cost, high-throughput sequencing has changed – and is still changing – the way we understand living things, from the basic science of life to human disease. Healthcare is a big focal point for this change: now that there is a critical mass of knowledge about the path from personal sequence to treatment decision, it can embrace routine sequencing of patient genomes, along with other molecular measurements.

But if we want a future in which all people can benefit from this change, we need to solve a number of technical, structural, security and ethical problems. The Global Alliance for Genomics and Health was set up to do just that.

At present, my overall impression of GA4GH is of a professional orchestra warming up: intense, but disjointed, activity and passion. The different sections are poised to work together, producing an ecosystem of harmonised technical and ethical standards.

How we measure


Over the next decade, healthcare will begin to change the way it collects molecular measurements from patients. ‘Genomics entering the clinic’ means that it will soon become a matter of routine to gather DNA, RNA, protein and metabolite data from patients, along with traditional information. (Note: I like to use ‘genomics’ in a broad sense, encompassing DNA, RNA, protein and metabolite measurement, partly because terms like ‘omics’ and ‘multiomics’ are clumsy and don't translate well beyond the field.)

Incorporating new measurements is not new to healthcare: for example, blood biochemistry has been a mainstay in medical practice for over 50 years, and clinical genetics has been used to successfully diagnose millions of people in recent decades. Oncologists routinely use the presence or absence of specific genetic loci in certain tumours to guide treatment.

I think it is generally accepted that these narrower, field-specific uses of genomics will give way to a more comprehensive, routine collection of many molecular measurements at the same time, applicable to many areas of healthcare.

Genomes, whole-blood transcriptomes, tumour RNA-seq and large-scale metabolomics can all provide relevant information that is useful in assessing individuals, and still more useful when analysed collectively.

A game-changing opportunity


This is all a bit hairy in terms of skills transfer and capacity, but it’s much more exciting in terms of opportunity.

The consequence of this is that healthcare, a massive chunk of the world economy (between 8% and 20% of GDP in developed-world economies), is going to be conducting high-throughput molecular phenotyping on humans – a wonderful, outbred mammal.

From the perspective of research, which has traditionally looked to other organisms to understand how they work before translating that knowledge to humans, this is an amazing opportunity. Being able to use human data directly – particularly by gathering the huge datasets generated in routine healthcare – will be transformative for science.

The sheer ‘firepower’ of healthcare means humans will be the most studied organism on the planet. No other animal will come close in terms of scale, detail and longitudinal sampling. What an opportunity for research – both basic and applied!

Turbulent waters


Repurposing healthcare data for research, at scale, will not be smooth sailing. There are real cross-currents around data, with different levels of access and rules of engagement buffeting one another.

Fundamentally, much of molecular biology research data is fully open, globally aggregated (e.g. ENA/GenBank, PDB, the Human Genome) or, in the case of human research subjects, distributed in accordance with different consents that patients have signed.

Healthcare data is completely different. It sits inside a thicket of national systems, each rooted deeply in its own legislation, language and societal norms, and the primary remit of each system is to keep its citizens healthy – not to create resources for research.

As generating molecular data becomes more a matter of routine, the constraints for access will doubtless change, perhaps without reference to research and its potential to create better long-term solutions.

Another interesting driver for change is patient engagement. Increasingly, clinical research has become more of a two-way relationship, with patients empowered to be owners of their personal measurement data, in addition to being donors.

Even assuming that all goes well and access issues are resolved, there is the matter of handling data on massive scales, and being equipped to analyse it. Engineering around large-scale genomic data is no trivial matter. One can’t simply slurp up spreadsheets or Stata files of genomes, transcriptomes and metabolomes – you need proper computational muscle.
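To make that concrete, here is a minimal sketch of the streaming style this scale of data demands (Python; the file name is a hypothetical placeholder): rather than loading a whole cohort VCF into a spreadsheet-like structure, you process it one record at a time.

```python
import gzip
from collections import Counter

# Hypothetical path -- stands in for any multi-gigabyte, compressed VCF.
VCF_PATH = "cohort.vcf.gz"

def count_variants_per_chromosome(path):
    """Stream a (possibly huge) VCF, tallying variants per chromosome
    while holding only one line in memory at a time."""
    counts = Counter()
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):        # skip meta-information and header
                continue
            chrom = line.split("\t", 1)[0]  # first VCF column is CHROM
            counts[chrom] += 1
    return counts

if __name__ == "__main__":
    for chrom, n in sorted(count_variants_per_chromosome(VCF_PATH).items()):
        print(f"{chrom}\t{n}")
```

The same pattern – stream, index, never assume the data fits in memory – recurs across genomes, transcriptomes and metabolomes.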

There is also the opposite 'knowledge flow': for healthcare to leverage genomics well, there are many practical problems for which research holds the solutions. We need these solutions and skills to flow from basic research into healthcare.

If we want a future in which we can all benefit from our investment in studying humans on the molecular level, we need to solve these problems.

The major challenges, in a nutshell


Technical problems require solutions for working with data on different scales in sensible, portable ways.

Structural problems can be resolved when we agree on how to represent reference data. That includes, for example, genomes and variants, but also the way we describe things. Metadata must be aligned to allow the transmission of key clinical data, and to allow data sharing more broadly.

(The devil is in the detail here. Consider the many ways one could represent ‘nested variation’ – a single nucleotide polymorphism on a structural insertion of DNA in the context of an inversion – something we tend to gloss over in both research and clinical practice. The sketch below illustrates one possible encoding.)
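Purely as an illustration – these classes are a hypothetical sketch, not a GA4GH standard – here is one of many possible encodings. The thing to notice is that each layer’s coordinates only make sense relative to its parent, which is exactly why agreeing on a single representation is hard.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Variant:
    """One of many possible encodings of nested variation (illustrative only)."""
    kind: str                  # e.g. "inversion", "insertion", "SNP"
    start: int                 # coordinates on the PARENT's sequence
    end: int
    alt: Optional[str] = None  # alternate allele/sequence, where applicable
    nested: List["Variant"] = field(default_factory=list)

# A SNP at offset 12 of a 200 bp insertion, which itself sits inside an
# inversion on the reference chromosome:
nested_variant = Variant(
    kind="inversion", start=10_000, end=25_000,
    nested=[
        Variant(
            kind="insertion", start=2_500, end=2_500, alt="<200 bp sequence>",
            nested=[Variant(kind="SNP", start=12, end=13, alt="T")],
        )
    ],
)
```

Flatten that structure into a conventional ‘position, reference, alternate’ record and the SNP becomes ambiguous or simply unrepresentable – the structural problem in miniature.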

Ethical and regulatory problems are perhaps the most discussed across the board, and we must find a way for bona fide researchers to access data within an appropriate framework, globally.

Security problems require tight coordination. The GA4GH aims to establish a federation in which datasets are appropriately accessible. That means we need access tools such as APIs and virtualisation schemes that work smoothly and securely, with absolute clarity and constant forward thinking about security.
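To sketch what federation can look like at the API level – the node URLs, endpoint and parameter names here are all hypothetical placeholders, not the actual GA4GH specifications – a Beacon-style query asks each site a minimal yes/no question, so raw patient records never leave the host institution:

```python
import requests

# Hypothetical federation members; each runs its own query endpoint.
NODES = [
    "https://beacon.example-hospital.org",
    "https://beacon.example-institute.edu",
]

# A minimal allele question: "does any dataset you hold contain a T at
# this position on GRCh38?"
QUERY = {
    "referenceName": "1",
    "start": 1_234_567,
    "referenceBases": "C",
    "alternateBases": "T",
    "assemblyId": "GRCh38",
}

def federated_allele_search(nodes, query):
    """Ask every node the same question; each node applies its own
    access rules locally and returns only a boolean, never raw data."""
    results = {}
    for node in nodes:
        try:
            resp = requests.get(f"{node}/query", params=query, timeout=10)
            resp.raise_for_status()
            results[node] = resp.json().get("exists", False)
        except requests.RequestException:
            results[node] = None  # node unreachable; record the gap
    return results

if __name__ == "__main__":
    for node, found in federated_allele_search(NODES, QUERY).items():
        print(node, "->", found)
```

The design point is that access control stays local to each node – which is what makes ‘appropriately accessible’ achievable across jurisdictions.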

But as with all complex issues, many of the challenges are some kind of combination of problems, or hide in the spaces in between.

GA4GH: an ambitious endeavour


Resolving so many challenges in a relatively short time is certainly ambitious, but it can be achieved. It isn’t easy technically or socially, because it will only be effective if it is global. But we know from experience that it isn’t impossible. 

The extensive work done already shows that it is tractable: for example, we already share large patient cohorts for joint analysis (mainly by data transfer), delivering many new insights.

We have established appropriate ethical access to these schemes. We have also demonstrated, in specific areas, that federation can work (e.g. MatchMaker Exchange for rare disease patient discovery) and that virtualisation is an effective approach (e.g. PanCancer Analysis). 

Many academic and commercial groups in the GA4GH already provide practical solutions, but they are not as well coordinated as they should be.

The goal of the GA4GH is to enable a future in which secondary use of healthcare-generated genomics data is routine and practical, and we already have a strong start.
We need to make existing ad-hoc schemes better by coming together more.

Why am I Chairing the Global Alliance?


GA4GH has been in operation for three years, led first by David Altshuler (now at Vertex Pharmaceuticals), then by Tom Hudson (now at AbbVie). David and Tom oversaw the establishment of GA4GH and grew it from 90 to 433 partner organisations. Under their leadership the GA4GH set up a series of technical, metadata, ethical, regulatory and security work streams, many of which have been very successful, if isolated. A number of exploratory projects have also been set up, though many seem to be driven by curiosity and personal interest.

My goal for the GA4GH over the next three years is to rebalance delivery and structure, building on the partnership’s existing strength in exploratory work. Many people in GA4GH, and some outside the Alliance, are eager to see more alignment, and there is an incredible pool of talent in engineering, genomics, clinical practice and ethics, all ready to come together around this.

We may not be able to solve every challenge, as many of these eventually merge with healthcare informatics generally. But I am confident that we will make substantial progress and achieve a far better world for both healthcare and research.

Where do I find time for these things?

(No, I do not have a Time-Turner.)


When people who know me heard that I had taken on another leadership role, they rolled their eyes and either berated me for not saying no, or simply asked how on Earth I would balance this with my other responsibilities.

I am stretched a bit thin, between my leadership roles at EMBL-EBI, ELIXIR and Genomics England, my consultancy for Oxford Nanopore and GSK, my advisory roles for other organisations, and my other professional responsibilities. Importantly, I also have a life outside of science: I am a Dad with two children and a wonderful wife.

How could I take on being Chair of the GA4GH, with everything else going on? How could I … not?

I am an endlessly curious, optimistic person and love bringing people together to make collaborations work, even if having such diverse commitments requires time slicing, and results in my being distracted. In fact, for the past three years I have been quite active in GA4GH, but at a very technical level. This role is more than just guiding one or two working groups.

Team work

Working in teams – tight or loose – and in close partnership is my default strategy, in both work and family life. My wife and I are very much equitable partners, with demanding careers and full-time jobs. We are both responsible for making sure the logistics work (and that we have backup plans), and for setting aside quality time with our children. That said, one of the drawbacks of being spread thin is that sometimes I will be at home, but completely distracted by work – something that drives the whole family a bit nuts. I know I am not alone in struggling with this. Like many people I feel that I short-change my family, even as they support me completely.

I could not function as Director of EMBL-EBI without Rolf Apweiler as joint Director, and the high level of trust we share. Although we are chalk and cheese (focused, organised German and messy, problem-orientated Brit), our complementarity is a real strength. I also lead EMBL-EBI research in partnership with Nick Goldman, and as a group leader I’ve partnered previously with Ian Dunham and now with Tom Fitzgerald to lead my research projects. I see my roles with Genomics England, Oxford Nanopore and GSK as providing help, support and constructive criticism, but as a consultant my interactions are limited.

Chairing the GA4GH will be a partnership role I share with the Alliance’s strong Executive Director, Peter Goodhand of the Ontario Institute for Cancer Research, who keeps many of the processes working smoothly. Together, we plan to recruit an active set of Vice-Chairs to provide a high level of strategic oversight.

I know there is enough talent in the GA4GH to deliver this.

Enabling talented people to deliver


High-level leadership is mainly about providing the right space and conditions for knowledgeable, talented people to step up and deliver. Being clear about the vision and direction is incredibly important, but setting out a vision often isn’t the most challenging aspect. The hard thing is to identify the people who have the right mindset and skills, and enable them to drive part of the work all the way through to delivery.

I am speaking from experience when I say that this is true for leadership generally, both in formal organisations like EMBL-EBI and more loosely coupled organisations such as GA4GH. The problems we are grappling with cannot be resolved single-handed; rather, we will be able to deliver practical solutions by aligning individuals and groups, and ensuring they have the right balance of skills, enthusiasm, resources and motivation.

No human is an island.


The adventure of understanding things is deeply exciting for me, whether it’s a well-known problem or an unexplored area of science, so the GA4GH is a project after my own heart. As with any ambitious endeavour, there are bound to be arguments and hard decisions of all shapes and sizes in the GA4GH. But the motivation of people to participate, the rewards of collaboration and the potential benefits to society are great.

I am very lucky to be surrounded by supportive, excellent colleagues on every level: the people who manage me, my peers around the world and those I manage. I am also lucky to be working in science, which thrives on collaboration, information exchange and support, and where just being reciprocally nice is an excellent strategy.


Working together, we are going to make the next few years of GA4GH amazing. I cannot wait.

Tuesday, 6 September 2016

Sharing clinical data: everyone wins

Patients who contribute their data to research are primarily motivated by a desire to help others with the same plight, through the development of better treatments or even a cure. Out of respect for these individuals, and to uphold the fundamental tenets of the scientific process, I’d like the clinical trials community to shift its default position on data sharing and reuse towards making data available on publication, as the life-science community does. This will enable more robust, rigorous research, create new opportunities for discovery and build trust between patients and scientists.

This aspiration is widely shared in the basic research community, and has been well articulated in considered, public discussions such as a series led by the National Academy of Sciences in 2003. Nevertheless, recent articles in the New England Journal of Medicine have pushed against data sharing, calling those who reuse data “research parasites” (followed by a bit of clarification) and proposing a lengthy and complex embargo procedure, potentially including payment, as the way to structure clinical trial data sharing.

A tradition of sharing


Sharing tools has (mostly) been the norm in genetics and molecular biology since the field’s earliest days, mainly because you couldn’t really get anything done unless people let you use their reagents. This has persisted for over a century, from the first studies of fly lines to cDNA clones, enzymes, antibodies and, now, ‘omics datasets.

The Protein Data Bank in the 1970s, the EMBL Bank (now ENA) and GenBank nucleotide collections in the 1980s and the Human Genome Project in the 1990s all thrived thanks to the norms of reagent sharing and data deposition, and the returns to science were – and are still – huge. Such practices are pragmatic in terms of both data quality and author credit, each of which provides incentives for researchers.

I am perhaps painting a beautiful picture of an imperfect world – there is still much to be done to ensure all this data sharing can work. Compliance, agreeing on things like adaptable standards, and keeping the infrastructure humming are all challenges we grapple with on a daily basis in molecular biology. But we have much to be proud of, and embracing the ethos of sharing has brought us a long way in a short time.

Data release: why?


Releasing data when you publish a paper isn’t about giving things up – although I can see that the lack of instant reward might make some feel that way. Data release is not about rewarding a single PI; it’s about benefitting the clinical research community as a whole, and making the most of the data entrusted to you by patients. So – why release data?

We are custodians, not owners, of patient data.


Patients participate in trials to further medical research, benefit from new medicines (potentially) and gain from focused care and advice. But numerous surveys have shown that participants are primarily motivated to share their data – the most valuable aspect of a clinical trial – by the altruistic desire to help others in the future.

So it is very strange that some researchers feel justified in assuming the data produced in a clinical trial is somehow their own scientific property. From the perspective of patient care, this position is particularly questionable when it impedes the ability of other scientists to re-examine the data for additional studies, which would contribute to the progress so eagerly desired by the participants.

If we’re not doing all this research to improve patient care, then probably we should change the consent process.

Challenging the interpretation of observations is fundamental to the scientific process.


Evidence is a wonderful thing. Our freedom to base our arguments on reproducible experiments dates back to the 17th Century, when people in Europe were finally permitted to openly discuss and debate science based on direct observation. Evidence is the backbone of scientific discourse, so it follows that papers without data can be easily dismissed as well-articulated speculation.

When a dataset is published, readers are then able to drill down to raw observations, and can verify methods or explore alternative explanations. Yes, this means they can potentially expose errors in your work and your thinking. But it’s far more common for readers to double-check the work against other published datasets, which can answer lots of different questions. Ultimately, this is a good thing for science.

Sharing data sharpens the mind.


The very real anxieties that come with data sharing are both individual and collective, because we are building knowledge together. Professional pride dictates that if your data will be open for inspection, you will be much more careful about the details. (After all that data cleaning and fixing, confounder/covariate discovery and adjustment, you do not want to be the one who left a howler for others to discover.)

Everyone knows there are skeletons in the data closet, mostly down to the complications of running real-life experiments, so current analyses make use of several approaches to boost confidence in the results. But generally speaking, just knowing your peers could be wandering through your data sharpens your mind and makes you focus on handling and presenting your analysis properly.
When an entire community does this, it benefits from a deeper consensus on what a “good study” looks like. That matters a lot.

Meta-analysis and Serendipity


When it can be done, meta-analysis (the combining of datasets) is a win–win–win (funders, scientists, patients). It’s about building on studies, combining them to gain new insights, asking different questions and finding new leads. Meta-analysis isn’t always possible – clinical trials often look at entirely different things, and even when they do study the same thing, they can’t always be aligned very well. But meta-analysis is only possible when people share their datasets.
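For a flavour of the arithmetic – with invented numbers, purely for illustration – the classic fixed-effect approach pools per-trial estimates by inverse-variance weighting, so more precise trials carry more weight:

```python
import math

# Toy per-trial effect estimates (e.g. log odds ratios) and standard
# errors -- invented numbers, purely for illustration.
trials = [
    {"effect": 0.42, "se": 0.21},
    {"effect": 0.30, "se": 0.15},
    {"effect": 0.55, "se": 0.30},
]

def fixed_effect_meta(trials):
    """Inverse-variance (fixed-effect) meta-analysis: weight each trial
    by 1/SE^2, then pool."""
    weights = [1.0 / t["se"] ** 2 for t in trials]
    pooled = sum(w * t["effect"] for w, t in zip(weights, trials)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled, pooled_se

effect, se = fixed_effect_meta(trials)
print(f"pooled effect = {effect:.3f}, "
      f"95% CI = ({effect - 1.96 * se:.3f}, {effect + 1.96 * se:.3f})")
```

None of the individual trials is definitive on its own, but pooling narrows the confidence interval – which is the whole appeal of combining shared datasets.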

Serendipity is another benefit of data sharing – I am always amazed at how important it is for science. Serendipity has guided us to some seriously profound insights, for example the evolutionary relationship between the malaria parasite and plants, or how metabolic enzymes have been recruited as lens crystallins. It’s been behind many of the completely weird discoveries that make biology so wonderful, and many practical discoveries, such as CRISPR, that push the frontiers of possibility.

I’ve stumbled happily upon Serendipity many times, and very often others have made serendipitous discoveries based on data or methods I have published. You’d have to be pretty cynical to begrudge your fellow scientists such pleasure, and, frankly, a bit petty to fret over whether they’ll remember to credit you (nearly all scientists carefully reference their sources, if only to reassure reviewers of the credibility of the data they use).

For funders, both meta-analysis and serendipitous discoveries compound their return on investment and make them look good. For scientists, being able to make use of comparable data to verify or cross-validate their work, or to make unplanned discoveries, is invaluable. For patients, knowing their contribution is being used in lots of different and useful ways can give a sense of pride.

Sceptical about whether this really applies to clinical research? Well, without having access to a large number of trials, I doubt anyone could say.

Having more large datasets on hand for meta-analysis can only benefit those planning and analysing the results of clinical trials. And as clinical trials begin to incorporate more high-dimensional, data-rich measurements (e.g. imaging, metabolomics, multi-omics) – and to share them – there will be plenty of opportunities to carry out sophisticated meta-analysis.

As for Serendipity, well, it can strike at any time.

The scoop


It is hardly possible for anyone to “scoop” you simply because you released your data on publication – particularly if that dataset represents only what is needed to support your paper. If someone else looks at that data and comes up with an interesting observation you missed, they can potentially make that corner of science a little bit better. Dwelling on the negatives will get you nowhere, but looking on the bright side may land you a new collaborator.

If the only datasets you share at publication time are those that relate specifically to that paper, there is no need for complicated embargo rules that provide authors enough time to perform a full analysis on all the data collected (as proposed in the most recent NEJM editorial). Tracking and versioning might become more complicated with later papers, but this approach does the important job of tying the datasets to the publication in a reasonable timeframe, opening up that piece of science for proper verification and discourse.

If you really believe you are going to be scooped for some missing analysis on a dataset, the solution is to delay publication. If you’re worried that making your data public will expose you to undue criticism, make your analysis bulletproof. That will be good for you and for the system as a whole, as understanding the strengths and weaknesses of different analyses only makes the community stronger.


When data sharing is not straightforward


Human subjects


No matter what, we have to honour patient consent. As scientists we may wish such agreements were more future-proof, but when those consents preclude data sharing beyond the study group, we have to accept it and move on.

Exactly how to future-proof consents for clinical trials is no simple matter. One solution would be for funding agencies or regulators to begin insisting that consent forms provide a reasonable level of research access, which would facilitate research but respect the privacy of individuals.

Currently, for genetic studies, there is a lightweight vetting process, involving both individual and institutional sign-off, which assures patients that researchers will perform appropriate research on the dataset. It is a clunky approach and certainly needs improvement, but it is functional.

Standards and infrastructure


Data sharing is only feasible if the parties involved are able to do it, without worrying that they’ll run into trouble transferring files from one site to another, or that their data will disappear into some kind of black hole.

A robust, global archive for this kind of information would be one important piece of a larger infrastructure that would make biomedical data sharing straightforward. The EMBL-EBI model – biomolecular archives supported by international collaborations – is a solid example. Funding for infrastructure like this is huge value for money, and costs little in the context of global clinical research funding.

CDISC standards are functional, and well used by the clinical trials community. But there is a constant need to review standards and establish new ones for emerging technologies. This work never ends, but the end goal of harmonisation (i.e. to support meta-analysis) is a good one, and the whole process helps us along on our eternal quest for a shared language.

Regulatory and commercial concerns



I do not have a lot of experience in this area, but it’s clear that regulation of clinical trials is a huge deal for the pharmaceutical industry. Any data release policy needs to work well for regulators and for commercial interests, whose concerns can differ from academia’s. For both, the science performed in clinical trials must be very sound, so the mind-sharpening step of data release is certainly of value – and most companies I know are delighted when other science comes out of the data they release.

Evidence is beautiful


In this on-going debate about data, let us base our arguments on… data. We are all likely to change our views when presented with compelling data and well-reasoned analysis, which is one of the nice things about being a scientist.

Refreshingly, for the most part I do not think this debate is one of those boring political ones where everyone chooses a side, closes their ears and steels themselves for uncomfortable dinner-table discussions. Scientists already working in an open-data environment understandably campaign for everyone to join them – though they are fully aware of the downsides. Scientists working in clinical trials can see there are advantages to sharing data, but have neither the time nor the inclination to sort out the myriad details that would make it workable.

As a starting point, we can focus on the simplest, tried-and-tested approach of publishing your data alongside your narrative – a practice that has served science well for over 300 years. But more importantly, we can keep the discussion going, and work with one another to overcome the barriers to realising the full potential of biomedical research. That would be a win for scientists, their funders and, most importantly, patients themselves.