Gene data to hit milestone

Posted: July 18, 2012 at 9:17 pm

DNA microarrays allow researchers to analyse the expression of a huge number of genes simultaneously.

A. Nantel/Shutterstock

Purvesh Khatri sits in front of an oversized computer screen, trawling for treasure in a sea of genetic data. Entering the search term "breast cancer" into a public repository called the Gene Expression Omnibus (GEO), the postdoctoral researcher retrieves a list of 1,170 experiments, representing nearly 33,000 samples and a hoard of gene-expression data that could reveal previously unseen patterns.

That is exactly the kind of search that led Khatri's boss, Atul Butte, a bioinformatician at the Stanford School of Medicine in California, to identify a new drug target for diabetes. After downloading data from 130 gene-expression studies in mice, rats and humans, Butte looked for genes that were expressed at higher levels in disease samples than in controls. One gene was strikingly consistent: CD44, which encodes a protein found on the surface of white blood cells, was differentially expressed in 60% of the studies (K. Kodama et al. Proc. Natl Acad. Sci. USA 109, 7049–7054; 2012). The CD44 protein is not widely investigated as a drug target for diabetes, but Butte's team found that treating obese mice with an antibody against it caused their blood glucose levels to drop.
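The counting step behind a figure such as "differentially expressed in 60% of the studies" can be illustrated with a minimal Python sketch. It is not the method used by Butte's team: the function name, the data layout, the fold-change threshold and the toy numbers are all assumptions made for illustration, and a real meta-analysis would apply a proper statistical test within each study.

```python
from statistics import mean


def fraction_of_studies_up(studies, gene, min_ratio=1.5):
    """Share of studies in which `gene` is higher in disease than in control.

    `studies` is a list of dicts with "disease" and "control" keys, each
    mapping gene names to lists of expression values on a linear scale.
    `min_ratio` is a hypothetical fold-change cutoff, not a published one.
    """
    hits = 0
    counted = 0
    for study in studies:
        disease = study["disease"].get(gene)
        control = study["control"].get(gene)
        if not disease or not control:
            continue  # gene was not measured in this study
        counted += 1
        if mean(disease) >= min_ratio * mean(control):
            hits += 1
    return hits / counted if counted else 0.0


# Toy, made-up values purely for illustration.
toy_studies = [
    {"disease": {"CD44": [8.0, 9.0]}, "control": {"CD44": [4.0, 5.0]}},
    {"disease": {"CD44": [5.0, 5.0]}, "control": {"CD44": [5.0, 5.0]}},
    {"disease": {"CD44": [6.0, 6.0]}, "control": {"CD44": [3.0, 3.0]}},
    {"disease": {"OTHER": [1.0]}, "control": {"OTHER": [1.0]}},
]

share = fraction_of_studies_up(toy_studies, "CD44")
print(f"CD44 higher in disease in {share:.0%} of studies that measured it")
```

In this toy input the gene clears the cutoff in two of the three studies that measured it. The fourth study is skipped rather than counted as a miss.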

Butte and his team are now using publicly available data to answer a diverse range of questions. Khatri, for instance, hopes to discover secrets behind kidney-transplant rejection. "We don't do wet-lab experiments for discovery," he says. "Those are for validating hypotheses." The beauty of analysing data from multiple experiments is that biases and artefacts should cancel out between data sets, helping true relationships to stand out, Butte says. "There is safety in numbers."

And those numbers are rising rapidly. Since 2002, many scientific journals have required that data from gene-expression studies be deposited in public databases such as GEO, which is maintained by the National Center for Biotechnology Information in Bethesda, Maryland, and ArrayExpress, a large gene-expression repository at the European Bioinformatics Institute (EBI) in Hinxton, UK. Some time in the next few weeks, the number of deposited data sets will top one million (see "Data dump").

The result is an unprecedented resource that promises to drive down costs and speed up progress in understanding disease. Gene-sequence data are already shared extensively, but expression data are more complex and can reveal which genes are the most active in, say, liver versus brain cells, or in diseased versus healthy tissue. And because studies often look at many genes, researchers can repurpose the data sets, asking questions other than those posed by the original researchers.

Sources: NIH, EBI

It is easy to track how many data sets are being deposited; much harder is working out how they are being used. Heather Piwowar, who studies data reuse with the National Evolutionary Synthesis Center from the University of British Columbia in Vancouver, Canada, found that 20% of data sets deposited in GEO in 2005 and 17% of those in 2007 had been cited by the end of 2010. But those rates are certainly underestimates, she says. The PubMed Central repository, which her study relied on, holds only about one-third of the relevant papers, and her algorithms identify reuse only when researchers cite database accession numbers, which many don't do. "More studies are reusing data every year," she says. "We have every reason to believe it is game-changing."

"Having access to such data is immensely valuable," agrees Enrico Petretto, a genomicist at Imperial College London. "We would never be in a position to look across multiple tissues and species with the money we have." But he cautions that using other people's data can be tricky. If data sets give contradictory outcomes, it is unclear whether that is because the underlying data contradict each other or because something went wrong with the analysis. "That's why people sometimes don't trust this," he says.
