A few weeks ago I had the great pleasure of attending BioC 2016: Where Software and Biology Connect, a conference held at Stanford, where I learned a lot! It wouldn’t have been possible without the scholarship I received from Bioconductor (the organizers), which I deeply appreciate. It was an excellent place for software developers, statisticians and biologists to exchange experiences and to better explain their work, as understanding between collaborators in interdisciplinary teams is essential. In this post I present my thoughts and feelings about the event and share what I learned there, in particular the many ways of downloading The Cancer Genome Atlas data.
- Conference overview
- Get TCGA Data!
The day before the conference a Developer Day was organized. It was a good opportunity to find out what the current projects in Bioconductor are about and what the future plans for the Bioconductor project are. I didn’t know that most Bioconductor developers are so keen on object-oriented programming (I normally create tools for visualization and reproducible research); most of them were explaining improvements to Bioconductor S4 classes. Developer Day was hosted by Martin Morgan, who let everyone introduce themselves, which was the spark that got the networking going! The funniest idea of the day was splitting attendees into 2 separate rooms (even though we could fit in 1 room), so most of the talks (and jokes!) had to be given twice, once in each room. This was a cool idea, except that I, as a volunteer giving a lightning talk about the RTCGA project (presentation - rtcga.github.io/RTCGA/BioC2016/), couldn’t hear the presentations given in the other room while I spoke in mine.
The conference was divided into 2 days. Each day started with talks from invited and community speakers - you can find all presentations on the conference’s website. I really enjoyed Sandrine Dudoit’s talk about various approaches to normalization and clustering and metrics for selecting the best one (new package clusterExperiment). Susan Holmes also gave an interesting talk about Multicomponent data integration for the Human Microbiome, but my favourite talk was Jenny Bryan’s Spreadsheets: the Data Format we Love to Hate. I was overwhelmed by the statistics she presented on Excel’s popularity compared with Python and R. It seems we’ll need to learn how to live with people using Excel, because there are too many of them and R users are in the minority!
Moreover, the organizers invited 2 big names: Rob Tibshirani and Dr. Robert Gentleman (Developer Day Keynote Address). Rob Tibshirani tried to convince everyone that most of the inference done after glmnet feature selection is not valid, and proposed a new method, implemented in his new package selectiveInference. To my big surprise we also had a chance to listen to Ramnath Vaidyanathan from Alteryx talk about a simple and quick interface for creating an htmlwidget - interactive visualizations that do not have to be powered by shiny applications. Ramnath created a new widget during his 15 min talk!
I would add Jim Hester to the big names section, as his presentation sparked a heated discussion about the future of installing packages from Bioconductor. The new function
devtools::install_bioc() might be a competition for
After lunch attendees had a chance to join workshops. I would like to recommend the Introduction to ImmuneSpaceR workshop by Renan Sauteraud from Fred Hutchinson Cancer Research Center for intermediate R users. I learned a bit about the R5 classes that Renan uses in his package and discovered many databases that I can drain like a data vampire with the help of his ImmuneSpaceR package, which is already in Bioconductor’s release branch (slides, GitHub).
Get TCGA Data!
The Cancer Genome Atlas project is an effort to provide publicly available, high-quality clinical and genomic information collected from tumour (and healthy) samples gathered from patients suffering from over 30 types of cancer. If you haven’t heard about this project, you can check the slides from my BioC2016 lightning talk, where you will find more helpful definitions and links.
RTCGA factory of R packages
I started the RTCGA project almost a year and a half ago. It provides a software package called RTCGA, which is available for download from Bioconductor. This package enables you to download any dataset from TCGA for any cohort type and for every release date (TCGA releases its datasets over time), and it also provides datasets in a whole family of R data packages.
To get information about the datasets available in RTCGA see ?datasetsTCGA, or check ?checkTCGA to find out the possible parameter values for ?downloadTCGA, which downloads any dataset if those already prepared for R do not satisfy your needs.
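As a rough illustration of how these help pages fit together, the sketch below first lists what is available and then downloads one dataset. It assumes RTCGA is installed from Bioconductor and that the servers it queries are reachable. The wrapper name fetch_brca_clinical and its dest_dir argument are made up for this example, and "Merge_Clinical.Level_1" is only the dataset name I would expect for clinical data, so compare it with the checkTCGA() listing before relying on it.

```r
# Hypothetical wrapper (not part of RTCGA): list what is available for the
# BRCA cohort and download its clinical dataset into dest_dir.
fetch_brca_clinical <- function(dest_dir = file.path(tempdir(), "tcga")) {
  if (!requireNamespace("RTCGA", quietly = TRUE)) {
    stop("RTCGA is not installed; get it from Bioconductor first.")
  }
  # downloadTCGA() writes into an existing directory, so create it up front.
  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)

  # Release dates for which data snapshots exist.
  dates <- RTCGA::checkTCGA(what = "Dates")

  # Datasets available for the BRCA cohort (latest release by default).
  datasets <- RTCGA::checkTCGA(what = "DataSets", cancerType = "BRCA")

  # Assumption: "Merge_Clinical.Level_1" matches a clinical dataset name
  # reported in `datasets`; adjust it if the listing says otherwise.
  RTCGA::downloadTCGA(
    cancerTypes = "BRCA",
    dataSet     = "Merge_Clinical.Level_1",
    destDir     = dest_dir
  )

  list(dates = dates, datasets = datasets, files = list.files(dest_dir))
}
```

Calling fetch_brca_clinical() returns both listings together with the names of the files that ended up in dest_dir, which makes it easy to see which release was actually fetched.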
Today I got a message from Dario Strbenac (author of the ClassifyR package), who wrote that “TCGA Data Portal has closed down and the data moved to Genomic Data Commons”, so you might encounter some issues with downloadTCGA for a few days. This is good news, though, as I can close the RTCGA project by putting a final TCGA data snapshot into ExperimentHub. This process is already being moderated by Valerie Obenchain on the project’s GitHub repository.
Almost at the same time as RTCGA reached Bioconductor, another package called RTCGAToolbox was published with similar functionality. It is as popular as RTCGA, if you look at the packages’ download stats.
To download data from Firehose you could use the getFirehoseData function, but it can download only 15 data types and the latest possible snapshot date is
RTCGA::downloadTCGA could download datasets even from
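For comparison, a minimal RTCGAToolbox session could look like the sketch below. It assumes RTCGAToolbox is installed and Firehose is reachable. The wrapper name fetch_brca_firehose is hypothetical, and only the dataset and runDate arguments are passed, because as far as I know the names of the other arguments have changed between package versions (check ?getFirehoseData for the version you have).

```r
# Hypothetical wrapper (not part of RTCGAToolbox): fetch BRCA data for one
# of the most recent Firehose run dates.
fetch_brca_firehose <- function() {
  if (!requireNamespace("RTCGAToolbox", quietly = TRUE)) {
    stop("RTCGAToolbox is not installed; get it from Bioconductor first.")
  }

  # Cohort codes known to the package; fail early if BRCA is not among them.
  cohorts <- RTCGAToolbox::getFirehoseDatasets()
  stopifnot("BRCA" %in% cohorts)

  # The last few run dates. Assumption: the first element is the most recent.
  run_dates <- RTCGAToolbox::getFirehoseRunningDates(last = 3)

  # All other arguments are left at their defaults on purpose.
  RTCGAToolbox::getFirehoseData(dataset = "BRCA", runDate = run_dates[1])
}
```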
bigrquery - R interface to Google BigQuery
The approach using bigrquery was new to me. We used this package during the Facilitated Discussion: Approaches to Data Modeling at the Developer Day.
Developer Day precedes the main conference on June 24, providing developers and would-be developers an opportunity to gain insights into project direction and software development best practice
You can have a look at a basic way of extracting clinical information with bigrquery.
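A minimal sketch of such a query, written against the current bigrquery interface rather than the one we used at the Developer Day, might look as follows. Both function names are made up for this example. The billing project and the fully qualified id of the BigQuery table holding TCGA clinical data are assumptions that you have to supply yourself, since this post does not name them, and you need Google Cloud credentials for the query to run.

```r
# Hypothetical helper: build a Standard SQL query that returns the first
# n_rows rows of a fully qualified table ("project.dataset.table").
build_clinical_query <- function(table, n_rows = 100L) {
  sprintf("SELECT * FROM `%s` LIMIT %d", table, as.integer(n_rows))
}

# Hypothetical wrapper: run the query under your own billing project and
# download the result as a data frame.
fetch_clinical_bq <- function(billing_project, table, n_rows = 100L) {
  if (!requireNamespace("bigrquery", quietly = TRUE)) {
    stop("bigrquery is not installed; get it from CRAN first.")
  }
  sql <- build_clinical_query(table, n_rows)

  # bq_project_query() submits the query and returns a reference to the
  # temporary result table; bq_table_download() pulls it into R.
  result <- bigrquery::bq_project_query(billing_project, sql)
  bigrquery::bq_table_download(result)
}
```

Keeping the SQL in its own small function makes it possible to inspect or test the query text without touching the network.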
This package seems to work faster than the regular
downloadTCGA but is not as fast as
for clinical observations.
For comparison, bigrquery returns only 65 variables of clinical information, whereas RTCGA returns all of them, and bigrquery can extract only specific data types (approximately 10), whereas RTCGA allows downloading any data type (for BRCA there are 43 possible datasets - nrow(checkTCGA(what = 'DataSets', 'BRCA'))).
If you are aware of any other method of downloading TCGA data into R, please let me know. You can leave a message in the Disqus panel below. I would be happy to compare them to the RTCGA package.