Home About Blog pRojects

Extending sparklyr to Compute Cost for K-means on YARN Cluster with Spark ML Library

Machine and statistical learning wizards are becoming more eager to perform analysis with Spark ML library if this is only possible. It’s trendy, posh, spicy and gives the feeling of doing state of the art machine learning and being up to date with the newest computational trends. It is even more sexy and powerful when computations can be performed on the extraordinarily enormous computation cluster - let’s say 100 machines on YARN hadoop cluster makes you the real data cruncher! In this post I present sparklyr package (by RStudio), the connector that will transform you from a regular R user, to the supa! data scientist that can invoke Scala code to perform machine learning algorithms on YARN cluster just from RStudio! Moreover, I present how I have extended the interface to K-means procedure, so that now it is also possible to compute cost for that model, which might be beneficial in determining the number of clusters in segmentation problems. Thought about learnig Scala? Leave it - user sparklyr!

BioC 2016 Conference Overview and Few Ways of Downloading TCGA Data

Few weeks ago I have a great pleasure of attending BioC 2016: Where Software and Biology Connect Conference at Stanford, where I have learned a lot! It wouldn’t be possible without the scholarship that I received from Bioconductor (organizers), which I deeply appreciate. It was an excellent place for software developers, statisticians and biologists to exchange their experiences and to better explain their work, as the understanding between collaborators in interdisciplinary teams is essential. In this post I present my thoughts and feelings about the event and I share the knowledge that I have learned during the event, i.e. about many ways of downloading The Cancer Genome Atlas data.

LDAvis Show Case on R-Bloggers

Text mining is a new challenge for machine wandering practitioners. The increased interest in the text mining is caused by an augmentation of internet users and by rapid growth of the internet data which is said that in 80% is a text data. Extracting information from articles, news, posts and comments have became a desirable skill but what is even more needful are tools for text mining models diagnostics and visualizations. Such visualizations enable to better understand the insight from a model and provides an easy interface for presenting your research results to greater audience. In this post I present the Latent Dirichlet Allocation text mining model for text classification into topics and a great LDAvis package for interactive visualizations of topic models. All this on R-Bloggers posts!

Venn Diagram Comparison of Boruta, FSelectorRcpp and GLMnet Algorithms

Feature selection is a process of extracting valuable features that have significant influence on dependent variable. This is still an active field of research and machine wandering. In this post I compare few feature selection algorithms: traditional GLM with regularization, computationally demanding Boruta and entropy based filter from FSelectorRcpp (free of Java/Weka) package. Check out the comparison on Venn Diagram carried out on data from the RTCGA factory of R data packages.

R Hero saves Backup City with archivist and GitHub

Have you ever suffered because of the impossibility of reproducing graphs, tables or analysis’ results in R? Have you ever bothered yourself for not being able to share R objects (i.e., plots or final analysis models) within your reports, posters or articles? Or maybe simply you have too many objects you can’t manage to store in a convenient and handy way? Now you can share partial results of analysis, provide hooks to valuable R objects within articles, manage analysis’ results and restore objects’ pedigree with archivist package and its extension archivist.github, allautomatically through GitHub without closing RStudio. If you are tired of archiving results by yourself, then read how you can became an R Hero with the archivist.github package power.

Survival plots have never been so informative

Hadley Wickham’s ggplot2 version 2.0 revolution, at the end of 2015, triggered many crashes in dependent R packages, that finally led to deletions of few packages from The Comprehensive R Archive Network. It occured that survMisc package was removed from CRAN on 27th of January 2016 and R world remained helpless in the struggle with the elegant visualizations of survival analysis. Then a new tool - survminer package, created by Alboukadel Kassambara - appeared on the R survival scene to fill the gap in visualizing the Kaplan-Meier estimates of survival curves in elegant grammar of graphics like way. This blog presents main features of core ggsurvplot() function from survminer package, which creates the most informative, elegant and flexible survival plots that I have seen!

R 3.3.0 is another motivation for Docker

Have you ever encountered R packages versioning issues when one application required different dependent packages versions than other? Have you ever got stuck with your project because of wrong pre-installed software versions on machine on which you should run your code? Or maybe you had heavy adventures with installing R software on a new machine because you couldn’t recall all installation steps like; what have I done 2 years ago that RCurl works on my local machine but I can’t install it now on my virtual machine with Windows? Or maybe installation of your R project on new machine was easy but admin couldn’t manage with this process, as he’s not regular R user? If you ever find it problematic to move your R applications to other machines, then this Docker guid post is for you!

RTCGA factory of R packages - Quick Guide

Yesterday we have been delivered with the new version of R - R 3.3.0 (codename Supposedly Educational). This enabled Bioconductor (yes, not all packages are distributed on CRAN) to release it’s new version 3.3. This means that all packages held on Bioconductor, that were under rapid and vivid development, have been moved to stable-release versions and now can be easily installed. This happens once or twice a year. With that date I have finished work with RTCGA package and released, on Bioconductor, the RTCGA Factory of R Packages. Read this quick guide to find out more about this R Toolkit for Biostatistics with the usage of data from The Cancer Genome Atlas study.

Improve your shiny dashboard with Disqus panel

Getting users feedback is always a pleasant moment. In most cases in World of Open Source we are creating tools and applications for people and we love to hear that someone thinks our (generally pet) project is useful. Mostly this moment is nicer than any paycheck. Check this post to find out how to provide easy interface for collecting the user feedback!

Answers to FAQ about SparkR for R users

Many people keep asking me whether I have tried SparkR, is it worth using, is it sexy or WHAT is it at all. I felt that creating frequently asked questions (FAQ) in the field of WHAT is that Spark/SparkR? would help many R Scientists to understand this Big Data Buzz-tool. I have gathered information from the documentation and some code from stackoverflow questions in preparation for the list below.