Saturday, February 28, 2015

DOSE: an R/Bioconductor package for Disease Ontology Semantic and Enrichment analysis


My R/Bioconductor package, DOSE, has been published in Bioinformatics. Summary: The Disease Ontology (DO) annotates human genes in the context of disease. DO is an important annotation resource for translating molecular findings from high-throughput data into clinical relevance. DOSE is an R package providing semantic similarity computations among DO terms and genes, which allows biologists to explore the similarities of diseases and of gene functions from a disease perspective. Enrichment analyses, including the hypergeometric model and gene set enrichment analysis, are also implemented to support discovering disease associations in high-throughput biological data. This allows biologists to verify disease relevance in a biological experiment and to identify unexpected disease associations. Comparison among gene clusters is also supported.

from R-bloggers http://ift.tt/xdvlrq

One weird trick to compile multipartite dynamic documents with Rmarkdown


This afternoon I stumbled across this one weird trick: an undocumented part of the YAML headers that gets processed when you click the ‘knit’ button in RStudio. Knitting turns an Rmarkdown document into a specified output format, using the rmarkdown package’s render function to call pandoc (a universal document converter written in Haskell).

If you specify a knit: field in an Rmarkdown YAML header, you can replace the default function (rmarkdown::render) to which the input file and encoding are passed with any arbitrarily complex function.

For example, the developer of slidify passed in an entirely different function in place of render: slidify::knit2slides.

I thought it’d be worthwhile to modify what’s triggered upon clicking that button - whether as simply as using a specified output file name (see this StackOverflow question), or essentially running a sort of make to compose a multipartite document.

Here’s an exemplar Rmarkdown YAML header which threads together a document from 3 component Rmarkdown subsection files:
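A sketch of the shape such a header takes - the subsection file names (01-intro.Rmd etc.) are hypothetical, and note that RStudio expects the whole knit: value on a single line:

```yaml
---
title: "My Project"
knit: (function(inputFile, encoding) { sapply(c("01-intro.Rmd", "02-methods.Rmd", "03-results.Rmd"), function(f) rmarkdown::render(f, encoding = encoding, quiet = TRUE)); rmarkdown::render(inputFile, encoding = encoding) })
output:
  md_document:
    variant: markdown_github
    includes:
      after_body: [01-intro.md, 02-methods.md, 03-results.md]
---
```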

  • The title: field becomes a top-level header (#) in the output markdown
  • The knit: field (a currently undocumented hook) replaces rmarkdown::render with a custom function specifying parameters for the rendering.

    Yes, unfortunately it does have to be an unintelligible one-liner, unless you have your own R package (with the function exported to the namespace) to source it from (as package::function). Here’s the above laid out more clearly:

    • Firstly, every section’s Rmarkdown file is rendered into markdown (keeping the same base name by default)
    • Each of these files is ‘included’ after the ‘body’ (cf. the header) of this README, if it appears in the includes: after_body: [...] list.
    • The quiet = TRUE parameter silences the standard “Output created: …” message following render(), which would otherwise trigger the RStudio file preview on the last of the intermediate markdown files created.
    • After these component files are processed, the final README markdown is rendered (includes appends their processed markdown contents), and this full document is previewed.
  • All Rmd files here contain a YAML header, the constituent files having only the output:md_document:variant field:

    …before their sub-section contents:

```markdown
## Comparison of cancer types surveyed

Comparing cancer types in this paper to CRUK's most up to date prevalence
statistics...
```
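A constituent file’s minimal YAML header, as just described, might then contain only (a sketch; the variant value here is an assumption):

```yaml
---
output:
  md_document:
    variant: markdown_github
---
```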

Alternative modular setup

One of the problems custom knit functions can also solve is the time it takes for large manuscripts to compile - a huge obstacle to my own use of Rmarkdown which I’m delighted to overcome, and one that has stopped me from recommending it to others as a practical writing tool despite its clear potential.

E.g., when using knitcitations, each reference is downloaded afresh even if its bibliographic metadata has already been obtained. Along with generating individual figures and so on, the time to ‘compile’ an Rmarkdown document can therefore grow exorbitantly for a moderately sized manuscript (rising from seconds to tens of minutes in the extreme, as I saw on a recent essay), breaking the proper flow of writing and review and imposing a penalty on extended Rmarkdown compositions.

A modular structure is the only rational way to handle this, but to my knowledge it isn’t described anywhere for Rmarkdown’s dynamic documents.

In such a framework, the ‘main’ document’s knit function would be as above, but lacking the first step of compiling each .Rmd to .md (that having been done separately upon each edit), so that the pre-made .md files would simply be included (instantly) in the final document:
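Such a ‘main’ header might be sketched as follows (file names again hypothetical): with the per-section rendering step gone, the knit: function reduces to a plain render, and the includes: list picks up the ready-made markdown:

```yaml
---
title: "My Project"
knit: (function(inputFile, encoding) { rmarkdown::render(inputFile, encoding = encoding) })
output:
  md_document:
    variant: markdown_github
    includes:
      after_body: [01-intro.md, 02-methods.md, 03-results.md]
---
```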

Much more sensibly, the edited Rmarkdown component files (subsections) wouldn’t all need to be re-processed - e.g. have all references and figures regenerated - rather, this would be done per file, and each file could in turn have its own custom knit: hook (though note that the example below only works to prevent the file preview; there’s scope to do much more with it).
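A minimal per-file hook of that kind might be as simple as this sketch - passing quiet = TRUE so the “Output created: …” message, and hence the RStudio preview, is suppressed:

```yaml
knit: (function(inputFile, encoding) { rmarkdown::render(inputFile, encoding = encoding, quiet = TRUE) })
```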

  [Embedded video via Software Carpentry]

The idea would be to follow what this Software Carpentry video describes regarding makefiles for reproducible research papers. In theory, the initially described knit: function could generate a full paper including analyses from component section files, each of which could in turn have their own knit: hooks.

The example above creates a README.md file suitable for display in a standard GitHub repository (though it’s not advisable to write sprawling READMEs). It could easily be tweaked to produce a paper.pdf, as in the Software Carpentry example, by using a PDF output in the YAML header for the final .md to .pdf step after including the component parts.

For what it’s worth, my current YAML header for a manuscript in PDF is:
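Not my literal header, but an illustrative sketch of its general shape (every value below is a placeholder rather than my actual setting):

```yaml
---
title: "Manuscript title"
author: "Author Name"
output:
  pdf_document:
    latex_engine: xelatex
    keep_tex: true
    number_sections: true
bibliography: references.bib
---
```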

… and in the top matter (after the YAML, before the markdown, for the LaTeX engine & R):
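Again only an illustrative sketch (the LaTeX command and chunk options are placeholders, not my exact top matter): a raw LaTeX line for the engine, followed by a knitr setup chunk for R:

````markdown
\renewcommand{\familydefault}{\sfdefault}

```{r setup, include=FALSE}
# Hide code in the output and cache chunk results between compilations
knitr::opts_chunk$set(echo = FALSE, cache = TRUE)
```
````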

A minor limitation I see here is that it’s not possible to provide subsection titles through metadata - at present the title is written to markdown with a hardcoded ‘# ’ prefix. In a reproducible-manuscript utopia, the title: field could still be specified and a markdown header prefix of the appropriate level generated accordingly (which might also allow for procedural sub-section numbering - 1.2, 1.2.1, etc.).
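As a sketch of what that utopia might look like, a hypothetical helper (md_header is an invented name, not part of rmarkdown) could take the level as a parameter instead of hardcoding ‘# ’:

```r
# Hypothetical helper: render a title as a markdown header of a given level,
# instead of hardcoding the '# ' prefix
md_header <- function(title, level = 1) {
  paste0(paste(rep("#", level), collapse = ""), " ", title)
}

md_header("Comparison of cancer types surveyed", level = 2)
# "## Comparison of cancer types surveyed"
```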

The above can also be found on my GitHub development notes Wiki, but it’s not possible to leave comments there. Feedback and more tips and tricks for Rmarkdown workflows are welcome.

✎ Check out the rmarkdown package and the general Rmd documentation.


Book Review: Mastering Scientific Computing with R


The PACKT marketing team contacted me again, this time to review their new book Mastering Scientific Computing with R. The 432-page book (including covers) consists of 10 chapters, starting from basic R and ending with advanced data management. However, ...


Tools in Tandem – SQL and ggplot. But is it Really R?


Increasingly I find that I have fallen into using not-really-R whilst playing around with Formula One stats data. Instead, I seem to be using a hybrid of SQL to get data out of a small SQLite3 database and into an R dataframe, and then ggplot2 to visualise it. So for example, I’ve recently been […]


Scalable Machine Learning for Big Data Using R and H2O


Part I Part II H2O is an open source parallel processing engine for machine learning on Big Data. This prediction engine is designed by H2O, a Mountain View-based startup that has implemented a number of impressive statistical and machine learning algorithms to run on HDFS, S3, SQL and NoSQL. We were honored to have Tom Kraljevic (Vice President of Engineering…


The post Scalable Machine Learning for Big Data Using R and H2O appeared first on Data Science Las Vegas (DSLV).




A new release of RcppEigen is now on CRAN and in Debian. It synchronizes the Eigen code with the 3.2.4 upstream release, and updates the RcppEigen.package.skeleton() package creation helper to use the kitten() function from pkgKitten for enhanced pac...


Playing around with #rstats twitter data


As a bit of weekend fun, I decided to briefly look into the #rstats twitter data that Stephen Turner collected and made available (thanks!). Essentially, this data set contains some basic information about over 100,000 tweets that contain the hashtag…



Friday, February 27, 2015

John Chambers Statistical Software Award 2015


In 1998 John M. Chambers (now a member of R-core) won the ACM Software System Award for the S Language, which (in the words of the committee) "forever altered how people analyze, visualize, and manipulate data". John graciously donated the prize money to support budding researchers in statistical computing: his Statistical Software Award has been granted annually since 2000. For the 2015 award, an individual or a team can apply: teams of up to 3 people can participate in the competition, with the cash award being split among team members. The travel allowance will be given to just one individual in...


Data Science/Statistics/R @Google


This meetup will be hosted by Google and we’ll have Peter Lipman and Pete Meyer...


Does Balancing Classes Improve Classifier Performance?


It’s a folk theorem I sometimes hear from colleagues and clients: that you must balance the class prevalence before training a classifier. Certainly, I believe that classification tends to be easier when the classes are nearly balanced, especially when the class you are actually interested in is the rarer one. But I have always been … Related posts:
  1. Don’t use correlation to track prediction performance
  2. The Geometry of Classifiers
  3. Can a classifier that never says “yes” be useful?
