MySQL is the clear winner of the April ranking, but the various forks do not show a uniform performance.

from DB-Engines.com Blog http://ift.tt/1sb9rtx

http://ift.tt/1DpjrJ5


http://ift.tt/1Di1dHS

from SQL Server Pro http://ift.tt/1kCjeYg

Those of you who are already hooked into Azure SQL Database know that v12 was in Public Preview in December of 2014 and has been available in most regions since February of this year, 2015. Despite Microsoft making a really big announcement about this release, not that many people are aware of it. They really should be.


http://ift.tt/1ycxj1R


The IT/Dev Connections super early-bird discount ends on Tuesday, March 31st and it is the lowest price of the year.


http://ift.tt/1FGxUjX


Thursday March 26th at 1:00 pm (EST) 18:00 (GMT) - Oracle Database & RAC Management Pack for System Center Operations Manager Webinar


http://ift.tt/1IlbRxv


You can restore the bacpac file from an Azure SQL Database to a local SQL Server deployment using Management Studio:

- Just right-click on Connection > Databases and select "Import Data-tier Application..."
- Select "Next" on the introduction step.
- Click "Browse" to locate the bacpac file.
- Alternatively, you can import from a bacpac file in Azure storage, but I wouldn't recommend that due to performance issues.
- Click "Next".

And you are all set.

Check out my new book about the R language: http://www.amazon.com/dp/B00SX6WA06

http://ift.tt/1FbNBzv

Geomorph users: our function plotTangentSpace() performs a Principal Components Analysis (PCA) of shape variation, plots two dimensions of tangent space for a set of Procrustes-aligned specimens, and also returns the shape cha...

from R-bloggers http://ift.tt/xdvlrq


http://ift.tt/18NTaGo

I am forever returning to PubMed data, downloaded as XML, trying to extract information from it and becoming deeply confused in the process. Take the seemingly-simple question “how many retracted articles are there in PubMed?” Well, one way is to search for records with the publication type “Retracted Article”. As of right now, that returns […]



http://ift.tt/1GwWgK2


The concept of “tidy data”, as introduced by Hadley Wickham, offers a powerful framework for data manipulation, analysis, and visualization. Popular packages like dplyr, tidyr and ggplot2 take great advantage of this framework, as explored in several recent posts by others.

But there’s an important step in a tidy data workflow that so far has been missing: the **output** of R statistical modeling functions isn’t tidy, meaning it’s difficult to manipulate and recombine in downstream analyses and visualizations. Hadley’s paper makes a convincing statement of this problem (emphasis mine):

> While model inputs usually require tidy inputs, such attention to detail doesn’t carry over to model outputs. Outputs such as predictions and estimated coefficients aren’t always tidy. This makes it more difficult to combine results from multiple models. For example, in R, the default representation of model coefficients is not tidy because it does not have an explicit variable that records the variable name for each estimate; they are instead recorded as row names. In R, row names must be unique, so combining coefficients from many models (e.g., from bootstrap resamples, or subgroups) requires workarounds to avoid losing important information. This knocks you out of the flow of analysis and makes it harder to combine the results from multiple models. I’m not currently aware of any packages that resolve this problem.

In this new paper I introduce the **broom** package (available on CRAN), which bridges the gap from untidy outputs of predictions and estimations to the tidy data we want to work with. It takes the messy output of built-in statistical functions in R, such as `lm`, `nls`, `kmeans`, or `t.test`, as well as popular third-party packages, like gam, glmnet, survival or lme4, and turns them into tidy data frames. This allows the results to be handed to other tidy packages for downstream analysis: they can be recombined using dplyr or visualized using ggplot2.

As a simple example, consider a linear regression on the built-in `mtcars` dataset:

```r
fit <- lm(mpg ~ wt + qsec, mtcars)
summary(fit)
```

```r
## 
## Call:
## lm(formula = mpg ~ wt + qsec, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.3962 -2.1431 -0.2129  1.4915  5.7486 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  19.7462     5.2521   3.760 0.000765 ***
## wt           -5.0480     0.4840 -10.430 2.52e-11 ***
## qsec          0.9292     0.2650   3.506 0.001500 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.596 on 29 degrees of freedom
## Multiple R-squared:  0.8264, Adjusted R-squared:  0.8144 
## F-statistic: 69.03 on 2 and 29 DF,  p-value: 9.395e-12
```

This summary shows many kinds of statistics describing the regression: coefficient estimates and p-values, information about the residuals, and model statistics like $R^2$ and the F statistic. But this format isn’t convenient if you want to combine and compare multiple models, or plot it using ggplot2: you need to turn it into a data frame.

The broom package provides three tidying methods for turning the contents of this object into a data frame, depending on the level of statistics you’re interested in. If you want statistics about each of the coefficients fit by the model, use the `tidy()` method:

```r
library(broom)
tidy(fit)
```

```r
##          term  estimate std.error  statistic      p.value
## 1 (Intercept) 19.746223 5.2520617   3.759709 7.650466e-04
## 2          wt -5.047982 0.4839974 -10.429771 2.518948e-11
## 3        qsec  0.929198 0.2650173   3.506179 1.499883e-03
```

Note that the rownames are now added as a column, `term`, meaning that the data can be combined with other models. Note also that the columns have been given names like `std.error` and `p.value` that are more easily accessed than `Std. Error` and `Pr(>|t|)`. This is true of all data frames broom returns: they’re designed so they can be processed in additional steps.
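As a small illustration (not from the original post) of why those plain column names matter, downstream filtering needs no special handling, even in base R:

```r
library(broom)

fit <- lm(mpg ~ wt + qsec, mtcars)
td <- tidy(fit)

# p.value is an ordinary column, so standard subsetting applies:
significant <- td[td$p.value < 0.01, c("term", "estimate")]
significant
```

With the fit above, all three terms clear the 0.01 threshold, so all three rows survive the filter.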

If you’re interested in extracting per-observation information, such as fitted values and residuals, use the `augment()` method, which adds these to the original data:

```r
head(augment(fit))
```

```r
##           .rownames  mpg    wt  qsec  .fitted   .se.fit      .resid
## 1         Mazda RX4 21.0 2.620 16.46 21.81511 0.6832424 -0.81510855
## 2     Mazda RX4 Wag 21.0 2.875 17.02 21.04822 0.5468271 -0.04822401
## 3        Datsun 710 22.8 2.320 18.61 25.32728 0.6397681 -2.52727880
## 4    Hornet 4 Drive 21.4 3.215 19.44 21.58057 0.6231436 -0.18056924
## 5 Hornet Sportabout 18.7 3.440 17.02 18.19611 0.5120709  0.50388581
## 6           Valiant 18.1 3.460 20.22 21.06859 0.8032106 -2.96858808
##         .hat   .sigma      .cooksd  .std.resid
## 1 0.06925986 2.637300 2.627038e-03 -0.32543724
## 2 0.04436414 2.642112 5.587076e-06 -0.01900129
## 3 0.06072636 2.595763 2.174253e-02 -1.00443793
## 4 0.05761138 2.641895 1.046036e-04 -0.07164647
## 5 0.03890382 2.640343 5.288512e-04  0.19797699
## 6 0.09571739 2.575422 5.101445e-02 -1.20244126
```

Finally, `glance()` computes per-model statistics, such as $R^2$, AIC, and BIC:

```r
glance(fit)
```

```r
##   r.squared adj.r.squared    sigma statistic      p.value df    logLik
## 1 0.8264161     0.8144448 2.596175  69.03311 9.394765e-12  3 -74.36025
##        AIC      BIC deviance df.residual
## 1 156.7205 162.5834 195.4636          29
```
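Because `glance()` returns exactly one row per model, several fits can be stacked into a single comparison table (a quick sketch, not from the original post):

```r
library(broom)

models <- list(
    wt_only = lm(mpg ~ wt, mtcars),
    wt_qsec = lm(mpg ~ wt + qsec, mtcars)
)

# One glance() row per model, bound into a single data frame:
comparison <- cbind(model = names(models),
                    do.call(rbind, lapply(models, glance)))
comparison[, c("model", "adj.r.squared", "AIC", "BIC")]
```

This is the kind of model comparison that the row-name representation of `summary()` makes awkward.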

As pointed out by Mike Love, the `tidy` method makes it easy to construct coefficient plots using **ggplot2**:

```r
library(ggplot2)
td <- tidy(fit, conf.int = TRUE)
ggplot(td, aes(estimate, term, color = term)) +
    geom_point() +
    geom_errorbarh(aes(xmin = conf.low, xmax = conf.high)) +
    geom_vline(xintercept = 0)
```

When combined with dplyr’s `group_by` and `do`, **broom** also lets you perform regressions within groups, such as within automatic and manual cars separately:

```r
library(dplyr)
mtcars %>% group_by(am) %>% do(tidy(lm(mpg ~ wt, .)))
```

```r
## Source: local data frame [4 x 6]
## Groups: am
## 
##   am        term  estimate std.error statistic      p.value
## 1  0 (Intercept) 31.416055 2.9467213 10.661360 6.007748e-09
## 2  0          wt -3.785908 0.7665567 -4.938848 1.245595e-04
## 3  1 (Intercept) 46.294478 3.1198212 14.838824 1.276849e-08
## 4  1          wt -9.084268 1.2565727 -7.229401 1.687904e-05
```

This is useful for performing regressions or other analyses within each gene, country, or any other kind of division in your tidy dataset.
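For example, to pull out just the per-group slope on `wt` from grouped fits like the ones above:

```r
library(dplyr)
library(broom)

# One wt slope per transmission type (am = 0 or 1):
slopes <- mtcars %>%
    group_by(am) %>%
    do(tidy(lm(mpg ~ wt, .))) %>%
    ungroup() %>%
    filter(term == "wt")
slopes
```

Because the grouped result is already a tidy data frame, extracting a per-group quantity is just another `filter` step.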

The broom package provides tidying methods for many other packages as well. These tidiers serve to connect various statistical models seamlessly with packages like dplyr and ggplot2. For instance, we could create a LASSO regression with the glmnet package:

```r
library(glmnet)
set.seed(03-19-2015)

# generate data with 5 real variables and 45 null, on 100 observations
nobs <- 100
nvar <- 50
real <- 5
x <- matrix(rnorm(nobs * nvar), nobs)
beta <- c(rnorm(real, 0, 1), rep(0, nvar - real))
y <- c(t(beta) %*% t(x)) + rnorm(nobs, sd = 3)

glmnet_fit <- cv.glmnet(x, y)
```

Then we tidy it with broom and plot it using ggplot2:

```r
tidied_cv <- tidy(glmnet_fit)
glance_cv <- glance(glmnet_fit)

ggplot(tidied_cv, aes(lambda, estimate)) +
    geom_line(color = "red") +
    geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = .2) +
    scale_x_log10() +
    geom_vline(xintercept = glance_cv$lambda.min) +
    geom_vline(xintercept = glance_cv$lambda.1se, lty = 2)
```

By plotting with ggplot2 rather than relying on glmnet’s built-in plotting methods, we gain access to all the tools and framework of the package. This allows us to customize or add attributes, or even to compare multiple LASSO cross-validations in the same plot. The same is true of the survival package:

```r
library(survival)

surv_fit <- survfit(coxph(Surv(time, status) ~ age + sex, lung))
td <- tidy(surv_fit)

ggplot(td, aes(time, estimate)) +
    geom_line() +
    geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = .2)
```

Others have explored how **broom** can help visualize random effects estimated with **lme4**. Other packages for which tidiers are implemented include gam, zoo, lfe, and multcomp.

The vignettes for the **broom** package offer other useful examples, including one on combining broom and dplyr, a demonstration of bootstrapping with broom, and a simulation of k-means clustering. The broom manuscript offers still more examples.
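As a flavor of the bootstrapping idea (a minimal sketch, not the vignette’s exact code): because `tidy()` returns ordinary data frames, coefficient tables from many resamples stack without any row-name workarounds:

```r
library(broom)

# Refit the regression on 200 bootstrap resamples of mtcars and
# stack the tidy coefficient tables into one data frame.
set.seed(42)
boot_coefs <- do.call(rbind, lapply(1:200, function(i) {
    resample <- mtcars[sample(nrow(mtcars), replace = TRUE), ]
    cbind(replicate = i, tidy(lm(mpg ~ wt + qsec, resample)))
}))

# Bootstrap standard error of each coefficient:
aggregate(estimate ~ term, boot_coefs, sd)
```

This is exactly the “combining coefficients from many models” workflow that row names make painful, reduced to an `rbind`.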

Tidying model outputs is not an exact science, and it is based on a judgment of the kinds of values a data scientist typically wants out of a tidy analysis (for instance, estimates, test statistics, and p-values). It is my hope that data scientists will propose and contribute their own features (feature requests are welcome!) to help expand the universe of tidy analysis tools.


http://ift.tt/1GwUVTt

by Joseph Rickert The gradient boosting machine, as developed by Friedman, Hastie, Tibshirani and others, has become an extremely successful algorithm for dealing with both classification and regression problems and is now an essential feature of any machine learning toolbox. R’s gbm() function (gbm package) is a particularly well crafted implementation of the gradient boosting machine that served as a model for the rxBTrees() function, which was released last year as a feature of Revolution R Enterprise 7.3. You can think of rxBTrees() as a scaled-up version of gbm() that is designed to work with massive data sets in distributed...

