Top 10 R packages for data science

The open source project R is one of the leading tools for data science and machine learning tasks. Because it is open source, contributions are continual and packages with new features appear frequently. The CRAN package repository currently has 12,525 packages available. This post takes a look at the most popular and useful packages that have set the standards for solving data manipulation, visualization, and machine learning problems.

Data manipulation

Dplyr

Dplyr is a must-have library for quick and easy data wrangling and analysis in R. It is designed to work with tabular data, including database tables from MySQL, PostgreSQL, SQLite, and Google BigQuery. Its distinguishing features are a simple command syntax and high performance.
The main idea of dplyr is to provide a few simple functions that cover the most common data manipulation tasks. These five functions are:

  • mutate() - adds new columns that can be expressed as functions of existing variables.
  • select() - selects one or more variables (columns).
  • filter() - selects the rows (cases) that satisfy the given conditions.
  • summarize() - collapses many values into a single summary.
  • arrange() - puts the rows in the desired order.

There is also the group_by() function, which lets you perform any of these operations by group.

In addition, dplyr accelerates all these commands with a C++ backend, which makes it particularly popular for processing large amounts of data.
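The five verbs and group_by() are usually chained into a single pipeline. The following is a minimal sketch using the built-in mtcars data set; the choice of data set, the threshold, and the derived column names (hp_per_wt, mean_mpg) are assumptions made for illustration.

```r
library(dplyr)

result <- mtcars %>%
  filter(hp > 100) %>%                          # keep rows that satisfy a condition
  select(mpg, cyl, hp, wt) %>%                  # keep only a few columns
  mutate(hp_per_wt = hp / wt) %>%               # add a column computed from existing ones
  group_by(cyl) %>%                             # the summary below is computed per group
  summarize(mean_mpg = mean(mpg), n = n()) %>%  # one summary row per group
  arrange(desc(mean_mpg))                       # order rows by the summary value

print(result)
```

Each step takes a data frame and returns a data frame, which is why the verbs compose so easily.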

Data.table

Data.table is a concise library designed for sophisticated data manipulation. With its help, you can perform many operations in just one line of code. In addition, data.table is faster than dplyr in some cases, which can decide the choice when memory or performance is constrained. The library supports operations such as subsetting, grouping, updating, joining, and many more.
Data.table has a very clear general structure: DT[i, j, by], where i and j refer to rows and columns, respectively. In other words, rows are subset using i, j is computed on those rows, and by specifies the grouping.
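The DT[i, j, by] form can be sketched as follows. This is an illustrative example on the built-in mtcars data set; the column names mean_mpg, n, and hp_per_wt are made up for the example.

```r
library(data.table)

dt <- as.data.table(mtcars, keep.rownames = "model")

# DT[i, j, by]: subset rows with i, compute j, grouped by `by`
by_cyl <- dt[hp > 100, .(mean_mpg = mean(mpg), n = .N), by = cyl]

# Update by reference: := adds a column without copying the table
dt[, hp_per_wt := hp / wt]

print(by_cyl)
```

The whole filter, aggregate, and group operation fits in one bracket expression, which is where the "one line of code" reputation comes from.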

Sometimes data.table is used in conjunction with dplyr.

Graphic displays and HTML widgets

Ggplot2

Ggplot2 is one of the most popular data visualization packages for R users. It builds on the idea of the grammar of graphics and applies a system of concepts such as data sets (univariate and multivariate, numeric and categorical), aesthetic mappings to visual properties, geometric objects, statistical transformations of variables, coordinate systems, and so on to create graphics. Plots are built up layer by layer, combining these building blocks until you get the type of display you want. Ggplot2 also takes care of many secondary questions and plot details, such as whether a legend is needed, where it should be placed, or which limits to choose for the axes, which lets you concentrate on the main task. Ggplot2 is also often used as the basis for libraries that offer more complex graphics.
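The layer-by-layer construction might look like the sketch below. The data set (mtcars), the chosen variables, and the axis limits are assumptions for illustration only.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +  # data and aesthetic mappings
  geom_point() +                                # geometric objects: one point per car
  geom_smooth(method = "lm", se = FALSE) +      # statistical transformation: linear fit per group
  coord_cartesian(xlim = c(1, 6)) +             # coordinate system and visible axis range
  labs(colour = "Cylinders") +                  # legend title
  theme(legend.position = "bottom")             # legend placement

print(p)
```

Every `+` adds one building block, and anything left unspecified (scales, legend, axis breaks) falls back to a default.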

However, there are some limitations: the package is not suited for three-dimensional or interactive graphics.

Ggvis

This R library is designed to create visualizations similar to those of ggplot2, but in an interactive, web-based form. As the visualization backend, ggvis uses Vega, which in turn is based on D3.js; for user interaction, the package relies on Shiny and on the dplyr grammar of data transformation.
Limitations include the inability to perform complex interactions, such as turning certain layers on and off or switching between data sets, and the need for a connection to a running R session, which is not ideal for publishing.
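A small interactive plot could be sketched as below, assuming ggvis is installed. The data set, the slider range, and the label are illustrative choices; displaying the plot with its slider requires an interactive R session.

```r
library(ggvis)

v <- mtcars %>%
  ggvis(~wt, ~mpg) %>%      # formula syntax maps columns to x and y
  layer_points() %>%        # scatter layer
  layer_smooths(            # smoothed trend whose span is bound to a slider
    span = input_slider(0.5, 1, value = 0.75, label = "Smoothing span")
  )

# Printing `v` in an interactive session renders the plot with the slider.
```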

Plotly

Plotly is a JavaScript and D3.js library with an R API. It has extensive functionality and a wide variety of charts available (line charts, scatter plots, area charts, bar charts, histograms, heat maps, 3D charts, etc.), and its syntax is similar to that of ggplot2. You can create interactive charts directly from R or use the ggplotly() function, which converts charts created with ggplot2 into an interactive web-based version.
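Both routes can be sketched in a few lines. The mtcars data set and the variable names are assumptions for the example.

```r
library(ggplot2)
library(plotly)

# Route 1: build a static ggplot2 chart, then convert it
p <- ggplot(mtcars, aes(wt, mpg)) + geom_point()
converted <- ggplotly(p)

# Route 2: create the interactive chart directly with plot_ly()
direct <- plot_ly(mtcars, x = ~wt, y = ~mpg, type = "scatter", mode = "markers")
```

Printing either object opens the chart in the viewer or browser, with zooming and hover tooltips.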

The visualizations and the data behind them can be viewed and changed in a web browser. Nevertheless, plotly has the same disadvantage as ggvis: the need for a running R session.

DataTables

The DT package enables the creation of searchable, interactive tables with a minimum of code and effort. As an R interface to the JavaScript DataTables library, DT can convert R data in the form of matrices or data frames into tables on HTML pages.

Some of the most useful features that can be enabled in tables are filtering, pagination, sorting, and many more. You can also style and edit the table and choose the options for displaying it.
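As a rough sketch, a filterable, paginated table can be produced with a single call. The iris data set and the page length are arbitrary choices for illustration.

```r
library(DT)

tbl <- datatable(
  iris,
  filter = "top",                   # per-column filter controls above the table
  options = list(pageLength = 10)   # rows shown per page
)

# Printing `tbl` renders the table in the viewer, an R Markdown document, or a Shiny app.
```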

Machine learning

Caret

The Classification and Regression Training (caret) package is a set of tools that help carry out various machine learning tasks, from data partitioning and preprocessing to building predictive models and estimating their performance. In other words, the library combines powerful functions and algorithms for model training and prediction. There are 238 models available; however, they all cover regression and classification only. Many common metrics are implemented in the library, but you can also write your own quality metrics and wrapper methods for models. In addition, caret is well integrated with other algorithm-specific packages.
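The workflow from partitioning to performance estimation could look like the sketch below. The iris data set, the k-nearest-neighbours model, the 80/20 split, and the 5-fold cross-validation are all assumptions chosen to keep the example small.

```r
library(caret)

set.seed(42)

# Data partitioning: stratified 80/20 split on the outcome
idx <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
train_set <- iris[idx, ]
test_set  <- iris[-idx, ]

# Preprocessing, resampling, and model training in one call
ctrl <- trainControl(method = "cv", number = 5)
fit <- train(
  Species ~ ., data = train_set,
  method = "knn",                       # any supported model name can go here
  preProcess = c("center", "scale"),
  trControl = ctrl
)

# Performance estimation on held-out data
pred <- predict(fit, newdata = test_set)
cm <- confusionMatrix(pred, test_set$Species)
print(cm$overall["Accuracy"])
```

Swapping the model only requires changing the `method` string, which is the main convenience of the unified interface.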

Gbm

Gradient boosting is a machine learning technique for regression and classification problems that builds a predictive model by combining several weak models, mostly decision trees. The resulting gain in performance makes gradient boosting stand out among the most powerful predictive tools that you can use in machine learning. The gbm package (Generalized Boosted Regression Models) implements extensions of Freund and Schapire's AdaBoost algorithm and Friedman's gradient boosting machine. It provides tools for fast modeling, variable selection, and final-stage precision modeling that ensure robust and competitive performance. The package includes regression methods for least squares, absolute loss, t-distribution loss, quantile regression, Poisson, Cox proportional hazards partial likelihood, AdaBoost exponential loss, Huberized hinge loss, and learning-to-rank measures.
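A least-squares boosted model might be fitted as in the sketch below. The iris data set, the target variable, and the hyperparameter values are illustrative assumptions, not tuned settings.

```r
library(gbm)

set.seed(42)
fit <- gbm(
  Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species,
  data = iris,
  distribution = "gaussian",   # least-squares loss
  n.trees = 200,               # number of weak learners (trees)
  interaction.depth = 2,       # depth of each tree
  shrinkage = 0.05             # learning rate
)

pred <- predict(fit, newdata = iris, n.trees = 200)

# Relative influence of each predictor, useful for variable selection
influence <- summary(fit, plotit = FALSE)
print(influence)
```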

RandomForest

Another very popular, powerful, and versatile machine learning algorithm is Random Forest. It combines multiple decision trees to build a stronger model and improve generalization in classification and regression tasks. An implementation of this algorithm is included in the randomForest R package. The library builds a large number of decision trees, passes each observation through every tree, and takes the most common result across the trees (or the average, for regression) as the final prediction.

It is important to note that randomForest works with numeric or factor variables. It can also be used in unsupervised mode to assess the proximity between data points.
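Both the supervised and the unsupervised use can be sketched briefly. The iris data set and the number of trees are arbitrary choices for the example.

```r
library(randomForest)

set.seed(42)

# Supervised: classify species from the four numeric measurements
rf <- randomForest(Species ~ ., data = iris, ntree = 200, importance = TRUE)
pred <- predict(rf, newdata = iris)

# Unsupervised: omit the response and request the proximity matrix
rf_unsup <- randomForest(x = iris[, 1:4], ntree = 200, proximity = TRUE)
prox <- rf_unsup$proximity   # entry [i, j]: share of trees where i and j land in the same leaf

print(rf)
```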

Xgboost

Last but not least on our list is the Extreme Gradient Boosting (xgboost) library, an implementation of the Gradient Boosted Decision Trees algorithm with an R interface. Here, too, the aim is to build an ensemble of successively refined elementary models that can solve supervised machine learning problems. It has both a linear model solver and tree learning algorithms. The built-in objective functions cover regression, classification, and ranking, and you can also define your own objective. The package is valued for parallel computation, built-in cross-validation, and regularization. All these properties give xgboost exceptionally high predictive power and very good speed.
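A small regression example with cross-validation and regularization is sketched below. The mtcars data set, the selected predictors, and all parameter values are assumptions for illustration; argument details can differ slightly between xgboost versions.

```r
library(xgboost)

x <- as.matrix(mtcars[, c("wt", "hp", "disp", "cyl")])
y <- mtcars$mpg
dtrain <- xgb.DMatrix(data = x, label = y)

params <- list(
  objective = "reg:squarederror",  # built-in regression objective
  max_depth = 3,
  eta = 0.1,                       # learning rate
  lambda = 1,                      # L2 regularization
  nthread = 2                      # parallel threads
)

set.seed(42)

# Built-in k-fold cross-validation
cv <- xgb.cv(params = params, data = dtrain, nrounds = 50, nfold = 4, verbose = 0)

# Final model on all rows
model <- xgb.train(params = params, data = dtrain, nrounds = 50)
pred <- predict(model, dtrain)
```

The cross-validation result holds a per-round evaluation log that can guide the choice of `nrounds`.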

Summary

These 10 packages have proven to be very effective and helpful for many data scientists, and especially for our team. They are used to solve complicated problems in various fields and to answer a wide variety of scientific questions.

Of course, this is a subjective list; there are many other valuable R libraries available.