Will RStudio ever fully support Python

Do you need to learn the R language? R programming language

Statistical analysis is an essential part of scientific research. High-quality data processing increases the chances of publishing an article in a reputable journal and bringing the research to an international level. Many programs can provide high-quality analysis, but most are commercial, and a license often costs hundreds of dollars or more. Today, however, we will talk about a statistical environment that is free of charge and whose reliability and popularity rival the best commercial statistics packages: meet R!

What is R?

Before giving a clear definition, it should be noted that R is more than just a program: it is an environment, a language and even a movement! We will look at R from each of these angles.

R is a computing environment developed by scientists for data processing, mathematical modeling and graphics. R can be used as a simple calculator; you can edit data tables, run simple statistical analyses (e.g. t-test, ANOVA or regression analysis) and more complex, time-consuming calculations, test hypotheses, and draw vector graphics and maps. This is not an exhaustive list of what the environment can do. It is distributed free of charge and can be installed on both Windows and UNIX-class operating systems (Linux and Mac OS X). In other words, R is a free and cross-platform product.
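
As a rough illustration of the tasks listed above, here is a minimal sketch that uses only base R. The data sets `sleep`, `PlantGrowth` and `cars` are examples that ship with R; they are chosen here for convenience and are not taken from the article.

```r
# R as a calculator
2 + 2 * 2            # operator precedence applies
sqrt(16) + 10 / 4

# Two-sample t-test on the built-in 'sleep' data set
tt <- t.test(extra ~ group, data = sleep)
tt$p.value

# One-way ANOVA on the built-in 'PlantGrowth' data set
fit_aov <- aov(weight ~ group, data = PlantGrowth)
summary(fit_aov)

# Simple linear regression on the built-in 'cars' data set
fit_lm <- lm(dist ~ speed, data = cars)
coef(fit_lm)
```

Each call returns an object that can be stored, inspected or passed on to further functions, which is what makes the step from calculator to full analysis fairly smooth.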

R is a programming language in which you can write your own programs (scripts) as well as use and create special extensions (packages). A package is a collection of functions, help files, and examples that are bundled together in a single archive. Packages play an important role because they extend what R can do. Each package is usually dedicated to a specific topic: for example, the "ggplot2" package is used to create attractive vector graphics in a particular style, and the "qtl" package is well suited to genetic mapping. There are currently over 7000 such packages in the R repository! All of them are checked for errors and are publicly available.
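
The usual package workflow can be sketched as follows. This is only an illustration: the installation call is commented out because it needs network access, and "ggplot2" is used simply because the text mentions it.

```r
# Packages that ship with R can be loaded directly
library(stats)

# A CRAN package has to be installed once before it can be loaded.
# Commented out here because it needs network access:
# install.packages("ggplot2")

# requireNamespace() checks for a package without stopping on failure
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  message("ggplot2 is available")
} else {
  message("ggplot2 is not installed; run install.packages(\"ggplot2\")")
}

# Names of all packages installed in the current library paths
installed <- rownames(installed.packages())
head(installed)
```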

R is a community / movement.

Because R is a free, open source product, development, testing, and debugging are not done by a single company with employees, but by the users themselves. Over the past two decades, a core of developers and enthusiasts has grown into a huge community. According to the latest data, more than 2 million people have helped in one way or another to develop and promote R on a voluntary basis, from translating documentation to creating training courses to developing new applications for science and industry. There are a variety of forums on the internet where you can find answers to most R questions.

What does the R environment look like?

There are many "wrappers" (shells) for R, which vary greatly in appearance and functionality. However, we will briefly cover only three of the most popular options: Rgui, RStudio, and R run from the command line in a Linux / UNIX terminal.

Rgui is the standard GUI (https://cran.r-project.org/) that is integrated into R by default. This shell takes the form of a command line in a window called the console. The command line works on a question-and-answer basis.

For example:
> 2 + 2 * 2 # our question / request
[1] 6 # computer response

However, for writing a complex sequence of commands, Rgui has an additional script window, where the program (script) is written. The third element of this shell is the graphics module, which is displayed when you want to view graphs.

The following figure shows the full version of Rgui: console (left), script window and graphics module (right).

RStudio is an integrated development environment (IDE) (https://www.rstudio.com/). In contrast to Rgui, this shell has predefined panes and additional modules (e.g. command history, workspace). According to some users, RStudio has a more user-friendly interface that makes working with R easier. Features such as syntax highlighting, automatic code completion and easy navigation through a script make RStudio attractive not only for beginners but also for experienced programmers.

R in the Linux / UNIX terminal. This option is preferable when analyzing large amounts of data on a server, cluster, or supercomputer. Most of them run Linux / UNIX-class operating systems that are accessed through a command terminal (e.g. Bash). In a terminal, R is a command-line application.
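
On a server, R is often run non-interactively rather than typed into a console. The sketch below shows what such a script might look like; the file name `analyse.R` and the meaning of its argument are hypothetical and only serve the illustration.

```r
# Contents of a hypothetical file analyse.R, run from the shell as:
#   Rscript analyse.R 1000
args <- commandArgs(trailingOnly = TRUE)

# First argument = sample size; fall back to 100 if it is missing or invalid
n <- suppressWarnings(as.integer(args[1]))
if (is.na(n) || n < 1) n <- 100L

set.seed(42)                 # make the simulated data reproducible
x <- rnorm(n)
cat("n =", n, " mean =", mean(x), " sd =", sd(x), "\n")
```

`Rscript` ships with R itself, so no extra tooling is assumed beyond a standard installation.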

The R language in the world of statistical programs

There are currently dozens of high-quality statistics packages, among which SPSS, SAS and MatLab are the clear market leaders. Despite stiff competition, in 2013 R became the most widely used statistical analysis software in scientific publications (http://r4stats.com/articles/popularity/). In addition, R has been increasingly in demand in business over the past decade: giants such as Google, Facebook, Ford and the New York Times actively use it to collect, analyze and visualize data (http://www.revolutionanalytics.com/companies-using-r). To understand the reasons for R's growing popularity, we should look at what it has in common with other statistical products and how it differs from them.

In general, most statistical tools can be divided into three types:

  1. GUI programs, based on the "click here, click there and get the finished result" principle;
  2. statistical programming languages, which require basic programming skills;
  3. "mixed" programs, which have both a graphical user interface (GUI) and the ability to run scripts (e.g. SAS, STATA, Rcmdr).

Features of programs with a GUI

Programs with a graphical interface look familiar to the average user and are easy to learn. However, they are not suitable for non-trivial tasks, because they offer a limited set of statistical methods and it is impossible to write your own algorithms in them. The mixed type combines the convenience of a GUI shell with the power of a programming language. In a detailed comparison of statistical functions, however, both SAS and STATA lose out to the programming languages R and MatLab (comparison of the statistical methods of R, MatLab, STATA, SAS, SPSS). In addition, you have to pay a considerable amount to license these programs; the only free alternative is Rcmdr, a GUI wrapper for R (R Commander).

Comparison of R with the programming languages MatLab, Python and Julia

Among the programming languages used in statistical computing, R and MatLab hold the leading positions. They are similar in appearance and functionality, but they have different user communities, which shape their specifics. Historically, MatLab has focused on the applied sciences and engineering, so its strengths are mathematical modeling and computation, and it is much faster than R! R, on the other hand, was developed as a specialized language for statistical computing, so many experimental statistical methods first appeared and took hold in it. This fact, together with its zero cost, made R an ideal platform for developing and using new packages for basic science.

Other "competing" languages are Python and Julia. In my opinion, Python, as a general-purpose programming language, is better suited to computing and to gathering information with web technologies than to statistical analysis and visualization (the main differences between R and Python are well described elsewhere). The statistical language Julia, in turn, is a fairly young and ambitious project. Its main feature is computational speed, which in some tests is 100 times that of R! Although Julia is in the early stages of development and has few additional packages and followers, it may be R's only potential competitor in the long run.


Thus, R is currently one of the world's leading statistical tools. It is actively used in genetics, molecular biology and bioinformatics, environmental science (ecology, meteorology) and agriculture. R is also increasingly used in medical data processing, displacing commercial packages such as SAS and SPSS from the market.

Advantages of the R environment:

  • free and cross-platform;
  • rich arsenal of statistical methods;
  • high-quality vector graphics;
  • more than 7000 checked packages;
  • flexible to use:
    - allows creating / editing scripts and packages,
    - interacts with other languages like C, Java and Python,
    - can work with data formats for SAS, SPSS and STATA;
  • active community of users and developers;
  • regular updates, good documentation and technical support.
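
The point about foreign data formats can be illustrated with the `foreign` package, a recommended package that is normally distributed with R. This is a sketch under that assumption; the small data frame is made up, and a round trip through a temporary Stata file stands in for a real data set.

```r
library(foreign)   # recommended package, usually installed together with R

# Write a small, made-up data frame to a Stata file, then read it back
df <- data.frame(id = 1:3, score = c(2.5, 3.7, 1.2))
path <- tempfile(fileext = ".dta")
write.dta(df, path)
df_back <- read.dta(path)
str(df_back)

# read.spss() and read.xport() from the same package read
# SPSS files and SAS transport files, respectively
```

Newer file versions of these formats may need other packages; `foreign` is used here only because it requires no additional installation.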


Disadvantages of the R environment:

  • a small amount of information in Russian (although several training courses and interesting books have appeared in the last five years);
  • relative difficulty of use for a user unfamiliar with programming languages. This can be partly mitigated by working in the Rcmdr GUI shell mentioned above. However, you will still need the command line for non-standard solutions.

List of useful sources

  1. Official website: http://www.r-project.org/
  2. Starter Site: http://www.statmethods.net/
  3. One of the best reference works: The R Book, 2nd edition by Michael J. Crawley, 2012
  4. List of available literature in Russian + good blog

Robert I. Kabacoff, "R in Action. Analysis and visualization of data in the program R", DMK Press, 2014, 588 pages (4.39 MB, PDF)

R is a programming language and software environment for statistical computing, analysis and visualization. It offers a wide variety of methods for analyzing and processing data. It is suitable for any task in the field and runs on all major operating systems, with support for thousands of specialized modules and utilities. This makes R an indispensable tool for finding and extracting statistical information from a wide variety of data.

The featured book is an R textbook that describes more than 90 commonly used packages for practical work. It gives typical examples of R's statistical functions. It describes methods for processing messy and incomplete data, as well as data whose distribution differs from normal and which are difficult to handle with conventional methods. In addition to statistical analysis, it shows how to create attractive and complex charts for the visual representation of data. ISBN 978-5-947060-077-1

PART I. Getting started 27

Chapter 1. Introduction to R 30
1.1. Why use R? 32
1.2. Obtaining and installing R 35
1.3. Working with R 35
1.3.1. Getting started 36
1.3.2. Getting help 39
1.3.3. The workspace 40
1.3.4. Input and output 43
1.4. Packages 44
1.4.1. What are packages? 44
1.4.2. Installing a package 46
1.4.3. Loading a package 46
1.4.4. Learning about a package 46
1.5. Batch processing 47
1.6. Use output as input - reuse results 48
1.7. Working with Large Amounts of Data 49
1.8. Learn from example 49
1.9. Summary 51

Chapter 2. Create a data set 52
2.1. What is a data set? 53
2.2. Data structures 54
2.2.1. Vectors 55
2.2.2. Matrices 56
2.2.3. Arrays 58
2.2.4. Data frames 59
2.2.5. Factors 63
2.2.6. Lists 65
2.3. Data entry 67
2.3.1. Keyboard data entry 68
2.3.2. Importing data from a delimited text file 69
2.3.3. Import data from Excel 71
2.3.4. Importing data from XML files 72
2.3.5. Extracting Data from Web Pages 72
2.3.6. Importing data from SPSS 72
2.3.7. Importing data from SAS 73
2.3.8. Import data from Stata 73
2.3.9. Import data from netCDF 74
2.3.10. Import data from HDF5 74
2.3.11. Importing data from database management systems 75
2.3.12. Importing data with Stat / Transfer 77
2.4. Annotating data sets 77
2.4.1. Variable labels 78
2.4.2. Value labels 78
2.5. Useful functions for working with objects 79
2.6. Summary 80

Chapter 3. Getting started with charts 81
3.1. Working with Charts 82
3.2. Simple example 84
3.3. Graphics options 86
3.3.1. Symbols and lines 87
3.3.2. Colors 88
3.3.3. Features of text 90
3.3.4. Chart and margin sizes 93
3.4. Adding text, customized axes and legends 95
3.4.1. Headings 95
3.4.2. Axes 96
3.4.3. Reference lines 99
3.4.4. Legend 100
3.4.5. Notes 102
3.5. Combine diagrams 105
3.5.1. Full control over the layout of diagrams 110
3.9. Summary 112

Chapter 4. Basics of data management 113
4.1. Working example 113
4.2. Create new variables 116
4.3. Recode variables 117
4.4. Renaming Variables 119
4.5. Missing values 121
4.5.1. Recoding values to missing 122
4.5.2. Excluding missing values from analyses 122
4.6. Date values 124
4.6.1. Converting dates to character variables 126
4.6.2. Further information 126
4.7. Converting data from one type to another 127
4.8. Sort data 128
4.9. Merging data sets 129
4.9.1. Adding columns 129
4.9.2. Adding rows 130
4.10. Subsetting data sets 130
4.10.1. Selecting variables 130
4.10.2. Excluding variables 131
4.10.3. Selecting observations 132
4.10.4. The subset() function 133
4.10.5. Random samples 134
4.11. Using SQL statements to manipulate data frames 135
4.12. Summary 136

Chapter 5. More sophisticated methods of data management 137
5.1. The Data Management Challenge 138
5.2. Number and text functions 139
5.2.1. Math functions 139
5.2.2. Statistical Functions 140
5.2.3. Distribution functions 143
5.2.4. Text functions 148
5.2.5. Other useful functions 149
5.2.6. Applying functions to matrices and data frames 151
5.3. Solving our data management challenge 152
5.4. Control flow 157
5.4.1. Repetition and looping 158
5.4.2. Conditional execution 159
5.5. User-written functions 160
5.6. Aggregating and restructuring data 163
5.6.1. Transpose 163
5.6.2. Aggregating data 164
5.6.3. The reshape package 165
5.7. Summary 167

PART II. Basic Methods 169

Chapter 6. Basic graphs 171
6.1. Bar plots 172
6.1.1. Simple bar plots 172
6.1.2. Stacked and grouped bar plots 174
6.1.3. Mean bar plots 175
6.1.4. Tweaking bar plots 177
6.1.5. Spinograms 178
6.2. Pie charts 179
6.3. Histograms 182
6.4. Kernel density plots 185
6.5. Box plots 188
6.5.1. Using parallel box plots to compare groups 189
6.5.2. Violin plots 193
6.6. Dot plots 194
6.7. Summary 197

Chapter 7. Basic methods of statistical data processing 198
7.1. Descriptive Statistics 199
7.1.1. A kaleidoscope of methods 200
7.1.2. Compute Descriptive Statistics for Data Groups 204
7.1.3. Results visualization 208
7.2. Frequency and Contingency Tables 208
7.2.1. Creating Frequency Tables 209
7.2.2. Independence Tests 216
7.2.3. Relationship indicators 218
7.2.4. Results visualization 219
7.2.5. Converting Tables to Flat Files 219
7.3. Correlations 221
7.3.1. Correlation Types 222
7.3.2. Testing the Statistical Significance of Correlations 225
7.3.3. Visualize Correlations 228
7.4. t-tests 228
7.4.1. Independent-samples t-test 229
7.4.2. Dependent-samples t-test 230
7.4.3. When there are more than two groups 231
7.5. Nonparametric Tests of Intergroup Differences 231
7.5.1. Comparison of the two groups 231
7.5.2. Comparison of more than two groups 233
7.6. Visualizing Group Differences 236
7.7. Summary 236

PART III. Intermediate methods 237

Chapter 8. Regression 239
8.1. The Many Faces of Regression 241
8.1.1. OLS Regression Situations 242
8.1.2. What You Need To Know 244
8.2. OLS regression 244
8.2.1. Fitting regression models with lm() 245
8.2.2. Simple linear regression 247
8.2.3. Polynomial regression 250
8.2.4. Multiple linear regression 253
8.2.5. Multiple linear regression with interactions 257
8.3. Diagnosing Regression Models 259
8.3.1. Standardized Approach 260
8.3.2. Refined Approach 264
8.3.3. General verification of compliance with the requirements for linear models 272
8.3.4. Multicollinearity 273
8.4. Unusual observations 274
8.4.1. Outliers 275
8.4.2. High-leverage points 275
8.4.3. Influential observations 277
8.5. Corrective measures 281
8.5.1. Deleting observations 281
8.5.2. Convert variables 281
8.5.3. Adding or Removing Variables 284
8.5.4. Try a different approach 284
8.6. Choosing the "best" regression model 285
8.6.1. Comparing models 285
8.6.2. Variable selection 286
8.7. Taking the analysis further 291
8.7.1. Cross Validation 292
8.7.2. Relative importance 294
8.8. Summary 298

Chapter 9. Analysis of variance 299
9.1. Crash course in terminology 300
9.2. Fitting ANOVA Models 304
9.2.1. The aov() function 304
9.2.2. The order of formula terms 305
9.3. One-way ANOVA 307
9.3.1. Multiple comparisons 308
9.3.2. Assessing test assumptions 312
9.4. One-way analysis of covariance 314
9.4.1. Assessing test assumptions 316
9.4.2. Results visualization 317
9.5. Two-way ANOVA 318
9.6. Repeated measures ANOVA 323
9.7. Multivariate ANOVA 326
9.7.1. Assessing test assumptions 328
9.7.2. Robust multivariate ANOVA 330
9.8. ANOVA as regression 331
9.9. Summary 333

Chapter 10. Power analysis 335
10.1. A quick review of hypothesis testing 336
10.2. Power analysis with the pwr package 339
10.2.1. t-tests 340
10.2.2. ANOVA 342
10.2.3. Correlations 343
10.2.4. Linear models 344
10.2.5. Tests of proportions 345
10.2.6. Chi-square tests 346
10.2.7. Choosing an appropriate effect size in novel situations 348
10.3. Power analysis plots 350
10.4. Other packages 352
10.5. Summary 354

Chapter 11. Intermediate graphs 356
11.1. Scatterplots 357
11.1.1. Scatterplot Matrices 361
11.1.2. High-density scatterplots 367
11.1.3. 3-D scatterplots 370
11.1.4. Bubble Charts 375
11.2. Line Charts 377
11.3. Correlograms 382
11.4. Mosaic plots 388
11.5. Summary 391

Chapter 12. Resampling statistics and bootstrapping 392
12.1. Permutation tests 393
12.2. Permutation tests with the coin package 395
12.2.1. Independent two-sample and k-sample tests 397
12.2.2. Independence in contingency tables 399
12.2.3. Independence between numeric variables 400
12.2.4. Dependent two-sample and k-sample tests 400
12.2.5. Additional information 401
12.3. Permutation tests with the lmPerm package 401
12.3.1. Simple and polynomial regression 402
12.3.2. Multiple regression 403
12.3.3. One-way ANOVA and ANCOVA 404
12.3.4. Two-way ANOVA 405
12.4. Additional notes on permutation tests 407
12.5. Bootstrapping 408
12.6. Bootstrapping with the boot package 409
12.6.1. Bootstrapping a single statistic 411
12.6.2. Bootstrapping several statistics 413
12.7. Summary 416

PART IV. Advanced Techniques 417

Chapter 13. Generalized linear models 419
13.1. Generalized linear models and the glm() function 420
13.1.1. The glm() function 421
13.1.2. Supporting functions 423
13.1.3. Model fit and regression diagnostics 424
13.2. Logistic regression 425
13.2.1. Interpreting the model parameters 428
13.2.2. Assessing the impact of predictors on the probability of an outcome 430
13.2.3. Overdispersion 431
13.2.4. Additional methods 432
13.3. Poisson regression 433
13.3.1. Interpreting the model parameters 436
13.3.2. Overdispersion 437
13.3.3. Additional methods 439
13.4. Summary 442

Chapter 14. Principal component and factor analysis 443
14.1. Principal components and factor analysis in R 446
14.2. Principal components 447
14.2.1. Selecting the number of components to extract 449
14.2.2. Extracting principal components 451
14.2.3. Rotating principal components 455
14.2.4. Obtaining principal component scores 456
14.3. Exploratory factor analysis 459
14.3.1. Deciding how many common factors to extract 460
14.3.2. Extracting common factors 462
14.3.3. Rotating factors 463
14.3.4. Factor scores 467
14.3.5. Other packages for factor analysis 468
14.4. Other latent variable models 468
14.5. Summary 470

Chapter 15. Advanced techniques for dealing with missing data 472
15.1. Steps in dealing with missing data 474
15.2. Identifying missing values 476
15.3. Exploring missing-value patterns 477
15.3.1. Tabulating missing values 478
15.3.2. Exploring missing data visually 479
15.3.3. Using correlations to explore missing values 482
15.4. Understanding the sources and impact of missing data 484
15.5. Rational approach 486
15.6. Complete-case analysis (listwise deletion) 487
15.7. Multiple imputation 489
15.8. Other approaches to missing data 495
15.8.1. Pairwise deletion 496
15.8.2. Simple (non-stochastic) imputation 496
15.9. Summary 497

Chapter 16. Advanced graphics techniques 499
16.1. The four graphics systems in R 500
16.2. The lattice package 501
16.2.1. Conditioning variables 507
16.2.2. Panel functions 509
16.2.3. Grouping variables 512
16.2.4. Graphic parameters 518
16.2.5. Page arrangement 519
16.3. The ggplot2 package 520
16.4. Interactive graphs 526
16.4.1. Interacting with graphs: identifying points 527
16.4.2. The playwith package 527
16.4.3. The latticist package 529
16.4.4. Interactive graphics with the iplots package 530
16.4.5. The rggobi package 532
16.5. Summary 533
Epilogue: Hunt for the Rabbit 535

Appendix A.
Graphic user interfaces 539

Appendix B.
Setting up the initial configuration of the program 543

Appendix C.
Data export from R 545
C.1. Delimited text file 545
C.2. Excel spreadsheet 545
C.3. Other statistical programs 546

Appendix D.
Saving results in publishable quality 547
D.1. Creating a typeset-quality report with the Sweave package (R + LaTeX) 548
D.2. Working with OpenOffice using odfWeave 554
D.3. Comments 557

Appendix E.
Matrix algebra in R 558

Appendix F.
Packages mentioned in this book 561

Appendix G.
Working with Large Amounts of Data 570
G.1. Effective programming 571
G.2. Storing data outside of RAM 572
G.3. Big Data Analytics Packages 573

Appendix H.
Updating an R installation 574
References 576
Package and function index 581


Four good reasons to try this open source data analysis platform

You've probably heard of R. You may have read the related article by Sam Siewert. You know that R is a programming language and has something to do with statistics, but is it right for you?

The case for R

R is a statistics-oriented language. It can be seen as a competitor to analysis systems like SAS Analytics, not to mention simpler packages like StatSoft STATISTICA or Minitab. Many professional statisticians and analysts in government, business and the pharmaceutical industry solve their problems with products such as IBM SPSS or SAS without writing a line of R code. Hence, the decision to learn and use R is largely a matter of culture and professional preference for tools. I use different tools in my statistical consulting practice, but most of what I do is done in R. The following examples explain why.

  • R is a powerful scripting language. I was recently asked to analyze the results of a large-scale study. The researchers looked at 1,600 papers and coded their content according to multiple criteria; the number of criteria was really large, especially given the many variations and branches. When transferred to a Microsoft® Excel® spreadsheet, this data contained over 8,000 columns, most of which were blank. The researchers wanted to count totals across different categories and under different headings. R is a powerful scripting language and supports Perl-style regular expressions for text processing. Processing messy data requires a programming language. SAS and SPSS have scripting languages for tasks that cannot be solved with a pull-down menu, but R was created as a programming language in the first place, so it is a more suitable tool for this purpose.
  • R leads the way. Many new developments in statistics first appear as packages for the R platform ("R packages") and only later arrive on commercial platforms. I recently received data from a medical study on patient recall. For each patient, the data included the number of treatment items suggested by the doctor and the number of items the patient actually remembered. The natural model for this situation is the so-called beta-binomial distribution. It has been known since the 1950s, but estimation methods that relate the model to covariates of interest are recent. Such data are usually handled with so-called generalized estimating equations (GEE), but GEE methods are asymptotic and assume that the sample is large. I needed a generalized linear model with a beta-binomial distribution. One of the newer R packages estimates this model: the betabinom package by Ben Bolker. SPSS does not have this feature.
  • Integration with document publishing tools. R integrates seamlessly with document publishing systems, so that statistical results and graphics from the R environment can be embedded in publication-quality documents. Not everyone needs this feature, but if you want to write a book about your data analysis, or just don't want to copy your results into word processing documents, using R with LaTeX is the shortest and most elegant way to go.
  • Free. I'm the owner of a small company, so I love that R is free. Even for a larger company, it is good to know that a specialist hired on a temporary basis can immediately be given a workstation with advanced analysis software, without having to worry about the budget.
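
The kind of scripting described in the first point can be sketched as follows. The tiny data frame and its column names are invented stand-ins for the real 8,000-column spreadsheet, which the article does not show; only base R functions are used.

```r
# A tiny, made-up stand-in for the coding sheet: 1 = criterion present, NA = blank
sheet <- data.frame(
  methods_survey    = c(1, NA, 1),
  methods_interview = c(NA, 1, 1),
  topic_health      = c(1, 1, NA),
  topic_education   = c(NA, NA, 1),
  notes             = c(NA, NA, NA)
)

# Perl-style regular expression: columns whose name starts with "methods_"
is_method <- grepl("^methods_\\w+$", names(sheet), perl = TRUE)

# Count non-blank cells per paper across the matching columns
method_counts <- rowSums(!is.na(sheet[, is_method, drop = FALSE]))
method_counts

# Totals per category, where the category is the text before the first underscore
prefix <- sub("_.*$", "", names(sheet), perl = TRUE)
totals <- tapply(colSums(!is.na(sheet)), prefix, sum)
totals
```

Because the column selection is driven by a pattern rather than by position, the same few lines would apply unchanged to a sheet with thousands of columns, provided the naming convention holds.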

What is R and what is it for?

140 character explanation

R is an open source implementation of the S language, a programming environment for data analysis and graphics.

As a programming language, R is similar to many other languages. Anyone who has ever written code will find much that is familiar in R. What sets R apart is the statistical philosophy that it embraces.

The statistical revolution: S and exploratory data analysis

Computers have always been an efficient tool for computation, but only after someone has written and debugged a program to run the desired algorithm. In the 1960s and 1970s, however, computers were still very weak at displaying information, particularly graphical information. These technical limitations, together with trends in statistical theory, led statistical practice and the training of statisticians to focus on building models and testing hypotheses. In this world, researchers formulated hypotheses, designed experiments, fitted models, and performed tests. A similar approach is implemented in software tools such as SPSS, which are spreadsheet based and menu driven. In fact, the first versions of the SPSS and SAS Analytics software consisted of subroutines that could be called from a main program (in Fortran or another language) to fit and test a model from an existing set of models.

Into this formalized and theory-heavy setting, John Tukey threw the concept of exploratory data analysis (EDA) like a cobblestone through a glass window. Nowadays, it's hard to imagine analyzing a data set without using a box plot to check for skewness and outliers, or without checking the residuals of a linear model for normality using a quantile plot. The author of all these ideas was J. Tukey, and today no introductory course in statistics is complete without them. However, this has not always been the case.

Quote from the book Graphical Methods for Data Analysis

"In any serious application, you need to look at the data in different ways, then draw multiple graphs and do multiple studies. That way, you can choose the next step from each step. For data analysis to be effective, it must be iterative." - John Chambers

EDA is more of an approach than a theory. To successfully apply this approach, the following rules of thumb must be followed.

  • Whenever possible, use graphs to explore the features of interest.
  • Always run the analysis incrementally. Try a model. Based on the results obtained, adjust the following model.
  • Check the model assumptions with graphics. Watch out for any outliers.
  • Use robust methods to neutralize deviations from distributional assumptions.
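
The four rules above might translate into an R session along the following lines. This is a sketch, not a prescribed workflow: the built-in `cars` data set and the choice of a quadratic term are assumptions made for the illustration.

```r
pdf(NULL)   # draw to a null device so the sketch also runs non-interactively

# 1. Look at the data first: box plot to check skewness and outliers
boxplot(cars$dist, horizontal = TRUE)

# 2. Work incrementally: start with a simple model ...
fit1 <- lm(dist ~ speed, data = cars)
# ... then adjust the next model in light of the first one
fit2 <- lm(dist ~ speed + I(speed^2), data = cars)
anova(fit1, fit2)

# 3. Check the model assumptions graphically
qqnorm(residuals(fit2))
qqline(residuals(fit2))
plot(fitted(fit2), residuals(fit2))

# 4. Compare classical and robust summaries when outliers are a concern
c(mean = mean(cars$dist), median = median(cars$dist))
c(sd = sd(cars$dist), mad = mad(cars$dist))

dev.off()
```

In an interactive session the `pdf(NULL)` and `dev.off()` lines would be dropped, so that each plot appears on screen and can guide the next step.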

J. Tukey's approach has spawned a wave of new graphical methods and robust estimates. In addition, this approach initiated the development of a new software environment that focuses on exploration methods.

John Chambers, along with his colleagues at Bell Laboratories, created the S language as a platform for statistical analysis, particularly the kind that J. Tukey advocated. The first version of S, intended for internal use at Bell, was developed back in 1976, but it wasn't until 1988 that the language took its current form. At that point, the language also became available to users outside of Bell. In all respects, S corresponds to the "new model" of data analysis.

  • S is an interpreted language that works in a programming environment. The S syntax is similar to C syntax, but without the complexity. For example, S takes care of memory management and variable declarations, so the user doesn't have to write or debug such things. With less programming overhead, you can quickly run multiple analyses on the same data set.
  • From the start, the S language made it possible to create high-level graphics and to add elements to any open graphics window. The language makes it easy to highlight points of interest, query their values, add smooth curves to a scatter plot, and much more.
  • In 1992, object orientation was added to the S language. In a programming language, objects structure data and functions according to the user's intuition. Human thinking is always object-oriented, and statistical reasoning especially so. The statistician works with frequency tables, time series, matrices, data tables with different data types, models, and so on. In each case the raw numbers carry attributes and come with certain expectations. A time series, for example, consists of observations and the corresponding points in time. Standard statistics and charts are expected for each data type. For a time series, you can create a time series plot and a correlogram. For a fitted model, you can obtain fitted values and residuals. The S language allows you to create objects for all of these concepts. You can create new object classes as needed. Objects ease the transition from conceptualizing a problem to implementing it in code.
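
R inherited this object model from S, so the idea can be sketched in R. The numbers in the time series and the `temperature` class are hypothetical; they only illustrate how an object carries its data together with the behavior expected of it.

```r
# A time series object bundles observations with their time points
sales <- ts(c(12, 15, 14, 18, 21, 19, 24, 27), start = c(2020, 1), frequency = 4)
class(sales)      # "ts"
time(sales)       # the time points travel with the data

pdf(NULL)         # null device, so the plots also work non-interactively
plot(sales)       # plot() dispatches to the time series method
acf(sales)        # correlogram
dev.off()

# Defining a new class: a hypothetical 'temperature' object
temperature <- function(value, unit = "C") {
  structure(list(value = value, unit = unit), class = "temperature")
}

# A print method for the new class, found by method dispatch
print.temperature <- function(x, ...) {
  cat(format(x$value), "degrees", x$unit, "\n")
  invisible(x)
}

t1 <- temperature(21.5)
print(t1)
```

The generic functions `plot()` and `print()` pick the method that matches the class of their argument, which is how one function name can do the expected thing for each data type.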

A language with character: S, S-Plus and hypothesis testing

In its original form, the S language took Tukey's EDA methods very seriously, to the extent that it was impractical to do anything in S other than EDA. It was a language with character. For example, S had a number of useful built-in functions, but it lacked some very obvious capabilities that one would expect from statistical software. There was no function for a two-sample t-test, nor a real hypothesis test of any kind. Despite J. Tukey's reasoning, hypothesis testing is often very useful.

In 1988, a Seattle-based company called Statistical Science licensed S and ported an improved version of the language, called S-Plus, to the DOS platform and then to the Windows® environment. With a realistic understanding of customer needs, Statistical Science added classical statistical functions to S-Plus: analysis of variance (ANOVA), the t-test and other models. In keeping with the object orientation of S, the result of a fitted model is itself an S object. Calls to the corresponding functions return fitted values, residuals and the p-value of a hypothesis test. The model object can even contain intermediate results of the computation, such as the QR decomposition of the design matrix (where Q is an orthogonal matrix and R is an upper triangular matrix).
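
The same design carries over to R, where it can be inspected directly. The sketch below uses R rather than S-Plus and the built-in `cars` data set, both of which are assumptions made for the illustration.

```r
# Fit a linear model; the result is an object of class "lm"
fit <- lm(dist ~ speed, data = cars)
class(fit)

# Extractor functions return pieces of the fitted model
head(fitted(fit))      # fitted values
head(residuals(fit))   # residuals

# p-values of the coefficient t-tests are stored in the summary object
coefs <- summary(fit)$coefficients
coefs[, "Pr(>|t|)"]

# The QR decomposition of the design matrix is kept inside the model object
Q <- qr.Q(fit$qr)
R <- qr.R(fit$qr)
all.equal(crossprod(Q), diag(ncol(Q)), check.attributes = FALSE)  # orthonormal columns
all(R[lower.tri(R)] == 0)                                          # upper triangular
```

Note that `qr.Q()` returns the thin form of Q here (one column per model coefficient), so the check is on orthonormal columns rather than on a full square orthogonal matrix.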

There is an R package for every task! Open source community

Around the same time that S-Plus was released, Ross Ihaka and Robert Gentleman of the University of Auckland in New Zealand decided to write an interpreter. They chose the S language as their model. The project took shape and attracted support. They named it R.

R is an implementation of the S language that also includes models developed for S-Plus. In some cases, the same people worked on the models in both languages. R is an open source project available under the GNU General Public License. On this basis, R evolves mainly through the addition of packages. An R package is a collection of data sets, R functions, documentation, and dynamically loadable code written in C or Fortran. A package is installed as a unit and then made available within an R session. R packages add new functionality to the R language, and they let researchers easily share computational methods with their colleagues. Some packages are narrow in scope, others cover whole areas of statistics, and some reflect the latest developments. Indeed, many new developments in statistics appear first as R packages and only later in commercial software products.
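
As a rough sketch of what "available within an R session" means, the lines below attach a package and inspect what it provides. The `splines` package is used only because it ships with every standard R installation, so nothing needs to be downloaded; for a package from CRAN you would first run `install.packages()` with the package name, which requires network access and is therefore left commented out.

```r
# install.packages("ggplot2")  # one-time download from CRAN (not run here)

# Before attaching, the package is not on the search path.
before <- "package:splines" %in% search()

library(splines)  # attach the package for this session

after <- "package:splines" %in% search()

# Functions the package exports, and its metadata.
exported <- ls("package:splines")
desc <- packageDescription("splines")

print(head(exported))
print(desc$Version)
```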

At the time I wrote this, there were 4,701 R packages on the CRAN website, from which R itself can also be downloaded. Six of these packages were added on that day alone. R has a package for every task, or at least that's the impression you get.

What happens when using R

Note: This article is not a tutorial on R. The following example is nothing more than an attempt to show what an R session looks like.

R binaries are available for Windows, Mac OS X, and various Linux® flavors. In addition, the source code is available for those who want to compile it themselves.

On Windows®, the installation program adds an R item to the Start menu. To run R on Linux, open a terminal window and type R at the prompt. You should see something similar to Figure 1.

Figure 1. The R workspace

Enter the command at the command prompt and R will respond accordingly.

In real life, you would likely be reading data from an external file into an R object at this point. R can read data in various formats. In this example, however, I am using the michelson data set from the MASS package. This package accompanies Venables and Ripley's landmark book Modern Applied Statistics with S-Plus. The michelson data set contains the results of the well-known Michelson-Morley experiments for measuring the speed of light.
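
For the external-file case mentioned above, a minimal sketch could look like the following. The file is a temporary CSV created on the spot so that the example is self-contained; its column names and values are made up for illustration and are not the michelson data.

```r
# Write a small CSV file to a temporary location (made-up values).
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Expt,Speed",
             "1,850",
             "1,740",
             "2,960",
             "2,940"), csv_path)

# Read it into a data frame, the standard R object for tabular data.
measurements <- read.csv(csv_path)

# Treat the experiment number as a grouping factor, not a quantity.
measurements$Expt <- factor(measurements$Expt)

str(measurements)
```

`read.csv()` is part of base R; other formats are usually handled by similar functions, some of them supplied by add-on packages.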

Now let's look at the data (Listing 2). The results are shown in Figure 3.

Listing 2. Box plot in R
library(MASS)  # provides the michelson data set

# Basic box plot
with(michelson, boxplot(Speed ~ Expt))

# I can add color and labels. I can also save the results as an object.
michelson.bp = with(michelson, boxplot(Speed ~ Expt, xlab = "Experiment", las = 1,
    ylab = "Speed of light - 299,000 km/s",
    main = "Michelson-Morley experiments",
    col = "slateblue1"))

# The current estimate of the speed of light on this scale is 734.5.
# Add a horizontal line to highlight this value.
abline(h = 734.5, lwd = 2, col = "purple")  # Add modern speed of light

It seems that Michelson and Morley systematically overestimated the speed of light. In addition, the experimental results show a certain heterogeneity.

Figure 3. Box plot

If I'm satisfied with my analysis, I can save all of my commands as a single R function (Listing 3).

Listing 3. Simple R function
MyExample = function() {
    library(MASS)    # provides the michelson data set
    data(michelson)
    michelson.bw = with(michelson, boxplot(Speed ~ Expt, xlab = "Experiment", las = 1,
        ylab = "Speed of light - 299,000 km/s",
        main = "Michelson-Morley experiments",
        col = "slateblue1"))
    # Horizontal line at the current estimate of the speed of light
    abline(h = 734.5, lwd = 2, col = "purple")
}

This simple example shows some important features of R.

Does R need powerful hardware?

I ran this example on an Acer netbook running Crunchbang Linux. R does not require a powerful computer for small to medium-sized analyses. For 20 years R has been considered slow because it is an interpreted language, and the amount of data it can analyze is limited by the computer's memory. All of this is true, but on modern computers it is usually not critical, provided the application isn't truly huge (that is, it doesn't fall into the big data category).
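
To get a feeling for the memory limit, you can ask R how much space an object occupies. The sketch below is only an illustration: the vector length of one million is an arbitrary choice, and the timing will differ from machine to machine.

```r
# One million random numbers, held entirely in memory.
x <- rnorm(1e6)

# Memory used by the vector (each double takes 8 bytes, plus a small header).
size_bytes <- as.numeric(object.size(x))
print(format(object.size(x), units = "MB"))

# Time a simple computation over the whole vector.
elapsed <- system.time(m <- mean(x))[["elapsed"]]
print(elapsed)
```

A data set is "too big" for plain R roughly when objects like `x`, plus the copies made during analysis, no longer fit in RAM.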

R remains relevant in the 21st century

J. Tukey's exploratory approach to data analysis has become a standard part of the curriculum. It is taught in schools and universities and used by statisticians. The R language supports this approach, which is one of the reasons it remains popular to this day. Object orientation also helps keep R relevant, because new data sources require new data structures for their analysis. InfoSphere® Streams now supports R analytics on data quite different from anything John Chambers had in mind.

R language and the InfoSphere Streams platform

InfoSphere Streams is a computing platform and integrated development environment for analyzing data flowing at high speed from thousands of sources. The content of these data streams is usually unstructured or partially structured. The purpose of analysis is to identify changing patterns in data and make decisions directly based on rapidly changing events. The SPL programming language for InfoSphere Streams organizes data using a paradigm that reflects the dynamics of data and the need for rapid analysis and response.

We've come a long way from the spreadsheets and simple flat files of classic statistical analysis, but R is adaptable. As of version 3.1, SPL applications can pass data to R and thus take advantage of the extensive R package library. InfoSphere Streams supports R analytics by creating appropriate R objects to receive the information contained in an SPL tuple (the basic data structure in SPL). This allows InfoSphere Streams data to be passed to R for further analysis and the results to be returned to the SPL application.

For which cases is R not suitable

To be fair, it should be noted that there are some things R does not do very well, or at all. In addition, R is not equally suitable for every user.

  • R is not a data store. The easiest way to get data into R is to enter it somewhere else and then import it. Attempts have been made to add a spreadsheet front end to R, but none has gained popularity. The lack of spreadsheet functions not only makes data entry difficult, it also makes it harder to look at the data in R than in SPSS or Excel.
  • R makes ordinary tasks difficult. For example, in medical research, the first step in data processing is to compute summary statistics for all variables and to list missing answers and missing data. In SPSS, this takes literally three clicks, but R doesn't have a built-in function that calculates this obvious information and then displays it in tabular form. The code you need is simple enough to write yourself, but sometimes you want things like this to be done with a click of the mouse.
  • Learning the R language is not trivial. A beginner can open a menu-driven statistics platform and get a result in minutes. Not everyone wants to become a programmer in order to be an analyst, and perhaps not everyone needs to.
  • R is open source. The R community is large, mature, and active. Without a doubt, R is one of the most successful open source projects. As I said, the R implementation is more than 20 years old, and the S language is older still. It's a proven concept and a proven product. As with any other open source product, however, reliability depends on transparency. We trust the code because we can inspect it ourselves and because other people can inspect it and report bugs. The situation is different with a company that takes responsibility for testing and validating its own software product. And for rarely used R packages, we do not have sufficient grounds to assume that they actually deliver correct results.
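
The do-it-yourself code mentioned in the second point might look roughly like this. It is a minimal sketch, not a standard R function: the name `summarize_variables` is hypothetical, it handles numeric columns only, and it is demonstrated on the `airquality` data set that ships with R rather than on medical data.

```r
# Hypothetical helper: one row per numeric variable, with the number of
# missing values, the mean, and the standard deviation.
summarize_variables <- function(df) {
  numeric_cols <- df[vapply(df, is.numeric, logical(1))]
  data.frame(
    variable  = names(numeric_cols),
    n_missing = vapply(numeric_cols, function(v) sum(is.na(v)), integer(1)),
    mean      = vapply(numeric_cols, mean, numeric(1), na.rm = TRUE),
    sd        = vapply(numeric_cols, sd, numeric(1), na.rm = TRUE),
    row.names = NULL
  )
}

# 'airquality' contains missing values in some of its columns.
overview <- summarize_variables(airquality)
print(overview)
```

Once written, such a helper can be kept in a personal script or package, which narrows the gap to the "three clicks" of a menu-driven tool.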


Do you need to learn the R language? Perhaps not; "need" is too strong a word. But is R a valuable tool for data analysis? No doubt. The language is specifically designed to reflect the way statisticians think and work. R reinforces good habits and improves analysis. In my opinion, it is the right tool for this type of work.