Python for R programmers

Python has experienced a massive boom in recent years, driven largely by its intensive use in data science, machine learning and deep learning. Even though R is growing in a similar way, it is always an advantage for professional users to know several languages. The following gives R programmers an introduction to the conceptual similarities and differences between the two languages. Conversely, these are of course also of interest to Python experts who want to get to grips with R.

Similarities:

Python and R are interpreted languages

Python is a fairly classic programming language that originally emerged, among other things, as a Unix hacking language. It is interpreted: you do not write code that is compiled (as with C, Java or Scala), but a script that is processed statement by statement, as in MATLAB or R. Functions and classes can be loaded from other scripts or from packages that are either located in a specified folder or installed through the packaging system. Furthermore, an interpreted language does not require static typing, i.e. explicitly specifying the type of an object (integer, double, string, array, etc.) when it is initialized. In this respect, Python does not differ from R. This is also an essential reason why Python can be used for data analysis and statistical modeling at all: you write a line of code and send it to the Python console, just as you would send it to the R console. This makes it possible to work with the language interactively. You can look at data, graphics and results between each step, just like in R, which enables simple exploratory data analysis as well as convenient model development. This is not easily possible in compiled languages. There you have to write code, compile it, and only then can you look at the results. The intermediate step of error-free compilation complicates data analysis immensely, at least if compilation does not take place at runtime.
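As a small illustration of working without static typing, here is a minimal sketch. The variable names and values are made up for this example; each statement could be sent to the Python console on its own and inspected immediately.

```python
# No type declarations: a name is simply bound to whatever object is assigned.
x = 42
type_before = type(x).__name__    # name of the current type of x

x = "forty-two"                   # the same name can later refer to a string
type_after = type(x).__name__

# Intermediate results can be inspected after every single step.
values = [1.5, 2.5, 4.0]          # illustrative data
mean_value = sum(values) / len(values)
print(type_before, type_after, mean_value)
```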

Python and R have an extensive universe of open source packages and an active community

Python and R both have systems for easily creating, distributing and managing packages. This, together with the general success of both languages, has led to extensive package repositories (CRAN and PyPI) that are constantly growing. Both languages are used intensively and are still growing, which leads not only to this large number of packages but also to excellent community support. Python is now even the leader on Stack Overflow in terms of the number of questions and answers.

Python and R have a few years under their belt

The development of Python began in 1989 (fun fact: as a hobby project over the Christmas holidays), that of R in 1992. This explains why the two languages already have such a wide range of packages and such large communities. It also explains why both carry a few concepts from their early days that may seem somewhat hard to understand today. However, both languages are still being actively developed.

Python and R are slow

Python and R are interpreted, and neither is entirely new. This combination leads to massive speed disadvantages compared, for example, with a compiled language such as C, or with a modern language without static typing that is compiled at runtime, such as Julia, which cleverly combines the advantages of interpreted and compiled languages. With libraries such as Cython or Rcpp, this disadvantage can be offset to a certain extent in both languages by moving parts of the code to compiled languages.

Differences:

R is a data analysis language that you can program with - Python is a programming language that you can use to do data analysis

Python emerged as a new scripting language in the early 1990s with no data analysis in mind. The functionality that makes Python the data science language it is today was only added much later, through packages. It was not until the mid-1990s that packages focusing on numerical programming were first offered. NumPy, which is so popular today, was first released in 2006, pandas in 2008, and scikit-learn not until 2010. For R users, this leads to the unusual situation that in Python even simple matrix calculations or linear models require loading quite a few packages, and that the language originally had no operators for such calculations (a dedicated matrix multiplication operator, @, was only added with Python 3.5). For a simple matrix multiplication of the matrices A and B, one conventionally writes something like np.matmul(A, B), after importing NumPy under the abbreviation np, instead of "simply" (as in R) A %*% B.
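The NumPy side of this comparison can be sketched as follows. The matrices A and B contain made-up values chosen purely for illustration, and np.matmul is one of several NumPy functions that could be used here; the R counterpart would be A %*% B.

```python
import numpy as np

# Two small example matrices (values chosen only for illustration).
A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])

# Matrix product via a function call from the NumPy package.
C = np.matmul(A, B)

# Since Python 3.5, the @ operator expresses the same product.
C_op = A @ B

print(C)
```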

Especially at the beginning, the constant use of the abbreviation "np" for NumPy can be a bit tiresome. For all data analysis functions you have to take these detours via packages, whereas in R they often already exist in base R. Conversely, R has many inconveniences and peculiarities in classic programming tasks that do not occur in Python. In addition, the fact that Python is a "general purpose" language has many advantages: it is easier to integrate into other parts of a program, and many classic programming concepts are implemented more fundamentally in Python than in R. One example of this is OOP:

R is (among other things) functional, and a little bit object-oriented - Python is (among other things) object-oriented, and a little bit functional

R is functional. As a user, you notice this in the much higher speed of all variants of apply() compared to simple for loops. Developers who refuse to use functions are likely to run into problems writing efficient code fairly soon. In Python, some functional programming concepts were implemented relatively early on. There are, for example, the classic functional programming tools lambda, map, filter and reduce, as they also exist in R. Python is also famous for its list and dict comprehensions, which work similarly to an apply on an R vector.

A comprehension is essentially a very elegant syntactic simplification of map and filter. In R, one would work with vectorization and subsetting instead.
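The relationship between comprehensions and map/filter can be sketched as follows. The list numbers and the task (squaring the even values) are assumptions made for this illustration; in R, the vectorized counterpart would be roughly numbers[numbers %% 2 == 0]^2.

```python
from functools import reduce

numbers = [1, 2, 3, 4, 5, 6]  # illustrative input

# List comprehension: square the even numbers.
squares_of_even = [n ** 2 for n in numbers if n % 2 == 0]

# The same result with the classic functional tools map and filter.
via_map_filter = list(
    map(lambda n: n ** 2, filter(lambda n: n % 2 == 0, numbers))
)

# Dict comprehension: look up the square of each number.
square_lookup = {n: n ** 2 for n in numbers}

# reduce folds a sequence into a single value, here the sum of the squares.
total = reduce(lambda acc, n: acc + n, squares_of_even, 0)

print(squares_of_even, via_map_filter, total)
```

The comprehension and the map/filter version produce the same list; the comprehension is usually considered the more readable of the two.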

Comprehensions can greatly simplify and shorten many things. They are considered "Pythonic" style, but they are not much faster than a classic for loop. Even when working with DataFrames from the pandas world, you cannot generally say that apply is faster than iterating over the axes! This is a big difference to R, where iterating over the axes is "forbidden". Other functional programming concepts known from R, such as closures, are also available in Python. However, Python is not a fully functional language. It lacks much of the tooling for functions found in purely functional languages such as Haskell. For example, there is no classic Haskell-style pattern matching (nor is there in R), although Python 3.10 and later offer a match statement for structural pattern matching. Meanwhile, R is functional at its core, which shows in notations that would be impossible in Python; for example, even operators are ordinary functions that can be called by name.

Python has a stronger focus on classic object-oriented paradigms, which are reproduced in a rather unusual way in R. R currently has four class systems: S3, S4, Reference Classes and R6. The first two implement a completely different concept of classes; the latter two are much closer to the concepts commonly used in Python (and other languages).

Here is a small class and its application:
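A minimal sketch of such a class is shown below. The attribute names, the method names and all people except 'Julia' are made up for illustration and are not taken from the original listing.

```python
class INWT:
    """Small example class holding a few groups of people."""

    def __init__(self):
        # Public attributes: reachable and changeable from outside.
        self.data_scientists = ["Anna", "Ben"]   # hypothetical names
        self.developers = ["Chris"]              # hypothetical name
        # Double underscore: private by convention (the name is mangled).
        self.__boss = ["Dana"]                   # hypothetical name

    def all_people(self):
        """Return everyone, including the private attribute."""
        return self.data_scientists + self.developers + self.__boss

    def print_people(self):
        """Print all people, one per line."""
        for person in self.all_people():
            print(person)


inwt = INWT()
inwt.data_scientists.append("Julia")  # changed from outside, no setter needed
inwt.print_people()
print(inwt)  # default representation: class name and memory address
```

Note that the privacy of __boss is only a convention: Python stores the attribute under the mangled name _INWT__boss, so it is hidden from casual access rather than strictly protected.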

A class INWT is defined here, and lists of a few people are added to it as attributes. There is also a print method that outputs all people. All attributes can be accessed and changed from outside using a dot as the separator ('Julia' is added without further ado); only inwt.__boss is private by convention (indicated by the double underscore). Setters and getters are not needed (unlike in Java, for example). Everything that belongs to the class is written down within the class definition. Calling print on the object itself, on the other hand, only returns information about the object and its memory address.

The same in R as an S3 class:

Not everything is specified here when the class is defined and initialized. The "class" can be attached later, and generic methods can also be added later. Objects can be changed as well. This system (like S4) is fundamentally different from Python's and also stems from the functional point of view. Generic methods behave differently for each class and can be defined individually. This is not the case in Python, where each method can only be reached through an instance of its class. These differences are less pronounced with Reference Classes; there, too, all parts are defined within the class, but they are also typed. Overall, object orientation is much more deeply integrated into the Python philosophy, and as a developer you cannot avoid these concepts. But the S3 and S4 systems are also very popular in R and are actively promoted.

Data science is more consolidated in Python and more flexible in R.

R offers several packages with different philosophies for almost every task, e.g. base, dplyr, data.table etc. for data handling. At the same time, an enormous number of models are available in R packages, but these can have completely different syntax. This is much less the case in Python. De facto, only pandas is used for data handling; for modeling, scikit-learn and its syntax have become established to a very large extent. New models very often offer a wrapper for it. The syntax for definition, fitting and prediction is the same for all models within scikit-learn!

The syntax for definition, fitting and prediction of a "model" with data "X" and "y" (as numpy arrays) and new data "X_hat" looks like this:
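A minimal sketch of this pattern, assuming scikit-learn and NumPy are installed. LinearRegression is picked here only as one example estimator, and the toy data are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data, invented for illustration: y = 2 * x + 1 without noise.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
X_hat = np.array([[4.0], [5.0]])

model = LinearRegression()       # 1. define the model
model.fit(X, y)                  # 2. fit it to the training data
y_hat = model.predict(X_hat)     # 3. predict for new data

print(y_hat)
```

Swapping in another scikit-learn estimator would, under the same assumptions, only change the line that defines the model; the fit and predict calls stay as they are.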

This is the same for ALL models offered! At the same time, scikit-learn offers a wide range of additional methods for data handling (e.g. missing values, scaling), for model fitting (e.g. cross-validation methods, grid search) and for prediction and deployment (e.g. pipelines). All of this is a single package that also contains many models. Nothing comparable is available in R, or at most in packages such as caret, which, however, add overhead for deployment. Instead, R lets you choose from a large number of different approaches. Nevertheless, the strong consolidation in Python is quite convenient; one syntax was able to establish itself within a very short time. However, this is not (yet) the case in deep learning, for example, where different approaches and frameworks (Keras, TensorFlow, PyTorch, etc.) compete with one another.
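The additional scikit-learn tooling mentioned above (scaling, cross-validation, pipelines) can be sketched as follows. The choice of Ridge as the model, the simulated data and all parameter values are assumptions made for this illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Simulated regression data (coefficients and noise level are made up).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# A pipeline chains preprocessing and model into one estimator object.
pipeline = make_pipeline(StandardScaler(), Ridge(alpha=1.0))

# 5-fold cross-validation; for regressors the default score is R^2.
scores = cross_val_score(pipeline, X, y, cv=5)

# The pipeline itself follows the same fit/predict syntax as a single model.
pipeline.fit(X, y)
y_hat = pipeline.predict(X[:5])

print(scores.mean())
```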

Conclusion

Python and R have become the standard data science languages. Both are interpreted open source languages with a strong, active community, both boast a flourishing package ecosystem including high-performance data handling and modeling packages, and both are being actively developed. The differences between the two languages stem from their different histories, but they are not a major hurdle for data scientists who want to think outside the box.