Which is easier to learn: R or Python?

This blog post is aimed at data science beginners who face the choice of which programming language to learn first. At STATWORX, we work with the two most popular languages, R and Python. Both have their strengths and weaknesses, which is why you should ideally master both. To get started, we recommend learning one language first and then moving on to the other. To make it easier to decide which one to start with, I'll introduce both and then compare them.

Overview of R and Python

Both Python and R are open-source programming languages. This means that their source code is publicly available and can be used free of charge. While Python is a general-purpose programming language, R was developed for statistical analysis. As a result, users of the two languages often have different backgrounds. Generally speaking, software developers tend to use Python and statisticians tend to use R.

                      R              Python
First release         1993           1991
Developer             R Core Team    Python Software Foundation
Package management    CRAN           Conda (recommended for beginners)

A wealth of extensions

Both languages have a basic set of functions that can be extended with packages.

The Comprehensive R Archive Network (CRAN) is a platform for R packages. To publish a package on CRAN, a number of guidelines must be followed. In this way, CRAN ensures that the packages available for download there actually work. More than 10,000 packages are available on CRAN. Since R is the standard language for statisticians, you can find a suitable solution on CRAN for almost every problem in the field of statistics. It is therefore exactly the right place to go for the latest statistical methods and analyses.

Python has two package management platforms: conda and PyPI (Python Package Index). There are also over 10,000 packages for Python which, in contrast to R, cover a very wide range of applications. Since complications can arise when Python packages are installed globally, virtual environments are used instead. They keep the packages of each project isolated and prevent conflicts between the dependencies of different packages. This extra step makes it harder for beginners to find their way around.
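To make the idea of a virtual environment more concrete, here is a minimal sketch using `venv` from Python's standard library. The helper name `create_project_env` and the `.venv` folder name are assumptions chosen for illustration; conda environments follow the same idea but are created with conda's own tooling.

```python
import tempfile
import venv
from pathlib import Path


def create_project_env(project_dir):
    """Create an isolated virtual environment in <project_dir>/.venv."""
    env_dir = Path(project_dir) / ".venv"
    # with_pip=False keeps this sketch fast; a real project usually wants with_pip=True
    venv.create(env_dir, with_pip=False)
    return env_dir


if __name__ == "__main__":
    # A temporary folder stands in for a project directory
    with tempfile.TemporaryDirectory() as project_dir:
        env_dir = create_project_env(project_dir)
        # pyvenv.cfg marks the folder as a virtual environment
        print((env_dir / "pyvenv.cfg").exists())
```

Packages installed into such an environment stay inside its folder, so two projects can depend on different versions of the same package without interfering with each other.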

With the help of packages, it is possible to execute Python code in R and vice versa. If you are interested, check out the blog post by my colleague Manuel, in which he introduces the package reticulate.

IDEs as an aid

Programmers often use an integrated development environment (IDE), which makes their work easier with small but useful tools.

For R users, RStudio has established itself as the standard IDE. The IDE is distributed by the company of the same name, which backs R commercially. RStudio not only offers a pleasant working environment, but also actively develops packages and extensions for the R language. These include important packages such as tidyverse, packrat and devtools as well as popular extensions like shiny (dashboards) and RMarkdown (reports).

Python users have the choice between different IDEs (PyCharm, Visual Studio Code, Spyder, ...). However, there is no company behind Python that is comparable to RStudio. Nevertheless, thanks to the efforts of the huge community and the Python Software Foundation, new extensions are constantly being developed for Python.

The art of data visualization

The most used packages for data visualization with Python are matplotlib and seaborn. Dashboards can be created in Python with dash.

But R has a trump card up its sleeve when it comes to data visualization: the package ggplot2, which is based on the book The Grammar of Graphics by Leland Wilkinson. With this package you can create appealing, tailor-made graphics, which you can make accessible to others on dashboards with the help of shiny.

Both programming languages make it easy to create beautiful graphics. Nevertheless, the R package ggplot2 stands out for its flexibility and its visual possibilities.

Plus points for readability

Python was designed under the motto "Readability counts". As a result, even people who are not familiar with the programming language can interpret what the code is doing.

This is probably not the case for R code, as the language is less intuitive than Python. Because of its good readability, Python offers an easier introduction to programming.
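A small illustration of what readability means in practice, using only built-in Python. The data and variable names are made up for this example.

```python
temperatures = [18.5, 21.0, 25.3, 30.1, 16.8]

# Keep only the warm days, then average them
warm_days = [t for t in temperatures if t > 20]
average_warm = sum(warm_days) / len(warm_days)

if average_warm > 25:
    print("Hot on average")
else:
    print("Warm on average")
```

Even without knowing Python, most readers can follow the steps: filter the list, compute the average, and print a message depending on the result.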

Speed for different numbers of observations

Next, I'll compare how long it takes to create a simulated data set in R and Python. For a fair comparison, the conditions should be as similar as possible. The data is simulated with the package Xy in R and the package XyPy in Python. For timing, I used microbenchmark in R and timeit in Python. To generate the simulation as quickly as possible, the process is parallelized on eight cores (R: parallel, Python: multiprocessing).

For the experiment, a data set with 100 observations and 50 variables is simulated 100 times. The time the computer needs to carry out the simulation is measured individually for each run. This is then repeated for 1,000, 10,000, 100,000 and 1,000,000 observations.

The R and Python code snippets are shown below.
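The timing part of this setup can be sketched in Python as follows. This is a minimal illustration, not the original benchmark code: `simulate_dataset` is a hypothetical stand-in for the XyPy simulation (whose actual API is not shown here), the parallelization across cores is left out for brevity, and the data set sizes are kept small so the script finishes quickly.

```python
import random
import statistics
import timeit


def simulate_dataset(n_obs, n_vars=50, seed=0):
    """Hypothetical stand-in for the XyPy simulation: a table of Gaussian noise."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n_vars)] for _ in range(n_obs)]


def time_simulation(n_obs, repetitions=100):
    """Time each of `repetitions` simulations individually (in seconds)."""
    return timeit.repeat(lambda: simulate_dataset(n_obs), repeat=repetitions, number=1)


def mean_durations(sizes, repetitions=100):
    """Average duration for each data set size."""
    return {n: statistics.mean(time_simulation(n, repetitions)) for n in sizes}


if __name__ == "__main__":
    # The article goes up to 1,000,000 observations; smaller sizes keep this sketch quick
    for n_obs, seconds in mean_durations([100, 1_000], repetitions=10).items():
        print(f"{n_obs:>7} observations: {seconds:.4f} s on average")
```

Passing `number=1` to `timeit.repeat` yields one measurement per simulation, which matches the idea of timing every run individually before averaging.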

The average duration, sorted by data set size, is shown for R and Python in the plot below. The x-axis uses a base-10 logarithmic scale to make the graph easier to read.

While R is slightly faster for data sets with 100 and 1,000 observations, Python clearly outpaces R as the data sets grow larger.

For further comparisons I can recommend the following STATWORX blog posts: pandas vs. data.table and pandas vs. data.table part 2, with the focus on data manipulation.

The standard in deep learning

If you are primarily interested in deep learning methods, Python is the better choice. Most deep learning libraries were developed primarily for Python.

Deep learning is also possible in R, but the R deep learning community is much smaller. Implementations such as Keras and TensorFlow can also be called from R, but this is done using third-party packages. These packages therefore do not offer users full flexibility; for example, not all TensorFlow functions are available. Then there is the aspect of speed: deep learning with Python is faster than with R.

Survey in the community: What makes users tick?

For you as a budding data scientist, Kaggle is an important platform. There you can take part in exciting machine learning competitions, experiment on your own and learn from the experiences of the community.

In 2018, Kaggle carried out a Machine Learning & Data Science survey. The survey was online for two weeks and received a total of 23,859 responses. From the results of this survey, I created various plots that allow some interesting conclusions about the topic of this blog post. The code for the individual plots is publicly available on GitHub.

Excursus: Python & R compared to other languages

Before we dive into R and Python, let's see how the two compare to other programming languages. Each survey participant indicated which language they primarily use. The plot below aggregates these answers by language, and the result is clear: the vast majority of participants mainly use Python, followed by R in second place. This view of the survey does not differentiate between work areas, which is probably why Python, as a general-purpose programming language, stands out so much.

The comparison of R & Python

A direct comparison between R and Python shows that a large number of R users also use Python, whereas Python users often work only with Python.

If you compare the use of the languages by work area, you can see a clear dominance of Python. In every work area except statistics, the majority of participants use Python.

The participants were also asked: What language do you recommend that prospective data scientists learn first? The answers to the question are summarized in the table below.

Language      Recommendations    Users    Difference
Python                 14,181    8,180         6,001
R                       2,342    2,046           296
SQL                       914    1,211          -297
C++                       339      739          -400
Matlab                    256      355           -99
Java                      184      903          -719
Scala                      74      106           -32
Javascript                 72      408          -336
SAS                        69      228          -159
VBA                        38      135           -97
Go                         26       46           -20
Other                     161      117            44

If you compare the number of recommendations with the number of users, you can see that, apart from the catch-all category Other, R and Python are the only languages with a positive difference.

On this question, too, Python (14,181) is again far ahead of R (2,342).
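The difference column can be recomputed directly from the other two columns. The following sketch copies the numbers from the table above; the names `survey` and `differences` are chosen for illustration only.

```python
# (recommendations, users) per language, copied from the table above
survey = {
    "Python": (14181, 8180),
    "R": (2342, 2046),
    "SQL": (914, 1211),
    "C++": (339, 739),
    "Matlab": (256, 355),
    "Java": (184, 903),
    "Scala": (74, 106),
    "Javascript": (72, 408),
    "SAS": (69, 228),
    "VBA": (38, 135),
    "Go": (26, 46),
    "Other": (161, 117),
}


def differences(counts):
    """Recommendations minus users for every language."""
    return {lang: rec - users for lang, (rec, users) in counts.items()}


# Languages that are recommended more often than they are used
positive = [lang for lang, diff in differences(survey).items() if diff > 0]
print(positive)  # ['Python', 'R', 'Other']
```

Besides Python and R, only the catch-all category Other ends up with a positive difference.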

Conclusion

One thing first: both languages are very powerful, so you cannot make a wrong choice. Which language to choose depends on the projects you want to realize.

As a universal programming language, Python is suitable for a wide range of applications, which is why I generally recommend starting with Python. But if statistical evaluations or data visualizations are the focus of your projects, R has an advantage over Python.

As already mentioned, both languages have their advantages and disadvantages. As an advanced data scientist, you should ideally be proficient in both.

I hope that this article will help you find the right entry into the world of data science.

Happy coding!

If you are interested in training, you are welcome to look through our course catalogs for R and Python under STATWORX Academy.

About the author

Fran Peric

Being confronted with challenging riddles is what I like about my job as a data scientist at STATWORX. Let's get creative!
