Mind Your Language
  • 1st August, 2017

The musical Hamilton, currently playing on Broadway, re-enacts the 1804 duel in which the former Secretary of the Treasury, Alexander Hamilton, is fatally wounded by his political rival and Vice President of the United States, Aaron Burr. In the 17th and 18th centuries, duels were mostly fought using swords although pistols were later introduced. Fast forward to the 21st century and the choice of weapon among data scientists is invariably R or Python, both hugely popular programming languages.  

Perhaps the most likely question to arise on a first encounter between data scientists is, “do you use R or Python?”. Depending on your answer, you will be met with either a warm embrace followed by mutual fawning at one another’s good taste, or a scoff and a patronizing retort justifying your choice because of your inferior upbringing. This scenario may seem a bit hyperbolic, but I have had this encounter numerous times myself when the inevitable question about my preferred programming language for data science arises.

Keeping the faith
R and Python are by far the most popular languages used by professional data scientists to wrangle data, build models, create visualizations and communicate results. Many data scientists have a near-religious predilection for one of these two languages as their tool of choice, and will defend the honor of their preferred language to the bitter end. R was released in 1995 and describes itself as a “software environment for statistical computing and graphics”. Python, first released in 1991, is a more general-purpose programming language that emphasizes code readability. Though both have been in existence for over two decades, it is only recently that their popularity has begun to explode. Both routinely feature in rankings of the top 10 programming languages by popularity and often come out on top when languages are ranked by their growth.

For most data scientists, our preference for R or Python is often determined not by reason or by suitability to the tasks we routinely use them for, but by our educational background or previous career experience. Most data scientists working in the wild today moved into their career laterally from backgrounds in science, engineering and programming, rather than having been trained directly as data scientists, so this history is often the principal determinant of whether they prefer R or Python. Those with a background in the natural and biological sciences or statistics will most likely have been trained to use R for analytical tasks and data handling, whereas those who come from the physical sciences, engineering or computer science will be more familiar with Python.

Choose your weapon
When it comes to differences in capabilities, there is a lot of debate about which language is better for which task, but the reality is that you can do just about any task required in a data science workflow in either language. Whether it be data ingestion or building jazzy web applications, you can do it all with both R and Python. However, because of R’s history in statistics and the sciences, and Python’s provenance as a more general-purpose programming language, each language excels in different realms.

R was purpose-built to handle the familiar tabular data structures we encounter in our daily data life, so it really excels at data wrangling and data exploration. Its popularity with statisticians means that the latest implementations of cutting-edge statistical techniques often appear first as packages for R. Python’s popularity with developers has led to the emergence of great frameworks for building web applications, such as Django and Flask. Production-level deployments of data science insights are often best implemented with Python.

The debate over which language is better, R or Python, is futile. They excel in different areas. A data scientist working at all steps of the data science value chain will need to be at least competent in both, and excel in one. The collaborative nature of data science projects means that you will someday soon need to put your prejudices aside and work in the language you are less comfortable with. Better to start learning now and make that process more pleasant.

About Author

Glenn’s PhD research resulted in several published papers and considerable recognition in the field of ecological modelling. He recently worked as a research associate for SAEON (South African Earth Observation Network) collating multiple disparate datasets to form better views of the drivers of ecological change over time. Glenn brings a rigorous statistical background, expertise in cutting-edge modelling and excellent research and data skills.