Brief introduction to programming languages#
Computers execute instructions to process data, and the way we communicate with a computer is through a programming language. Like any other language, programming languages have rules about spelling and grammar. In this course, we will focus on Python and briefly on R, but it’s important to note that there are many other programming languages as well.
Compiled vs Interpreted languages#
Depending on how their code is translated into instructions the computer can run, we can categorize programming languages into two types: compiled and interpreted languages. In a compiled language, the entire source code is first translated (compiled) into machine code and stored in a separate executable file. Examples of compiled languages include Fortran, C, and C++. In an interpreted language, on the other hand, a program called an interpreter translates and executes the code as it goes, roughly statement by statement. Python and R are common examples.
A compiled language comes with a compiler, which converts source files into executable binary files. An interpreted language comes with an interpreter, which reads and executes the source files directly. Translating code while the program runs adds overhead, so languages like Python and R are often slower than compiled languages. This is why many performance-critical programs, such as video games, are written in languages like C or C++, as they need to use computer resources efficiently. Languages such as C also let the programmer manage resources like memory explicitly, which is why programming in them is often described as working at a lower level than in languages like Python, where the interpreter handles much of the resource management for you.
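To make the execution model concrete, here is a minimal sketch, assuming the standard CPython interpreter. The `source` string and the `log` list are illustrative names, not part of the course material. The first two statements run and leave a trace before the third one fails on an undefined name, an error that a C compiler would normally report before producing any executable at all.

```python
# A tiny three-statement program, held in a string so we can run it on demand.
source = """
log.append("statement 1 ran")
log.append("statement 2 ran")
log.append(undefined_name)  # this name does not exist
"""

log = []
error_message = ""
try:
    # exec() runs the statements in order, from top to bottom.
    exec(source, {"log": log})
except NameError as err:
    error_message = str(err)

print(log)            # the first two statements did run
print(error_message)  # the third one failed only when it was reached
```

One caveat: CPython first translates the whole source into bytecode, so a syntax error anywhere in the file is reported before any statement runs. Errors such as an undefined name, however, only surface when the offending statement is actually executed.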
In data science (and science in general), unless your research question requires intensive use of resources, languages like Python and R are perfectly suitable. They allow you to address your research questions easily and rapidly. Moreover, for typical analyses the speed difference is usually negligible on modern computers, in part because many numerical libraries delegate the heavy computation to compiled code. Finally, code in a higher-level language can be more readable, making your applications easier for others to understand.
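As a rough illustration of the speed trade-off, the sketch below adds up the same numbers twice: once with a loop written in Python and once with the built-in `sum`, which is implemented in C in CPython. The function name `python_loop_sum` and the list size are arbitrary choices for this example, and the timings depend on your machine, so no expected values are shown.

```python
import timeit

numbers = list(range(100_000))

def python_loop_sum(values):
    """Add the values one by one in interpreted Python code."""
    total = 0
    for value in values:
        total += value
    return total

# Both approaches give the same answer.
loop_result = python_loop_sum(numbers)
builtin_result = sum(numbers)  # sum() runs in compiled C code inside CPython

# Time 20 repetitions of each approach.
loop_time = timeit.timeit(lambda: python_loop_sum(numbers), number=20)
builtin_time = timeit.timeit(lambda: sum(numbers), number=20)

print(f"Python loop: {loop_time:.4f} s, built-in sum: {builtin_time:.4f} s")
```

On most machines the built-in is noticeably faster, yet both finish in a fraction of a second, which is the practical point: for everyday analyses the difference rarely matters.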
Programming paradigms#
In computer programming, a paradigm refers to a fundamental style or approach to structuring and writing code. It’s essentially a way of thinking about programming problems and their solutions.
Different paradigms offer various methods to organize and manage code, each with its own set of principles, techniques, and conventions. Here are the main programming paradigms:
Procedural Languages: These languages follow a set of instructions or procedures to solve a problem. They are generally easy to understand and write. Examples include C, Fortran, Pascal and Python.
Object-Oriented Languages: These languages are based on the concept of “objects,” which can contain data and code to manipulate that data. They help organize complex programs and promote code reusability. Examples include Java, C++, and Python.
Functional Languages: These languages are based on mathematical functions and avoid changing state and mutating data. They are often used in applications that require high levels of mathematical computation. Examples include Lisp, Haskell, and Scala.
Scripting Languages: These languages are typically used for automating tasks and are usually interpreted rather than compiled. They are commonly used for web development, system administration, and quick prototyping. Examples include JavaScript, Python, Bash, and Ruby.
Logic Programming Languages: These languages are based on formal logic and are primarily used in artificial intelligence and problem-solving applications. Prolog is a well-known example.
As you can see, Python appears in many of these categories. In Python, everything is an object; however, it can be used to write code in multiple styles—procedural, object-oriented, or even functional. Python offers a lot of versatility while being very accessible for beginners.
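To illustrate this versatility, here is a small sketch that computes the same mean in three styles. The data and the names (`temperatures`, `Sample`) are made up for the example.

```python
from functools import reduce

temperatures = [21.5, 19.0, 23.5, 20.0]

# Procedural style: step-by-step instructions that update a variable.
total = 0.0
for value in temperatures:
    total += value
mean_procedural = total / len(temperatures)

# Object-oriented style: the data and the code that works on it live together.
class Sample:
    def __init__(self, values):
        self.values = list(values)

    def mean(self):
        return sum(self.values) / len(self.values)

mean_object_oriented = Sample(temperatures).mean()

# Functional style: combine values with functions, without reassigning state.
mean_functional = reduce(lambda acc, value: acc + value, temperatures, 0.0) / len(temperatures)

# "Everything is an object": numbers, functions, and classes all qualify.
all_objects = all(isinstance(thing, object) for thing in (1, len, Sample))

print(mean_procedural, mean_object_oriented, mean_functional, all_objects)
```

All three approaches give the same result. Which one to use is mostly a matter of which makes the problem at hand easiest to read and maintain.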
Programming and Data Science#
R and Python, along with SQL (which is not covered in this course), are the most widely used programming languages for Data Science. Here are some statistics from the 2022 Kaggle survey (https://www.kaggle.com/competitions/kaggle-survey-2022):
# Import libraries to read data and visualize
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Read survey data
survey_data = pd.read_csv("../../data/kaggle_survey_2022_responses.csv", low_memory=False)
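# Count number of responses per category (programming language).
# Each Q12 column holds one language; the first row is skipped because it
# contains the question text rather than a respondent's answer.
languages_dat = survey_data.filter(regex="Q12").apply(lambda x: x.iloc[1:].value_counts()).fillna(0).sum(axis=1)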
# Get number of responses (excluding the first row, which holds the question text)
n_obs = survey_data.shape[0] - 1
# Arrange the data for plotting
languages_dat = pd.DataFrame({"language": languages_dat.index, "use": 100*languages_dat/n_obs})
languages_dat = languages_dat.reset_index(drop=True)
# Plot results
sns.barplot(data=languages_dat, y="language", x="use", orient="h")
plt.tick_params(labelsize=15)
plt.xlabel("Use (%)", size=20)
plt.ylabel("")
plt.title("Programming & Data Science", size=25)
sns.despine()
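The counting step in the cell above packs several operations into one line. The sketch below unpacks it on a tiny made-up table, so it runs without the survey file. The column names mimic the `Q12_*` pattern, row 0 mimics a question-text row, and all the answers are invented. Note that the denominator here excludes that first row.

```python
import pandas as pd

# Made-up responses: row 0 mimics a question-text row, rows 1-4 are respondents.
toy = pd.DataFrame({
    "Q12_1": ["Language used - Python", "Python", "Python", None, "Python"],
    "Q12_2": ["Language used - R", None, "R", None, None],
    "Q5": ["Are you a student?", "Yes", "No", "No", "Yes"],
})

counts = (
    toy.filter(regex="Q12")                           # keep only the Q12_* columns
    .apply(lambda col: col.iloc[1:].value_counts())   # count answers per column, skipping row 0
    .fillna(0)                                        # a language absent from a column counts as 0
    .sum(axis=1)                                      # add up the counts across columns
)

n_respondents = toy.shape[0] - 1  # exclude the question-text row
percent = 100 * counts / n_respondents
print(percent)
```

In this toy table, three of the four respondents selected Python and one selected R, so the percentages are 75 and 25. Because a respondent can select several languages, the percentages do not need to add up to 100.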
As you can see, Python is used more widely than R among the survey respondents. Why is that?
Python is a multi-paradigm, multi-purpose language, meaning it can be used for a wide variety of applications beyond data analysis, such as web development, automation, and software development. In contrast, R was specifically designed for statistical analysis, so its use and development have predominantly focused on this area.
While both languages may have a similar learning curve for beginners, those with previous programming experience may find Python more intuitive than R. Python also has a more “readable” syntax, as well as greater expressiveness and conciseness.
Although both languages are widely used in academia, Python’s versatility and better suitability for production make it the first choice for data science applications in industry.
In line with the previous point, machine learning is easier in Python, which also has many more resources for this type of data analysis. For example, scikit-learn (https://scikit-learn.org/stable/) for general machine learning, and TensorFlow (https://www.tensorflow.org/) and PyTorch (https://pytorch.org/) for deep learning, are among the most widely used libraries. All of them are used primarily through Python, even though much of their performance-critical code is written in compiled languages such as C++.
Some people argue that Python is more powerful than R for data cleaning, due to its more diverse data structures and superior implementation of regular expressions.
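As a small illustration of this kind of data cleaning, the sketch below uses Python's built-in `re` module and a dictionary to tidy up and tally some free-text answers. The answers and the helper `clean_language` are invented for this example.

```python
import re

# Made-up free-text answers with inconsistent spacing, case, and punctuation.
raw_answers = ["  python 3 ", "Python!!", "R (mostly)", "PYTHON", "r"]

def clean_language(answer):
    """Normalise one answer to a lowercase language name."""
    lowered = answer.strip().lower()
    # Keep the leading run of letters and drop everything after it.
    match = re.match(r"[a-z]+", lowered)
    return match.group(0) if match else ""

cleaned = [clean_language(answer) for answer in raw_answers]

# A dictionary, one of Python's built-in data structures, holds the tallies.
tally = {}
for name in cleaned:
    tally[name] = tally.get(name, 0) + 1

print(tally)  # {'python': 3, 'r': 2}
```

Real survey data is messier than this, so the pattern would need adapting, but the same two ingredients (a regular expression and a suitable data structure) cover a lot of routine cleaning work.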
To be fair, R is also a great language, and I use it almost every day. Here are some features where I believe R excels over Python:
If you’re conducting statistical analysis, R is the better choice. As it was specifically designed for this purpose, there has been extensive method development in this area. While Python has Statsmodels (https://www.statsmodels.org), which works well for most typical statistical analyses, R remains superior in this domain.
Visualization with the ggplot2 package (https://ggplot2.tidyverse.org) is a delight in R. You can create highly appealing and professional-looking plots. In fact, many respected publications, like The New York Times, include graphics generated in R. Python also has excellent visualization packages. For example, if you’re familiar with MATLAB, using Matplotlib (http://matplotlib.org) in Python will feel very natural. Seaborn (https://seaborn.pydata.org) is another great Pythonic alternative for visualization, leveraging the power of data frames.
Data exploration in R using dplyr (https://dplyr.tidyverse.org) is often easier than in Python using pandas (https://pandas.pydata.org), particularly because dplyr’s syntax is more readable and, for me, easier to remember.
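To give a feel for the comparison, here is a sketch of the same filter, group, and summarise task in pandas, with a roughly equivalent dplyr pipeline shown in a comment. The data frame and its column names are made up for illustration.

```python
import pandas as pd

# Made-up data for illustration only.
df = pd.DataFrame({
    "language": ["Python", "Python", "R", "R", "R"],
    "years_experience": [2, 6, 1, 4, 7],
})

# Roughly equivalent dplyr pipeline, shown for comparison:
#   df %>%
#     filter(years_experience > 1) %>%
#     group_by(language) %>%
#     summarise(mean_years = mean(years_experience))
summary = (
    df.query("years_experience > 1")                 # keep rows with more than 1 year
    .groupby("language", as_index=False)             # one group per language
    .agg(mean_years=("years_experience", "mean"))    # average within each group
)
print(summary)
```

Chaining methods inside parentheses, as done here, is one way to get close to the readability of a dplyr pipeline in pandas.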