git in Data Science

Source Control Basics: How to set up, configure, and work with git and GitHub.


What is Source Control?

Source control, also known as “version control,” is the practice of tracking changes to electronic files, and the versions those changes produce, over time. The files might include code, data, images, and more. Good source control tools let users see the entire history of changes to any specific file, or the evolution of an entire project. git is currently the most popular source control tool and one of the most feature-rich.

What is git?

Git is a distributed version control system that tracks changes in any set of computer files, usually used for coordinating work among programmers who are collaboratively developing source code during software development. Its goals include speed, data integrity, and support for distributed, non-linear workflows. (Source: Wikipedia)

What is GitHub?

GitHub is a developer platform that allows developers to create, store, manage and share their code. It uses Git software, providing the distributed version control of Git plus access control, bug tracking, software feature requests, task management, continuous integration, and wikis for every project. It currently hosts work by approximately 100M developers. (Source: Wikipedia)

Source Control in Data Science

Data aggregation, cleaning, pipelines, and ML models all rely on software in order to operate. Responsible software management depends on well-managed code, on versioning, and on prioritizing bugs, features, and user issues. Further, modern platforms and infrastructure tend to favor code-driven tests, builds, deployments, and management. Code can therefore define every layer of the work shared across teams of engineers and data scientists.

Which is to say: code is fundamental to our work, and it would be risky, inefficient, and impractical not to use source control.

Contents