Version Control System
Git and co.
The Origin of Git
Until 2005, the Linux kernel project, one of the largest open-source projects in the world, used a proprietary, distributed version control system (VCS) called BitKeeper. Then the kernel developers' license to use it free of charge was revoked. This created an acute problem: a new VCS was needed that could meet the project's extreme requirements:
- Distributed: Thousands of developers worldwide had to be able to collaborate efficiently.
- Performant: Operations like branching and merging had to be extremely fast.
- Secure: The integrity of the vast code repository had to be guaranteed at all times.
Since no existing solution met these criteria, Linus Torvalds, the initiator of Linux, took matters into his own hands.
Within a few weeks, Linus Torvalds developed the core of Git. His goal was not to create a user-friendly system, but an extremely fast and robust foundation. The first version was minimalistic, consisting of simple command-line tools that already implemented the core principles of Git.
Linus Torvalds' main interest remained the Linux kernel. After laying the foundation for Git, he handed over the project in July 2005 to Junio C Hamano, one of the earliest and most important contributors.
Under Hamano's leadership, Git became what we know today.
Git's real breakthrough with the general public came with the rise of code-hosting platforms, also known as "forges."
These platforms extend pure version control with crucial collaboration features:
- GitHub (2008): Made Git accessible through a graphical interface and popularized the "Pull Request" workflow, which is now the standard for open-source collaboration.
- GitLab (2011): Positioned itself as a "complete DevOps platform" and, in addition to code hosting, offers integrated CI/CD pipelines, issue tracking, and more. GitLab is very popular both as a SaaS and as a self-hosted solution.
- Gitea, Bitbucket, etc.: There are many other players. Gitea is a popular, lightweight self-hosted alternative to GitHub.
More than just Git
Both before and after Git, there have been many other VCSs.
- Subversion (SVN, 2000): A centralized VCS that versions directories as well as files, developed as a successor to CVS. It was originally created by CollabNet and is now maintained by the Apache Software Foundation. It is rarely used for new projects anymore.
- Perforce Helix Core (1995): A commercial, centralized VCS distinguished by high performance in large mono-repositories and fine-grained access control. It is widely used for video games.
- Mercurial (2005): A distributed VCS written in Python, emphasizing simplicity, speed, and consistency of the command-line interface. It is rarely used anymore, except at Meta.
- Fossil (2006): An integrated system implemented in C by D. Richard Hipp (author of SQLite) with a built-in bug tracker, wiki, and web interface in a single program. The entire repository is stored in a single SQLite database. It is rarely used, except for SQLite.
- Pijul (2014): An experimental, distributed VCS in Rust, based on the theory of patches, aiming for simpler merges and better formal correctness. It is rarely used.
- Jujutsu (2021): A distributed, Git-compatible VCS in Rust, initiated by Martin von Zweigbergk (Google), with a focus on more intuitive history editing and advanced merge strategies. It is actively used in some Git repositories. However, exact user numbers are difficult to determine due to its Git compatibility.
Changes
Many believe that Git only stores the changes from one commit to the next. Almost no VCS does this because it is inefficient.
Most VCSs store snapshots. A snapshot is a list of all files contained in a commit. A file in Git has a name (path), an executable flag, and content. Conceptually, every commit therefore records all files, not just those that have changed. Furthermore, it does not matter how much a file has changed; a changed file is saved completely anew.
However, there is an important optimization: the same file content very often appears more than once. If a file does not change in a commit, its content does not need to be saved a second time; the new snapshot simply refers to the content that is already stored. Likewise, if two files have the same content, it only needs to be stored once.
This can be compared to PNPM, which stores NPM packages centrally and references them via symlinks in the node_modules directory instead of copying them (among other things, to save disk space).
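The snapshot-plus-deduplication idea can be sketched in a few lines of Python. This is only a toy model under simplifying assumptions: the names ObjectStore and make_snapshot are made up for this illustration, and real Git hashes a small header together with the content and organizes paths in tree objects rather than in one flat dictionary.

```python
import hashlib


class ObjectStore:
    """Toy content-addressed store (hypothetical, not Git's real format)."""

    def __init__(self):
        self.blobs = {}  # content hash -> content bytes

    def put(self, content: bytes) -> str:
        # The key is derived from the content alone, so identical
        # content always maps to the same entry and is stored once.
        key = hashlib.sha256(content).hexdigest()
        self.blobs.setdefault(key, content)
        return key


def make_snapshot(store, files):
    """Build a snapshot: path -> (executable flag, content hash)."""
    return {
        path: (executable, store.put(content))
        for path, (executable, content) in files.items()
    }


store = ObjectStore()

commit1 = make_snapshot(store, {
    "README": (False, b"hello\n"),
    "run.sh": (True, b"echo hi\n"),
})

commit2 = make_snapshot(store, {
    "README": (False, b"hello\n"),      # unchanged since commit1
    "run.sh": (True, b"echo hello\n"),  # changed, stored again in full
    "COPY": (False, b"hello\n"),        # same content as README
})

# Each snapshot lists every file, but only distinct contents are stored.
print(len(commit1), len(commit2), len(store.blobs))
```

Both snapshots list all of their files, yet the store only holds three distinct contents: the unchanged README and the identical COPY share one entry.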
Merge
When two commits are merged, the VCS must perform a three-way merge. In this process, the commit history is treated as a DAG (Directed Acyclic Graph). In a DAG, it is easy to find the LCA (Lowest Common Ancestor). This is the commit that is an ancestor of both commits to be merged and lies deepest in the DAG, meaning it is furthest from the initial commit.
This sounds complicated. However, represented graphically, it looks quite simple:
When commits B and C are to be merged, a common base is needed against which the changes from both commits can be compared. This common base is the most recent commit that is an ancestor of both B and C (the LCA).
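How such a common base can be found is sketched below in Python. The history is a made-up example stored as a plain dictionary from each commit to its parents, and the function names ancestors and merge_bases are assumptions for this illustration, not part of any Git API. The sketch favors readability over speed and returns a set, because a history with criss-cross merges can have more than one lowest common ancestor.

```python
def ancestors(parents, commit):
    """Return the commit itself plus everything reachable via parent links."""
    seen = set()
    stack = [commit]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def merge_bases(parents, left, right):
    """Return the common ancestors that are not an ancestor of
    any other common ancestor (the lowest common ancestors)."""
    common = ancestors(parents, left) & ancestors(parents, right)
    return {
        candidate
        for candidate in common
        if not any(
            candidate != other and candidate in ancestors(parents, other)
            for other in common
        )
    }


# Hypothetical history: R is the initial commit, A follows it,
# then B and X branch off from A, and C follows X.
parents = {
    "R": [],
    "A": ["R"],
    "B": ["A"],
    "X": ["A"],
    "C": ["X"],
}

print(merge_bases(parents, "B", "C"))  # {'A'}
```

R is also a common ancestor of B and C, but it is dropped because it is itself an ancestor of A. A is therefore the base against which the changes in B and C are compared.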