Git has historically had many components implemented in the form of shell scripts. This was less than ideal for several reasons:
The goal of this project is to complete the conversion of the remaining parts of
git submodule to C, namely, the
update commands. If possible, I intend to even get rid of the shell script called
git-submodule.sh entirely, which currently calls
submodule--helper to perform most of the business logic, and instead make submodule a proper builtin in pure C.
My project can be broadly divided into three components:
git-submodule.sh, which in turn call some C helper commands that do the real work. My goal is to finally remove the shell script middleman, and simplify the architecture.
Here are the statuses of each component.
seen, but not merged. Links to patches:
From “What’s Cooking in git.git”, 16 Aug 2021:
* ar/submodule-add-config (2021-08-10) 1 commit - submodule--helper: introduce add-config subcommand (this branch is used by ar/submodule-add-more.) Large part of "git submodule add" gets rewritten in C. [...] * ar/submodule-add-more (2021-08-10) 10 commits - submodule--helper: rename compute_submodule_clone_url() - submodule--helper: remove resolve-relative-url subcommand - submodule--helper: remove add-config subcommand - submodule--helper: remove add-clone subcommand - submodule--helper: convert the bulk of cmd_add() to C - dir: libify and export helper functions from clone.c - submodule--helper: remove repeated code in sync_submodule() - submodule--helper: refactor resolve_relative_url() helper - submodule--helper: add options for compute_submodule_clone_url() - Merge branch 'ar/submodule-add-config' into ar/submodule-add (this branch uses ar/submodule-add-config.) More parts of "git submoudle add" has been rewritten in C.
seen, reviews still ongoing. Link to patch:
From “What’s Cooking in git.git”, 16 Aug 2021:
* ar/submodule-run-update-procedure (2021-08-13) 1 commit - submodule--helper: run update procedures from C Reimplementation of parts of "git submodule" in C continues.
git submodulecommand to C builtin
statushas been converted to a pure C builtin. The source code can be found here.
There were other tiny contributions that I made to Git as well.
I will still carry on with the reviews for the patches that are in flight. I will also continue sending the many patches that I am still holding on to. Let’s hope we can finish off the submodule conversion effort that has been going on for over five years now!
Here are some things I found challenging while working on this project.
This was a large overarching theme across all the three parts of my work. Projects like Git do not operate by developers merely churning out a boatload of code. There needs to be structure to the changes that are sent, so that they can be easily reviewed, and held to high standards. This would be hard to do if I had done the conversion in one single run and sent a giant series to the mailing list.
Given the complexity of the existing submodule code, it was not trivial to break up the changes into convenient bite-sized pieces. There was always a tension between “is this series too big?” and “is breaking this change into multiple series making it to complicated to follow the changes?”
I would not be surprised if more than half the times I was asked to modify my changes from mentors and listfolk were because of reasons related to how I structured my patches. This taught me how effective communication makes software scale—your changes should tell a story that’s easy to follow, so that the code can easily be picked up by others by a mere examination of its commit and list history.
Since my project was converting a lot of shell code, it was not always easy to find an equivalent in the C API of Git, especially for all those
rev-list calls that do a lot of git-fu to retrieve commit information. There was always an escape hatch—we could fork a process and run the shell invocation. But this was not ideal, and I tried to avoid it as much as possible.
submodule update command has the
--recursive flag, which updates nested modules recursively down all paths. So roughly speaking, the implementation did this by running the update for the current worktree root, and then running update again, but by switching the root path to inside a submodule in a recursive process of
I quite like the elegance of recursive functions, but I can’t say the same for recursive forked processes. The shell version was still smooth because the script had its own setup, and the way of handling the environment that was not too problematic. But translating this to C introduced a bunch of problems, like certain environment variables not being updated properly. I had to apply a small band-aid fix that made it work nicely for me. It was an overall win, because it led to me discovering some much needed refactoring for some repeated code, and it also helped me save a subprocess spawn.
Ideally we should not be forking processes recursively at all, because of how expensive and finnicky they are. Unfortunately, the Git submodule API is not quite there yet to make a more elegant solution happen. A lot of the configuration functions still operate only on the global
the_repository objects, which makes recursive submodule operations not work correctly if done in the same process, as it will mess with the state of the root repository state. But we are getting there. The pieces are present, it just needs to be assembled. Maybe this could be a good idea for a future GSoC/Outreachy project?
This project confirmed a belief that I held. Learning isn’t sitting in class. Doing is learning. I joined this project as an impostor who knew nothing about anything. But now I know a lot more things, all forced by the requirements of my projects. Here’s the brief: