clarification on git, central repositories and commit access lists_ Linus Torvalds
Re: clarification on git, central repositories and commit access lists[Posted August 22, 2007 by jake]
From: Linus Torvalds
To: Adam Treat
Subject: Re: clarification on git, central repositories and commit access lists
Date: Mon, 20 Aug 2007 11:41:05 -0700 (PDT)
Cc: kde-core-devel-AT-kde.org

On Sun, 19 Aug 2007, Adam Treat wrote:
> I just watched your talk on git and wanted to ask for clarification on a
> few points. Many of us in the KDE community are interested in git and
> some even contemplate using git as the official SCM tool in the future.
As you are probably aware, some people have tried to import the whole KDE history into git. Quite frankly, the way git works (tracking whole trees at a time, never single files), that ends up being very painful, because it's an "all or nothing" approach.
So I'm hoping that if you guys are seriously considering git, you'd also split up the KDE repository so that it's not one single huge one, but with multiple smaller repositories (ie kdelibs might be one, and each major app would be its own), and then using the git "submodule" support to tie it all together.
> However, I think a few issues have been confused and want to see if you
> can clarify.
>
> Your talk focused heavily on the evils of a central repository versus
> the benefits of a distributed model. However, I wonder if what you
> actually find distasteful is not a central repository per se, but rather
> designing an SCM that relies upon communication with a central
> repository to do branching/merging or offline development.
I certainly agree that almost any project will want a "central" repository in the sense that you want to have one canonical default source base that people think of as the "primary" source base.
But that should not be a technical distinction, it should be a social one, if you see what I mean. The reason? Quite often, certain groups would know that there is a primary archive, but for various reasons would want to ignore that knowledge: the reasons can be any of
For an example of "release management", think of multiple different vendors. They would probably always start with your "central" release tree (which in turn may well be different from your central development tree!), but vendors invariably have their own timetables and customer issues, so they usually need to make decisions that may not even make sense for the "official" tree.
An example of this in the kernel is how my tree is the central development tree, then we have the "stable" tree (which is a separate thing, maintained totally separately, but obviously based on my releases), and then each vendor tends to have their own "release trees". They are all different, they all have different policies and reasons for existence, and they are all "central" depending on who looks at them.
Both of those are horrible mistakes: the "globally visible" part means that if you're not sure this makes sense, you're much less likely to begin a branch - even if it's cheap, it's still something that everybody else will see, and as such you can't really do "throwaway" development that way. And let's face it, many cool ideas turn out to be totally idiotic, but it might take a long time until it's obvious that it was a bad idea.
So you absolutely need private branches, that can become "central" for the people involved in some re-architecting, even if they never ever show up in the "truly central" repository. That's a huge deal for development.
The other problem is the "permission from maintainers" thing: I have an ego the size of a small planet, but I'm not always right, and in that kind of situation it would be a total disaster if everybody had to ask for my permission to create a branch to do some re-architecting work.
The fact that anybody can create a branch without me having to know about it or care about it is a big issue to me: I think it keeps me honest. Basically, the fundamental tool we use for the kernel makes sure that if I'm not doing a good job, anybody else can show people that they do a better job, and nobody is really "inconvenienced".
Compare that to some centralized model, and something like the gcc/egcs fork: the centralized model made the fork so painful that it became a huge political fight, instead of just becoming an issue of "we can do this better"!
There are other reasons for having a social network that tends to have one or two fairly central nodes, but not having a technical limitation that enforces that. But the above are the two biggest and most important reasons, I think.
> After all, your repository acts as a de-facto central repository of the
> linux kernel in as much as everyone pulls from it. Without such a
> central place to pull the linux kernel would not exist, rather what
> you'd have is a bunch of forks which perhaps merge with each other from
> time to time.
Well, I do want to make it clear that we do have such forks that pull from each other too. So the kernel actually does use the technology, it's just that you have to be involved in the particular subprojects to even know or care about it!
So it's not strictly true that there is a single "central" one, even if you ignore the stable tree (or the vendor trees). There are subsystems that end up working with each other even before they hit the central tree.
To put this in a KDE perspective: it would make tons and tons of sense to have one central place (kde.org) that most developers know about, and where they would fetch their sources from. But for various reasons (and security is one of them), that may not be the main place where most "core developers" really work. You would generally want to have separate places that are secure, and those separate places may be different for different developer groups.
For a kernel example: the "public" git tree is on the public kernel.org servers (including "git.kernel.org"), but that is actually not a machine that any developers really ever push to directly.
Many kernel developers use other kernel.org machines (because we have the infrastructure), but others will use their own setups entirely, because they might have issues like bandwidth (ie kernel.org may be reasonably well connected, but while it has mirrors elsewhere, the main machines are in the US, so some European developers prefer to just use servers that are closer).
So if you look at my merge messages, for example, you'll see things like merges from lm-sensors.org, git.kernel.dk, ftp.linux-mips.org, oss.sgi.com etc etc. The point being that yes, there is a central place that people know about, but at the same time, much of the development really happens outside that central place!
> For any software project to exist as opposed to a bunch of forks I think
> you have to have a central repository from which everyone pulls, no?
> Of course many branches might exist, but those branches must pull from a
> central repository if they want to share at least some common code.
Practically speaking, you'd generally have one or a few central repositories, yes. But no, it really doesn't have to be a single one. And I'm not just talking about mirroring (which is really easy with a distributed setup), I'm literally talking about things like some people wanting to use the "stable" tree, and not my tree at all, or the vendor trees.
And they are obviously connected, but it doesn't have to be a totally central notion at all.
Think of the git trees as people: some people are more "central" than others, but in the end, the kernel is actually fairly unusual (at least for a big project) in having just one person that is so much in the "center" that everybody knows about him.
In most other projects, you literally would have different groups that handle different parts. In the KDE group, for example, there really is no reason why the people who work on one particular application should ever use the same "central" repository as the people who work on another app do.
You'd have a separate group (that probably also maintains some central part like the kdelibs stuff) that might be in charge of integrating it all, and that integration/core group might be seen to outsiders as the "one central repository", but to the actual application developers, that may actually be pretty secondary, and as with the kernel, they may maintain their own trees at places like ftp.linux-mips.org - and then just ask the core people to pull from them when they are reasonably ready.
See? There's really no more "one central place" any more. To the casual observer, it looks like one central place (since casual users would always go for the core/integration tree), but the developers themselves would know better. If you wanted to develop some bleeding edge koffice stuff, you'd use that tree - and it might not have been merged into the core tree yet, because it might be really buggy at the moment!
This is one of the big advantages of true distribution: you can have that kind of "central" tree that does integration, but it doesn't actually have to integrate the development "as it happens". In fact, it really really shouldn't. If you look at my merges, for example, when I merge big changes from somebody else who actually maintains them in a git tree, they will have often been done much earlier, and be a series of changes, and I only merge when they are "ready".
So the core/central people should generally not necessarily even do any real development at all: the tree that people see as the "one tree" is really mostly just an integration thing. When the koffice/kdelibs/whatever people decide that they are ready and stable, they can tell the integration group to pull their changes. There's obviously going to be overlap between developers/integrators (hopefully a lot of overlap), but it doesn't have to be that way (for example, I personally do almost only integration, and very little serious development).
> A central repository is also necessary for projects like KDE to enable
> things like buildbots and commit mailing lists.
Yes, you want a central build-bot and commit mailing list. But you don't necessarily want just one central build-bot and commit mailing list.
There's absolutely no reason why everybody would be interested in some random part of the tree (say, kwin), and there's no reason why the people who really only do kwin stuff should have to listen to everybody else's work. They may well want to have their own build-bot and commit mailing list!
So making one central one is certainly not a mistake, but making only a central one is. Why shouldn't the groups that do specialized work have specialized test-farms? The kernel does. The NFS stuff, for example, tends to have its own test infrastructure.
Also, it's a mistake to think that one site has to do everything. That's not what we do in the kernel, for example. Yes, we have kernel.org, and it's reasonably central, but that doesn't mean that everything has to, or even should, happen within that organization.
So we've had people do build-bots and performance regressions, and specialized testing outside of kernel.org. For example, Intel and others have done things like performance regression testing that required specialized hardware and software (eg TPC-C performance numbers).
So we do commit mailing lists from kernel.org, but (a) that doesn't mean that everything else should be done from that central site and (b) it also doesn't mean that subprojects shouldn't do their own commit mailing lists. In fact, there's a "gitstat" project (which tracks the kernel, but it's designed to be available for any git project), and you can see an example of it in action at
(or get the source code from sourceforge), and the point is that all of this was done entirely outside the kernel.org framework.
So centralized is not at all always good. Quite the reverse: having distributed services allows specialized services, and it also allows the above kind of experimental stuff that does some (fairly simple, but maybe it will expand) data-mining on the project!
> These tools are important to the way we work and provide for many eyes
> constantly reviewing changes to the codebase as well as regular
> regression testing across diverse platforms. In the future, whether git
> or svn, I see no advantages in getting rid of a central repository from
> which everyone pulls. I wonder whether you really disagree.
So I do disagree, but only in the sense that there's a big difference between "a central place that people can go to" and "ONLY ONE central place".
See? Distribution doesn't mean that you cannot have central places - but it means that you can have different central places for different things. You'd generally have one central place for "default" things (kde.org), but other central places for more specific or specialized services!
And whether it's specialized by project, or by things like the above "special statistics" kind of thing, or by usage, is another matter! For example, maybe you have kde.org as the "default central place", but then some subgroup that specializes in mobility and small-memory-footprint issues might use something like kde.mobile.org as their central site, and then developers would occasionally merge stuff (hopefully both ways!)
> In your talk you also focus on the evils of commit access lists,
> comparing and contrasting with the web of trust the kernel uses where
> you have no commit access lists at all. However, isn't the kernel model
> just a special case? The linux kernel has a de-facto commit access list
> of one: you.
No, really. It doesn't. It's the one you see from the outside, but the fact is, different sub-parts of the kernel really do use their own trees, and their own mailing lists. You, as a KDE developer, would generally never care about it, so you only see the main one.
> This might work well for the kernel, but I fail to see how this really
> reduces politics. Many are still constantly pushing and arguing to
> merge their branches upstream into your repository. Would having a
> central repository where you and all your trusted lieutenants push their
> changes really be very different?
Yes it would be. You only see the end result now. You don't see how those lieutenants have their own development trees, and while the kernel is fairly modular (so the different development trees seldom have to interact with each other), they do interact. We've had the SCSI development tree interact with the "block layer" development tree, and all you ever see is the end result in my tree, but the fact is, the development happened entirely outside my tree.
The networking parts, for example, merge the crypto changes, and I then merge the end result of the crypto and network changes.
Or take the powerpc people: they actually merge their basic architecture stuff to me, but their network driver stuff goes through Jeff Garzik - and you as a user never even realize that there was another "central" tree for network driver development, because you would never use it unless you had reported a bug to Jeff, and Jeff might have sent you a patch for it, or alternatively he might have asked if you were a git user, and if so, please pull from his 'e1000e' branch.
For an example of this, go to
and look at all the projects there. There are lots of kernel subprojects that are used by developers - exactly so that if you report a bug against a particular driver or subsystem, the developer can tell you to test an experimental branch that may fix it.
> The KDE community has a very large commit access list and it is quite
> easy to join. Having a central git repository with a large set of
> committers would seem to map well with our community. I fail to see any
> harm in this model. The web of trust would still exist, it would just
> be much larger and more inclusive than the model the kernel uses. I
> wonder if you disagree.
Hey, you can use your old model if you want to. git doesn't force you to change. But trust me, once you start noticing how different groups can have their own experimental branches, and can ask people to test stuff that isn't ready for mainline yet, you'll see what the big deal is all about.
Centralized works. It's just inferior.
> Another sticking point is the performance implications of a git
> repository managing something the size of the KDE project. I understand
> the straightforward solution: just define content boundaries with a
> separate git repo for each submodule: kdelibs.git, kdebase.git,
> kdesupport.git, etc, etc. And then have a super git repo with hooks
> that point to these submodules. However, I think this leads to a few
> problems.
>
> What if I want to make a commit to kdelibs that will require changes in
> other modules for them to compile. I will no longer be able to make a
> single atomic commit with changes to multiple submodules, right?
Sure you will. It's hierarchical, though.
What happens is that you do a single commit in each submodule that is atomic to that private copy of that submodule (and nobody will ever see it on its own, since you'd not push it out), and then in the supermodule you make another commit that updates the supermodule to all the changes in each submodule.
See? It's totally atomic. Anybody that updates from the supermodule will get one supermodule commit, and when that in turn fetches all the submodule changes, you never have any inconsistent state.
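A toy model of that hierarchy, as a sketch only: the `Repo` and `Supermodule` classes and the `commit_id` helper below are hypothetical names, not git's API, and the hashing is a stand-in for real commit ids. It only illustrates the idea that one supermodule commit records an exact commit id for every submodule, so anyone reading the supermodule sees either the old set or the new set, never a mix.

```python
import hashlib
from typing import Dict, List, Tuple


def commit_id(payload: str) -> str:
    """Toy content hash standing in for a git commit id (hypothetical helper)."""
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()[:12]


class Repo:
    """Hypothetical, minimal repository: an append-only list of commits."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.commits: List[Tuple[str, str]] = []  # (id, description)

    def commit(self, description: str) -> str:
        # Chain each commit to its parent so ids depend on the whole history.
        parent = self.commits[-1][0] if self.commits else ""
        cid = commit_id(f"{self.name}:{parent}:{description}")
        self.commits.append((cid, description))
        return cid

    def head(self) -> str:
        return self.commits[-1][0]


class Supermodule:
    """Each supermodule commit pins one exact commit id per submodule."""

    def __init__(self, submodules: List[Repo]) -> None:
        self.submodules = {repo.name: repo for repo in submodules}
        self.history: List[Dict[str, str]] = []

    def commit_current_heads(self) -> Dict[str, str]:
        # One supermodule commit records all submodule heads together.
        pins = {name: repo.head() for name, repo in self.submodules.items()}
        self.history.append(pins)
        return dict(pins)

    def checkout(self, index: int) -> Dict[str, str]:
        # Return a copy so callers cannot alter recorded history.
        return dict(self.history[index])


if __name__ == "__main__":
    kdelibs, kdebase = Repo("kdelibs"), Repo("kdebase")
    kdelibs.commit("initial API")
    kdebase.commit("use initial API")
    superrepo = Supermodule([kdelibs, kdebase])
    before = superrepo.commit_current_heads()

    # A cross-module change: one commit per submodule, then one supermodule commit.
    kdelibs.commit("change API")
    kdebase.commit("adapt to changed API")
    after = superrepo.commit_current_heads()

    print("before:", before)
    print("after: ", after)
```

The design point is that consistency comes from the supermodule commit being the only thing others fetch first; the per-submodule commits are reached through it.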
> Also, won't we lose history when moving files/content between
> submodules?
Yes. If you move stuff between repositories, you do lose history (or rather, it breaks it as far as git is concerned - you still obviously have both pieces of history, but to see it, you'd have to manually go and look).
The point of submodules is that they are totally independent entities in their own right, so that you can develop on a submodule without having to even know about or care about the supermodule.
Git actually does perform fairly well even for huge repositories (I fixed a few nasty problems with 100,000+ file repos just a week ago), so if you absolutely have to, you can consider the KDE repos to be just one single git repository, but that unquestionably will perform worse for some things (notably, "git annotate/blame" and friends).
But what's probably worse, a single large repository will force everybody to always download the whole thing. That does not necessarily mean the whole history - git does support the notion of "shallow clones" that just download part of the history - but since git at a very fundamental level tracks the whole tree, it forces you to download the whole "width" of the tree, and you cannot say "I want just the kdelibs part".
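For reference, a shallow clone is requested with the `--depth` option of `git clone`. The sketch below just builds that command line from Python; the URL and directory name are made-up placeholders, and `shallow_clone_command` and `run` are hypothetical helpers, not part of any git library.

```python
import subprocess
from typing import List


def shallow_clone_command(url: str, depth: int, dest: str) -> List[str]:
    """Build the argv for a history-limited ("shallow") clone."""
    if depth < 1:
        raise ValueError("depth must be at least 1")
    # --depth truncates history to the most recent <depth> commits.
    # By itself it does not narrow the tree: each fetched commit still
    # refers to the full directory tree of the repository.
    return ["git", "clone", "--depth", str(depth), url, dest]


def run(argv: List[str]) -> None:
    """Run the command, raising CalledProcessError if git fails."""
    subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Placeholder URL and destination, used only for illustration.
    argv = shallow_clone_command("git://example.org/kde.git", 1, "kde-shallow")
    print(" ".join(argv))  # call run(argv) instead to actually clone
```

Note what is missing from the argument list: there is no path argument that would limit the clone to something like the kdelibs directory, which is the "width" limitation described above.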
> And how will we break up the existing history between all of these
> submodules?
There's a few options for that.
One is to just import the SVN history per directory in the first place, but that makes it hard to then tie the history together in the supermodule.
The better approach is probably to import the whole thing (which will require a rather beefy machine), and then split it up from within git. There are various tools on the git side to basically rewrite the history in other formats, including splitting up a bigger repository (google for "git-split", for example).
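To make "splitting it up" concrete, here is a toy illustration in Python of what a per-directory split does to a history. It is not how git-split or any git tool is implemented; the `Commit` tuple and `split_history` function are invented for this sketch, and a commit is reduced to a message plus the paths it touched.

```python
from collections import namedtuple
from typing import List

# Hypothetical, simplified commit: a message and the file paths it touched,
# relative to the root of the big repository.
Commit = namedtuple("Commit", ["message", "paths"])


def split_history(history: List[Commit], subdir: str) -> List[Commit]:
    """Return the history of `subdir` alone, with the directory prefix stripped."""
    prefix = subdir.rstrip("/") + "/"
    result = []
    for commit in history:
        kept = tuple(p[len(prefix):] for p in commit.paths if p.startswith(prefix))
        if kept:  # drop commits that never touched this subdirectory
            result.append(Commit(commit.message, kept))
    return result


if __name__ == "__main__":
    whole = [
        Commit("add kdelibs core", ("kdelibs/core.cpp",)),
        Commit("add koffice", ("koffice/main.cpp",)),
        Commit("API change", ("kdelibs/core.cpp", "koffice/main.cpp")),
    ]
    for commit in split_history(whole, "kdelibs"):
        print(commit.message, commit.paths)
```

The third commit in the example shows the cost of splitting: a change that was one commit in the big repository ends up as separate commits in each split-out repository, which is where the supermodule comes in to tie them back together.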
But I certainly won't lie to you: importing all the history of KDE is going to be a fairly big project, and it will require people who have good git knowledge to set it up. I suspect (judging by some noises I've seen on the git mailing list and irc channel) that you have those kinds of people already, but it may well be a good idea to avoid doing it as one big "everything at once" kind of event.
So seriously, I would suggest that if there is currently some smaller part of the KDE SVN tree, and the people who work on that part are already more familiar with git than most KDE people necessarily are, I suspect that the best thing to do is to convert just that piece first, and have people migrate in pieces. Because any SCM move is going to be a learning process (the CVS->SVN one is much easier than most, since they really are largely just different faces of the same coin - no real changes in how things fundamentally work as far as the user experience is concerned).
> Finally, a couple points... CVS/SVN might be stupid and moronic, but I
> think it is good to note they are not nearly as bad as some other SCM's.
> Many SCM's used by some of the largest codebases in the world are still
> lock-based. If you think it is difficult to branch/merge using a
> central server, remember that some poor folks can't even change a
> single file without asking the central server for permission.
Sure. Crap exists. That doesn't make CVS/SVN good. It just means that there are even worse things out there.
> It is also good to note that a free distributed SCM was not available
> until recently. The kernel community might have had a special deal with
> BitKeeper, but the same didn't apply to all open source projects AFAIK.
> When KDE moved to svn it was the best tool for the job. That might have
> changed when git became easier to use, but at the time it was simply too
> big of a barrier for new developers and too new. And from what I
> understand git support on other platforms is a recent development.
Git works pretty well on any random unix (although most users are on Linux, with a reasonable minority on OS X - everything else tends to be pretty spotty, and can at times require that you add compiler options etc).
The native Windows support is pretty recent, and still in flux. It's now apparently quite usable, although I don't think there's any real integration with any native Windows development environments (ie it's all either command line or the "native" git visualization tools like git-gui or gitk).