LKML Archive on lore.kernel.org
* Question about your git habits
@ 2008-02-23  0:37 Chase Venters
  2008-02-23  1:36 ` J.C. Pizarro
                   ` (5 more replies)
  0 siblings, 6 replies; 20+ messages in thread
From: Chase Venters @ 2008-02-23  0:37 UTC (permalink / raw)
  To: linux-kernel; +Cc: git

I've been making myself more familiar with git lately and I'm curious what 
habits others have adopted. (I know there are a few documents in circulation 
that deal with using git to work on the kernel but I don't think this has 
been specifically covered).

My question is: If you're working on multiple things at once, do you tend to 
clone the entire repository repeatedly into a series of separate working 
directories and do your work there, then pull that work (possibly comprising 
a series of "temporary" commits) back into a separate local master 
repository with --squash, either into "master" or into a branch containing 
the new feature?

Or perhaps you create a temporary topical branch for each thing you are 
working on, and commit arbitrary changes then checkout another branch when 
you need to change gears, finally --squashing the intermediate commits when a 
particular piece of work is done?

I'm using git to manage my project and I'm trying to determine the best 
workflow I can. I figure that I'm going to have an "official" master 
repository for the project, and I want to keep the revision history clean in 
that repository (i.e., no messy intermediate commits that don't compile or 
only implement a feature halfway).

On older projects I was using a centralized revision control system like 
*cough* Subversion *cough*, and I'd create separate branches which I'd check 
out into their own working trees.

It seems to me that having multiple working trees (effectively, cloning 
the "master" repository every time I need to make anything but a trivial 
change) would be most effective under git as well, since it doesn't require 
creating messy, intermediate commits in the first place (though it still 
allows for them where useful). But I wonder how that approach would scale 
with a project whose git repo weighed hundreds of megs or more. (With a 
centralized RCS, of course, you don't have to lug around a copy of the whole 
project history in each working tree.)

Insight appreciated, and I apologize if I've failed to RTFM somewhere.

Thanks,
Chase

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Question about your git habits
  2008-02-23  0:37 Question about your git habits Chase Venters
@ 2008-02-23  1:36 ` J.C. Pizarro
  2008-02-23  1:37 ` Jan Engelhardt
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 20+ messages in thread
From: J.C. Pizarro @ 2008-02-23  1:36 UTC (permalink / raw)
  To: Chase Venters, LKML

2008/2/23, Chase Venters <chase.venters@clientec.com> wrote:
>
> ... blablabla
>
>  My question is: If you're working on multiple things at once, do you tend to
>  clone the entire repository repeatedly into a series of separate working
>  directories and do your work there, then pull that work (possibly comprising
>  a series of "temporary" commits) back into a separate local master
>  respository with --squash, either into "master" or into a branch containing
>  the new feature?
>
> ... blablabla
>
>  I'm using git to manage my project and I'm trying to determine the most
>  optimal workflow I can. I figure that I'm going to have an "official" master
>  repository for the project, and I want to keep the revision history clean in
>  that repository (ie, no messy intermediate commits that don't compile or only
>  implement a feature half way).

I recommend using these complementary tools:

   1. google: gitk screenshots  ( e.g. http://lwn.net/Articles/140350/ )

   2. google: "git-gui" screenshots
         ( e.g. http://www.spearce.org/2007/01/git-gui-screenshots.html )

   3. google: gitweb color meld

   ;)

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Question about your git habits
  2008-02-23  0:37 Question about your git habits Chase Venters
  2008-02-23  1:36 ` J.C. Pizarro
@ 2008-02-23  1:37 ` Jan Engelhardt
  2008-02-23  1:44   ` Al Viro
  2008-02-23  4:10 ` Daniel Barkalow
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 20+ messages in thread
From: Jan Engelhardt @ 2008-02-23  1:37 UTC (permalink / raw)
  To: Chase Venters; +Cc: linux-kernel, git


On Feb 22 2008 18:37, Chase Venters wrote:
>
>I've been making myself more familiar with git lately and I'm curious what 
>habits others have adopted. (I know there are a few documents in circulation 
>that deal with using git to work on the kernel but I don't think this has 
>been specifically covered).
>
>My question is: If you're working on multiple things at once,

Impossible; humans only have one core with only seven registers --
according to CodingStyle chapter 6, paragraph 4.

>do you tend to clone the entire repository repeatedly into a series
>of separate working directories

Too time consuming on consumer drives with projects the size of Linux.

>and do your work there, then pull
>that work (possibly comprising a series of "temporary" commits) back
>into a separate local master respository with --squash, either into
>"master" or into a branch containing the new feature?

No, just commit the current unfinished work to a new branch and deal
with it later (cherry-pick, rebase, reset --soft, commit --amend -i,
you name it). Or if all else fails, use git-stash.

You do not have to push these temporary branches at all, so it is
much nicer than svn. (Once all the work is done and cleanly in
master, you can kill off all branches without having a record
of their previous existence.)

>Or perhaps you create a temporary topical branch for each thing you
>are working on, and commit arbitrary changes then checkout another
>branch when you need to change gears, finally --squashing the
>intermediate commits when a particular piece of work is done?

If I don't collect arbitrary changes, I don't need squashing
(see reset --soft/amend above).
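
Spelled out as commands, the park-it-on-a-branch cycle described above looks
roughly like this (a scratch repository; branch, file, and commit names are
invented for the example):

```shell
set -e
# Throwaway repo so the commands can run anywhere.
R=$(mktemp -d); git init -q "$R"; cd "$R"
git config user.name Ex; git config user.email ex@example.com
echo base > file.c; git add .; git commit -qm base

# Gears need changing: park the unfinished work on a temporary branch.
echo "half-done hack" >> file.c
git checkout -qb wip-feature
git commit -qam "WIP: feature, does not even compile"
git checkout -q -                  # back to the previous branch, clean tree

# Later: return, finish, and fold the fix into the WIP commit.
git checkout -q wip-feature
echo "finished" >> file.c
git commit -qa --amend -m "feature: do it properly"

# For very short interruptions, git-stash does the parking instead.
echo tweak >> file.c
git stash push -q
git stash pop -q
```

None of these temporary branches ever has to be pushed, and they can be
deleted without a trace once the work is cleanly in master.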


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Question about your git habits
  2008-02-23  1:37 ` Jan Engelhardt
@ 2008-02-23  1:44   ` Al Viro
  2008-02-23  1:51     ` Junio C Hamano
  0 siblings, 1 reply; 20+ messages in thread
From: Al Viro @ 2008-02-23  1:44 UTC (permalink / raw)
  To: Jan Engelhardt; +Cc: Chase Venters, linux-kernel, git

On Sat, Feb 23, 2008 at 02:37:00AM +0100, Jan Engelhardt wrote:

> >do you tend to clone the entire repository repeatedly into a series
> >of separate working directories
> 
> Too time consuming on consumer drives with projects the size of Linux.

git clone -l -s

is not particularly slow...
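
For reference: -l hardlinks the object files instead of copying them, and -s
goes further and shares the source's object store outright via an alternates
file, so almost nothing is duplicated. A quick sketch with a made-up scratch
repository:

```shell
set -e
# Scratch "origin" repo; the path is invented for illustration.
SRC=$(mktemp -d)/linux
git init -q "$SRC"
git -C "$SRC" -c user.name=Ex -c user.email=ex@example.com \
    commit -q --allow-empty -m initial

# -s (--shared) records the parent's object store instead of copying it:
git clone -q -l -s "$SRC" "$SRC-work"
cat "$SRC-work/.git/objects/info/alternates"
```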

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Question about your git habits
  2008-02-23  1:44   ` Al Viro
@ 2008-02-23  1:51     ` Junio C Hamano
  2008-02-23  2:09       ` Al Viro
  0 siblings, 1 reply; 20+ messages in thread
From: Junio C Hamano @ 2008-02-23  1:51 UTC (permalink / raw)
  To: Al Viro; +Cc: Jan Engelhardt, Chase Venters, linux-kernel, git

Al Viro <viro@ZenIV.linux.org.uk> writes:

> On Sat, Feb 23, 2008 at 02:37:00AM +0100, Jan Engelhardt wrote:
>
>> >do you tend to clone the entire repository repeatedly into a series
>> >of separate working directories
>> 
>> Too time consuming on consumer drives with projects the size of Linux.
>
> git clone -l -s
>
> is not particulary slow...

How big is a checkout of a single revision of kernel these days,
compared to a well-packed history since v2.6.12-rc2?

The cost of writing out the work tree files isn't negligible, and is
probably more than that of writing out the repository data (which -s
saves for you).



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Question about your git habits
  2008-02-23  1:51     ` Junio C Hamano
@ 2008-02-23  2:09       ` Al Viro
  2008-02-23  2:23         ` J.C. Pizarro
  0 siblings, 1 reply; 20+ messages in thread
From: Al Viro @ 2008-02-23  2:09 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jan Engelhardt, Chase Venters, linux-kernel, git

On Fri, Feb 22, 2008 at 05:51:04PM -0800, Junio C Hamano wrote:
> Al Viro <viro@ZenIV.linux.org.uk> writes:
> 
> > On Sat, Feb 23, 2008 at 02:37:00AM +0100, Jan Engelhardt wrote:
> >
> >> >do you tend to clone the entire repository repeatedly into a series
> >> >of separate working directories
> >> 
> >> Too time consuming on consumer drives with projects the size of Linux.
> >
> > git clone -l -s
> >
> > is not particulary slow...
> 
> How big is a checkout of a single revision of kernel these days,
> compared to a well-packed history since v2.6.12-rc2?
> 
> The cost of writing out the work tree files isn't ignorable and
> probably more than writing out the repository data (which -s
> saves for you).

Depends...  I'm using ext2 for that and noatime everywhere, so that might
change the picture, but IME it's fast enough...  As for the size, it gets
to ~320 MB on disk, which is comparable to the pack size (240-odd MB).

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Question about your git habits
  2008-02-23  2:09       ` Al Viro
@ 2008-02-23  2:23         ` J.C. Pizarro
  2008-02-23  8:44           ` Alexey Dobriyan
       [not found]           ` <998d0e4a0802221847m431aa136xa217333b0517b962@mail.gmail.com>
  0 siblings, 2 replies; 20+ messages in thread
From: J.C. Pizarro @ 2008-02-23  2:23 UTC (permalink / raw)
  To: Al Viro, LKML

On 2008/2/23, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Fri, Feb 22, 2008 at 05:51:04PM -0800, Junio C Hamano wrote:
>  > Al Viro <viro@ZenIV.linux.org.uk> writes:
>  >
>  > > On Sat, Feb 23, 2008 at 02:37:00AM +0100, Jan Engelhardt wrote:
>  > >
>  > >> >do you tend to clone the entire repository repeatedly into a series
>  > >> >of separate working directories
>  > >>
>  > >> Too time consuming on consumer drives with projects the size of Linux.
>  > >
>  > > git clone -l -s
>  > >
>  > > is not particulary slow...
>  >
>  > How big is a checkout of a single revision of kernel these days,
>  > compared to a well-packed history since v2.6.12-rc2?
>  >
>  > The cost of writing out the work tree files isn't ignorable and
>  > probably more than writing out the repository data (which -s
>  > saves for you).
>
>
> Depends...  I'm using ext2 for that and noatime everywhere, so that might
>  change the picture, but IME it's fast enough...  As for the size, it gets
>  to ~320Mb on disk, which is comparable to the pack size (~240-odd Mb).

Yesterday I git cloned git://foo.com/bar.git   ( 777 MiB )
Today I git cloned git://foo.com/bar.git again   ( 779 MiB )

The two repos are different binaries, and I used 777 MiB + 779 MiB = 1556 MiB
of bandwidth in two days. That's a lot!

Why don't we implement a "binary delta between the old git repo and the
recent git repo", with a "SHA1 verifier of the rebuilt git repo"?

Suppose the size cost of this binary delta is around 52 MiB instead of
2 MiB, due to numerous mismatches between the binary parts; then the
bandwidth over the two days would be 777 MiB + 52 MiB = 829 MiB instead
of 1556 MiB.

Unfortunately, this "binary delta of repos" is not implemented yet :|

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Question about your git habits
  2008-02-23  0:37 Question about your git habits Chase Venters
  2008-02-23  1:36 ` J.C. Pizarro
  2008-02-23  1:37 ` Jan Engelhardt
@ 2008-02-23  4:10 ` Daniel Barkalow
  2008-02-23  5:03   ` Jeff Garzik
  2008-02-23  9:18   ` Mike Hommey
  2008-02-23  4:39 ` Rene Herman
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 20+ messages in thread
From: Daniel Barkalow @ 2008-02-23  4:10 UTC (permalink / raw)
  To: Chase Venters; +Cc: linux-kernel, git

On Fri, 22 Feb 2008, Chase Venters wrote:

> I've been making myself more familiar with git lately and I'm curious what 
> habits others have adopted. (I know there are a few documents in circulation 
> that deal with using git to work on the kernel but I don't think this has 
> been specifically covered).
> 
> My question is: If you're working on multiple things at once, do you tend to 
> clone the entire repository repeatedly into a series of separate working 
> directories and do your work there, then pull that work (possibly comprising 
> a series of "temporary" commits) back into a separate local master 
> respository with --squash, either into "master" or into a branch containing 
> the new feature?
> 
> Or perhaps you create a temporary topical branch for each thing you are 
> working on, and commit arbitrary changes then checkout another branch when 
> you need to change gears, finally --squashing the intermediate commits when a 
> particular piece of work is done?

I find that the sequence of changes I make is pretty much unrelated to the 
sequence of changes that end up in the project's history, because my 
changes as I make them involve writing a lot of stubs (so I can build) and 
then filling them out. It's beneficial to have version control on this so 
that, if I screw up filling out a stub, I can get back to where I was.

Having made a complete series, I then generate a new series of commits, 
each of which does one thing, without any of the bugs that I've since 
resolved, such that the net result matches the end of the messy history, 
except with any debugging or useless stuff skipped. It's this series that 
gets merged into the project history, and I discard the other history.

The real trick is that the early patches in a lot of series often refactor 
existing code in ways that are generally good and necessary for your 
eventual outcome, but which you'd never think of until you've written more 
of the series. Generating a new commit sequence is necessary to end up 
with a history where it looks from the start like you know where you're 
going and have everything done that needs to be done when you get to the 
point of needing it. Furthermore, you want to be able to test these 
commits in isolation, without the distraction of the changes that actually 
prompted them, which means that you want to have your working tree in a 
state that you never actually had it in as you were developing the end 
result.

This means that you'll usually want to rewrite commits for any series that 
isn't a single obvious patch, so it's not a big deal to commit any time 
you want to work on some different branch.
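
One way to sketch that regeneration step: rebuild the clean series on a fresh
branch by taking the finished files from the messy one and committing them one
topic at a time. (In practice git rebase -i is the usual tool; everything
below, names included, is an invented miniature of the idea.)

```shell
set -e
R=$(mktemp -d); git init -q "$R"; cd "$R"
git config user.name Ex; git config user.email ex@example.com
echo core > core.c; git add .; git commit -qm base
git branch -q base                 # where the series starts

# The messy history, as actually developed: stubs first, then filled in.
git checkout -qb messy
echo "stub"        > helper.c; git add .; git commit -qm "WIP: stub out helper"
echo "real helper" > helper.c; git commit -qam "WIP: fill in helper"
echo "feature"     > feature.c; git add .; git commit -qm "WIP: feature + leftover debugging"

# Regenerate: same final tree, rebuilt as clean one-topic-per-commit steps.
git checkout -qb clean base
git checkout -q messy -- .         # take the finished files wholesale
git add .
git commit -qm "helper: add helper"   -- helper.c
git commit -qm "feature: add feature" -- feature.c
```

The clean branch ends at exactly the same tree as the messy one, so the
intermediate history can simply be thrown away.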

	-Daniel
*This .sig left intentionally blank*

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Question about your git habits
  2008-02-23  0:37 Question about your git habits Chase Venters
                   ` (2 preceding siblings ...)
  2008-02-23  4:10 ` Daniel Barkalow
@ 2008-02-23  4:39 ` Rene Herman
  2008-02-23  8:56 ` Willy Tarreau
  2008-02-23  9:10 ` Sam Ravnborg
  5 siblings, 0 replies; 20+ messages in thread
From: Rene Herman @ 2008-02-23  4:39 UTC (permalink / raw)
  To: Chase Venters; +Cc: linux-kernel, git

On 23-02-08 01:37, Chase Venters wrote:

> Or perhaps you create a temporary topical branch for each thing you are 
> working on, and commit arbitrary changes then checkout another branch
> when you need to change gears, finally --squashing the intermediate
> commits when a particular piece of work is done?

No very specific advice to give, but this is what I do, and I then pull all 
(compilable) topic branches into a "local" branch for compilation. Just 
wanted to remark that a definite downside is that switching branches a lot 
also touches the tree a lot, and hence tends to trigger quite unwelcome 
amounts of recompiles. Using ccache would probably be effective in this 
situation but I keep neglecting to check it out...
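
The topics-into-a-local-build-branch arrangement can be sketched like this
(branch and file names invented; the octopus merge pulls several topics in at
once):

```shell
set -e
R=$(mktemp -d); git init -q "$R"; cd "$R"
git config user.name Ex; git config user.email ex@example.com
echo base > Makefile; git add .; git commit -qm base
TRUNK=$(git symbolic-ref --short HEAD)   # master or main, whichever

# Two independent topic branches.
git checkout -qb topic-a "$TRUNK"; echo a > a.c; git add .; git commit -qm "a: add a.c"
git checkout -qb topic-b "$TRUNK"; echo b > b.c; git add .; git commit -qm "b: add b.c"

# Throwaway integration branch: merge every compilable topic for a test build.
git checkout -qb local "$TRUNK"
git merge -q --no-edit topic-a topic-b
```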

Rene

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Question about your git habits
  2008-02-23  4:10 ` Daniel Barkalow
@ 2008-02-23  5:03   ` Jeff Garzik
  2008-02-23  9:18   ` Mike Hommey
  1 sibling, 0 replies; 20+ messages in thread
From: Jeff Garzik @ 2008-02-23  5:03 UTC (permalink / raw)
  To: Daniel Barkalow; +Cc: Chase Venters, linux-kernel, git

Daniel Barkalow wrote:
> I find that the sequence of changes I make is pretty much unrelated to the 
> sequence of changes that end up in the project's history, because my 
> changes as I make them involve writing a lot of stubs (so I can build) and 
> then filling them out. It's beneficial to have version control on this so 
> that, if I screw up filling out a stub, I can get back to where I was.
> 
> Having made a complete series, I then generate a new series of commits, 
> each of which does one thing, without any bugs that I've resolved, such 
> that the net result is the end of the messy history, except with any 
> debugging or useless stuff skipped. It's this series that gets merged into 
> the project history, and I discard the other history.
> 
> The real trick is that the early patches in a lot of series often refactor 
> existing code in ways that are generally good and necessary for your 
> eventual outcome, but which you'd never think of until you've written more 
> of the series.

That summarizes well how I do original development, too.  Whether it's a 
branch of an existing repo, or a newly cloned repo, when working on new 
code I will do a first pass, committing as I go to provide useful 
checkpoints.

Once I reach a satisfactory state, I'll refactor the patches so that 
they make sense for upstream submission.

	Jeff



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Question about your git habits
  2008-02-23  2:23         ` J.C. Pizarro
@ 2008-02-23  8:44           ` Alexey Dobriyan
       [not found]           ` <998d0e4a0802221847m431aa136xa217333b0517b962@mail.gmail.com>
  1 sibling, 0 replies; 20+ messages in thread
From: Alexey Dobriyan @ 2008-02-23  8:44 UTC (permalink / raw)
  To: J.C. Pizarro; +Cc: Al Viro, LKML

On Sat, Feb 23, 2008 at 03:23:49AM +0100, J.C. Pizarro wrote:
> On 2008/2/23, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > On Fri, Feb 22, 2008 at 05:51:04PM -0800, Junio C Hamano wrote:
> >  > Al Viro <viro@ZenIV.linux.org.uk> writes:
> >  >
> >  > > On Sat, Feb 23, 2008 at 02:37:00AM +0100, Jan Engelhardt wrote:
> >  > >
> >  > >> >do you tend to clone the entire repository repeatedly into a series
> >  > >> >of separate working directories
> >  > >>
> >  > >> Too time consuming on consumer drives with projects the size of Linux.
> >  > >
> >  > > git clone -l -s
> >  > >
> >  > > is not particulary slow...
> >  >
> >  > How big is a checkout of a single revision of kernel these days,
> >  > compared to a well-packed history since v2.6.12-rc2?
> >  >
> >  > The cost of writing out the work tree files isn't ignorable and
> >  > probably more than writing out the repository data (which -s
> >  > saves for you).
> >
> >
> > Depends...  I'm using ext2 for that and noatime everywhere, so that might
> >  change the picture, but IME it's fast enough...  As for the size, it gets
> >  to ~320Mb on disk, which is comparable to the pack size (~240-odd Mb).
> 
> Yesterday, i had git cloned git://foo.com/bar.git   ( 777 MiB )
> Today, i've git cloned git://foo.com/bar.git   ( 779 MiB )
> 
> Both repos are different binaries , and i used 777 MiB + 779 MiB = 1556 MiB
> of bandwidth in two days. It's much!
> 
> Why don't we implement "binary delta between old git repo and recent git repo"
> with "SHA1 built git repo verifier"?
> 
> Suppose the size cost of this binary delta is e.g. around 52 MiB instead of
> 2 MiB due to numerous mismatching of binary parts, then the bandwidth
> in two days will be 777 MiB + 52 MiB = 829 MiB instead of 1556 MiB.
> 
> Unfortunately, this "binary delta of repos" is not implemented yet :|

See git-pull .
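
That is: git pull (git fetch underneath) already does the incremental
transfer being asked for. It negotiates which objects the local clone
already has and sends only the missing ones as a pack. A scratch-repo sketch
(all paths invented):

```shell
set -e
SRC=$(mktemp -d)/upstream
git init -q "$SRC"
git -C "$SRC" -c user.name=Ex -c user.email=ex@example.com \
    commit -q --allow-empty -m day1

# The expensive full clone happens once.
git clone -q "$SRC" "$SRC-local"

# A new commit appears upstream...
git -C "$SRC" -c user.name=Ex -c user.email=ex@example.com \
    commit -q --allow-empty -m day2

# ...and a pull moves only the objects the local clone is missing.
git -C "$SRC-local" pull -q --ff-only
```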

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Question about your git habits
  2008-02-23  0:37 Question about your git habits Chase Venters
                   ` (3 preceding siblings ...)
  2008-02-23  4:39 ` Rene Herman
@ 2008-02-23  8:56 ` Willy Tarreau
  2008-02-23  9:10 ` Sam Ravnborg
  5 siblings, 0 replies; 20+ messages in thread
From: Willy Tarreau @ 2008-02-23  8:56 UTC (permalink / raw)
  To: Chase Venters; +Cc: linux-kernel, git

On Fri, Feb 22, 2008 at 06:37:14PM -0600, Chase Venters wrote:
> It seems to me that having multiple working trees (effectively, cloning 
> the "master" repository every time I need to make anything but a trivial 
> change) would be most effective under git as well as it doesn't require 
> creating messy, intermediate commits in the first place (but allows for them 
> if they are used). But I wonder how that approach would scale with a project 
> whose git repo weighed hundreds of megs or more. (With a centralized rcs, of 
> course, you don't have to lug around a copy of the whole project history in 
> each working tree.)

Take a look at git-new-workdir in git's contrib directory. I'm using it a
lot now. It makes it possible to set up as many workdirs as you want, all
sharing the same repo. It's very dangerous if you're not rigorous, but it
saves a lot of time when you work on several branches at a time, which is
even more true for a project's documentation. The one real thing to be
careful about is never to have the same branch checked out in several places.
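
git-new-workdir is a small shell script; in essence it symlinks the shared
repository metadata into a new .git directory, copies HEAD, and checks out a
different branch. A stripped-down sketch of that core (error handling
omitted, paths invented):

```shell
set -e
# Scratch repo, plus a second working directory sharing the same .git data.
SRC=$(mktemp -d)/proj
git init -q "$SRC"
git -C "$SRC" -c user.name=Ex -c user.email=ex@example.com \
    commit -q --allow-empty -m initial
git -C "$SRC" branch -q docs          # branch for the second workdir

# The essence of contrib's git-new-workdir: symlink the shared metadata,
# copy HEAD, then check out a *different* branch.
NEW=$(mktemp -d)/proj-docs
mkdir -p "$NEW/.git/logs"
for x in config refs logs/refs objects info hooks packed-refs remotes; do
    ln -s "$SRC/.git/$x" "$NEW/.git/$x"
done
cp "$SRC/.git/HEAD" "$NEW/.git/HEAD"
git -C "$NEW" checkout -qf docs       # never the branch checked out in $SRC
```

Since refs and objects are symlinked rather than copied, a commit made in
either workdir is immediately visible in the other, which is both the appeal
and the danger Willy mentions.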

Regards,
Willy


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Question about your git habits
  2008-02-23  0:37 Question about your git habits Chase Venters
                   ` (4 preceding siblings ...)
  2008-02-23  8:56 ` Willy Tarreau
@ 2008-02-23  9:10 ` Sam Ravnborg
  5 siblings, 0 replies; 20+ messages in thread
From: Sam Ravnborg @ 2008-02-23  9:10 UTC (permalink / raw)
  To: Chase Venters; +Cc: linux-kernel, git

On Fri, Feb 22, 2008 at 06:37:14PM -0600, Chase Venters wrote:
> I've been making myself more familiar with git lately and I'm curious what 
> habits others have adopted. (I know there are a few documents in circulation 
> that deal with using git to work on the kernel but I don't think this has 
> been specifically covered).
> 
> My question is: If you're working on multiple things at once, do you tend to 
> clone the entire repository repeatedly into a series of separate working 
> directories and do your work there, then pull that work (possibly comprising 
> a series of "temporary" commits) back into a separate local master 
> respository with --squash, either into "master" or into a branch containing 
> the new feature?

The simple (for me) workflow I use is to create a clone of the
kernel for each 'topic' I work on.
So at any one time I may have anywhere from one to five clones of the
kernel.

When I want to combine things I use git format-patch and git am.
Often some amount of editing is done before combining stuff,
especially for larger changes where the first patches in the series are
often preparatory work that was identified in random order while I did
the initial work.
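
In miniature, the format-patch/am hand-off between two clones looks like
this (repo and file names invented for the example):

```shell
set -e
WORK=$(mktemp -d)
git init -q "$WORK/topic"; cd "$WORK/topic"
git config user.name Ex; git config user.email ex@example.com
echo base > driver.c; git add .; git commit -qm base

# A second clone standing in for the tree where things get combined.
git clone -q . "$WORK/combine"
git -C "$WORK/combine" config user.name Ex
git -C "$WORK/combine" config user.email ex@example.com

# Work in the topic clone...
echo fix >> driver.c; git commit -qam "driver: fix the frobnicator"

# ...then carry it over as mailbox patches, editing them on the way if needed.
git format-patch -o "$WORK/patches" HEAD~1 >/dev/null
git -C "$WORK/combine" am -q "$WORK/patches"/*.patch
```

The patch files are plain text, so reordering or touching them up before
git am is just editing files.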

	Sam

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Question about your git habits
  2008-02-23  4:10 ` Daniel Barkalow
  2008-02-23  5:03   ` Jeff Garzik
@ 2008-02-23  9:18   ` Mike Hommey
  1 sibling, 0 replies; 20+ messages in thread
From: Mike Hommey @ 2008-02-23  9:18 UTC (permalink / raw)
  To: Daniel Barkalow; +Cc: Chase Venters, linux-kernel, git

On Fri, Feb 22, 2008 at 11:10:48PM -0500, Daniel Barkalow wrote:
> I find that the sequence of changes I make is pretty much unrelated to the 
> sequence of changes that end up in the project's history, because my 
> changes as I make them involve writing a lot of stubs (so I can build) and 
> then filling them out. It's beneficial to have version control on this so 
> that, if I screw up filling out a stub, I can get back to where I was.
> 
> Having made a complete series, I then generate a new series of commits, 
> each of which does one thing, without any bugs that I've resolved, such 
> that the net result is the end of the messy history, except with any 
> debugging or useless stuff skipped. It's this series that gets merged into 
> the project history, and I discard the other history.
> 
> The real trick is that the early patches in a lot of series often refactor 
> existing code in ways that are generally good and necessary for your 
> eventual outcome, but which you'd never think of until you've written more 
> of the series. Generating a new commit sequence is necessary to end up 
> with a history where it looks from the start like you know where you're 
> going and have everything done that needs to be done when you get to the 
> point of needing it. Furthermore, you want to be able to test these 
> commits in isolation, without the distraction of the changes that actually 
> prompted them, which means that you want to have your working tree is a 
> state that you never actually had it in as you were developing the end 
> result.
> 
> This means that you'll usually want to rewrite commits for any series that 
> isn't a single obvious patch, so it's not a big deal to commit any time 
> you want to work on some different branch.

I do that so much that I have this alias:
        reorder = !sh -c 'git rebase -i --onto $0 $0 $1'

... and actually pass it only one argument most of the time.
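
For anyone puzzling over the alias: `git reorder <base> [<branch>]` expands to
`git rebase -i --onto <base> <base> [<branch>]`, i.e. "interactively replay
everything since <base>". A runnable sketch (repo contents invented; the
GIT_SEQUENCE_EDITOR=: trick just accepts the todo list unchanged so the demo
needs no editor):

```shell
set -e
R=$(mktemp -d); git init -q "$R"; cd "$R"
git config user.name Ex; git config user.email ex@example.com
echo base > a.txt; git add .; git commit -qm base
git branch -q base
echo one >> a.txt; git commit -qam "step 1"
echo two >> a.txt; git commit -qam "step 2"

# Mike's alias, as it would appear after `git config`:
git config alias.reorder "!sh -c 'git rebase -i --onto \$0 \$0 \$1'"

# With one argument, $1 is empty and the current branch is rebased.
GIT_SEQUENCE_EDITOR=: git reorder base
```

Normally the sequence editor is where you reorder, squash, and reword the
picks.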

Mike

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Question about your git habits
       [not found]             ` <20080223113952.GA4936@hashpling.org>
@ 2008-02-23 13:08               ` J.C. Pizarro
  2008-02-23 13:17                 ` Charles Bailey
  0 siblings, 1 reply; 20+ messages in thread
From: J.C. Pizarro @ 2008-02-23 13:08 UTC (permalink / raw)
  To: Charles Bailey, LKML, git

On 2008/2/23, Charles Bailey <charles@hashpling.org> wrote:
> On Sat, Feb 23, 2008 at 03:47:07AM +0100, J.C. Pizarro wrote:
>  >
>  > Yesterday, i had git cloned git://foo.com/bar.git   ( 777 MiB )
>  >  Today, i've git cloned git://foo.com/bar.git   ( 779 MiB )
>  >
>  >  Both repos are different binaries , and i used 777 MiB + 779 MiB = 1556 MiB
>  >  of bandwidth in two days. It's much!
>  >
>  >  Why don't we implement "binary delta between old git repo and recent git repo"
>  >  with "SHA1 built git repo verifier"?
>  >
>  >  Suppose the size cost of this binary delta is e.g. around 52 MiB instead of
>  >  2 MiB due to numerous mismatching of binary parts, then the bandwidth
>  >  in two days will be 777 MiB + 52 MiB = 829 MiB instead of 1556 MiB.
>  >
>  >  Unfortunately, this "binary delta of repos" is not implemented yet :|
>
>
> It sounds like what concerns you is the bandwith to git://foo.bar. If
>  you are cloning the first repository to somewhere were the first
>  clone is accessible and bandwidth between the clones is not an issue,
>  then you should be able to use the --reference parameter to git clone
>  to just fetch the missing ~2 MiB from foo.bar.
>
>  A "binary delta of repos" should just be an 'incremental' pack file
>  and the git protocol should support generating an appropriate one. I'm
>  not quite sure what "not implemented yet" feature you are looking for.

But if the repos are aggressively repacked then the bit-to-bit differences
are not ~2 MiB.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Question about your git habits
  2008-02-23 13:08               ` J.C. Pizarro
@ 2008-02-23 13:17                 ` Charles Bailey
  2008-02-23 13:36                   ` J.C. Pizarro
  0 siblings, 1 reply; 20+ messages in thread
From: Charles Bailey @ 2008-02-23 13:17 UTC (permalink / raw)
  To: J.C. Pizarro; +Cc: LKML, git

On Sat, Feb 23, 2008 at 02:08:35PM +0100, J.C. Pizarro wrote:
> 
> But if the repos are aggressively repacked then the bit to bit differences
> are not ~2 MiB.

It shouldn't matter how aggressively the repositories are packed or what
the binary differences between the pack files are. git clone
should (with the --reference option) generate a new pack for you with
only the missing objects. If these objects are ~52 MiB then a lot has
been committed to the repository, but you're not going to be able to
get around a big download any other way.
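
Concretely, --reference lets today's clone borrow every object it can from
yesterday's local copy and fetch only what is missing (paths invented):

```shell
set -e
SRC=$(mktemp -d)/upstream
git init -q "$SRC"
git -C "$SRC" -c user.name=Ex -c user.email=ex@example.com \
    commit -q --allow-empty -m day1

# Yesterday's full clone.
git clone -q "$SRC" "$SRC-yesterday"

# Something lands upstream overnight.
git -C "$SRC" -c user.name=Ex -c user.email=ex@example.com \
    commit -q --allow-empty -m day2

# Today's clone reuses yesterday's objects via an alternates entry.
git clone -q --reference "$SRC-yesterday" "$SRC" "$SRC-today"
cat "$SRC-today/.git/objects/info/alternates"
```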

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Question about your git habits
  2008-02-23 13:17                 ` Charles Bailey
@ 2008-02-23 13:36                   ` J.C. Pizarro
  2008-02-23 14:01                     ` Charles Bailey
  0 siblings, 1 reply; 20+ messages in thread
From: J.C. Pizarro @ 2008-02-23 13:36 UTC (permalink / raw)
  To: Charles Bailey, LKML, git

On 2008/2/23, Charles Bailey <charles@hashpling.org> wrote:
> On Sat, Feb 23, 2008 at 02:08:35PM +0100, J.C. Pizarro wrote:
>  >
>  > But if the repos are aggressively repacked then the bit to bit differences
>  > are not ~2 MiB.
>
>
> It shouldn't matter how aggressively the repositories are packed or what
>  the binary differences are between the pack files are. git clone
>  should (with the --reference option) generate a new pack for you with
>  only the missing objects. If these objects are ~52 MiB then a lot has
>  been committed to the repository, but you're not going to be able to
>  get around a big download any other way.

You're wrong; nothing of ~52 MiB has to be committed to the repository.

I'm not saying "commit", I'm saying:

"Assume A and B are binary git repos and delta_B-A is another binary file; I
request building B' = A + delta_B-A, where SHA1(B') = SHA1(B) is verified to
avoid corruption."

Assume B is the more heavily repacked version of "A + the minor commits of
the day", as if B had spent 24 more hours optimizing the minimum spanning
tree. Wow!!!

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Question about your git habits
  2008-02-23 13:36                   ` J.C. Pizarro
@ 2008-02-23 14:01                     ` Charles Bailey
  2008-02-23 17:10                       ` J.C. Pizarro
  0 siblings, 1 reply; 20+ messages in thread
From: Charles Bailey @ 2008-02-23 14:01 UTC (permalink / raw)
  To: J.C. Pizarro; +Cc: LKML, git

On Sat, Feb 23, 2008 at 02:36:59PM +0100, J.C. Pizarro wrote:
> On 2008/2/23, Charles Bailey <charles@hashpling.org> wrote:
> >
> > It shouldn't matter how aggressively the repositories are packed or what
> >  the binary differences are between the pack files are. git clone
> >  should (with the --reference option) generate a new pack for you with
> >  only the missing objects. If these objects are ~52 MiB then a lot has
> >  been committed to the repository, but you're not going to be able to
> >  get around a big download any other way.
> 
> You're wrong, nothing has to be commited ~52 MiB to the repository.
> 
> I'm not saying "commit", i'm saying
> 
> "Assume A & B binary git repos and delta_B-A another binary file, i
> request built
> B' = A + delta_B-A where is verified SHA1(B') = SHA1(B) for avoiding
> corrupting".
> 
> Assume B is the higher repacked version of "A + minor commits of the day"
> as if B was optimizing 24 hours more the minimum spanning tree. Wow!!!
> 

I'm not sure that I understand where you are going with this.
Originally, you stated that if you clone a 777 MiB repository on day
one, and then you clone it again on day two when it is 779 MiB, then
you currently have to download 777 + 779 MiB of data, whereas you could
instead download a 52 MiB binary diff. I have no idea where that value
of 52 MiB comes from, and I've no idea how many objects were committed
between day one and day two. If we're going to talk about details,
then you need to provide more details about your scenario.

Having said that, here is my original point in some more detail. git
repositories are not binary blobs, they are object databases. Better
than this, they are databases of immutable objects. This means that to
get the difference between one database and another, you only need to
add the objects that are missing from the other database. If the two
databases are actually a database and the same database at short time
interval later, then almost all the objects are going to be common and
the difference will be a small set of objects. Using git:// this set
of objects can be efficiently transferred as a pack file. You may have
a corner case scenario where the following isn't true, but in my
experience an incremental pack file will be a more compact
representation of this difference than a binary difference of two
aggressively repacked git repositories as generated by a generic
binary difference engine.
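
The object-database point above can be played out with a toy model (a
conceptual sketch only, not git's actual object format or wire protocol;
real git hashes a "<type> <size>\0" header plus the content, and packs use
delta compression on top):

```python
import hashlib

def oid(data: bytes) -> str:
    # Content-addressed name for an immutable object (simplified: git
    # actually prepends a "<type> <size>\0" header before hashing).
    return hashlib.sha1(data).hexdigest()

def db(objects):
    # An object database: immutable objects keyed by their content hash.
    return {oid(o): o for o in objects}

day1 = db([b"blob: README", b"tree: /", b"commit: initial"])
day2 = db([b"blob: README", b"tree: /", b"commit: initial",
           b"blob: bar.txt", b"commit: add bar.txt"])

# Because objects are immutable, the difference between the two snapshots
# is exactly the objects missing from the older database -- nothing that
# day1 already has needs to be re-sent.
missing = {k: v for k, v in day2.items() if k not in day1}
print(len(missing))  # 2
```

This is the sense in which an incremental pack is "just the missing
objects": common history costs nothing to transfer, regardless of how
either side has repacked it on disk.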

I'm sorry if I've misunderstood your last point. Perhaps you could
expand on the exact issue you are having if I have, as I'm not sure
that I've really answered your last message.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Question about your git habits
  2008-02-23 14:01                     ` Charles Bailey
@ 2008-02-23 17:10                       ` J.C. Pizarro
  2008-02-23 18:19                         ` J.C. Pizarro
  0 siblings, 1 reply; 20+ messages in thread
From: J.C. Pizarro @ 2008-02-23 17:10 UTC (permalink / raw)
  To: Charles Bailey, LKML, git

On 2008/2/23, Charles Bailey <charles@hashpling.org> wrote:
> On Sat, Feb 23, 2008 at 02:36:59PM +0100, J.C. Pizarro wrote:
>  > On 2008/2/23, Charles Bailey <charles@hashpling.org> wrote:
>  > >
>
> > > It shouldn't matter how aggressively the repositories are packed or what
>  > >  the binary differences between the pack files are. git clone
>  > >  should (with the --reference option) generate a new pack for you with
>  > >  only the missing objects. If these objects are ~52 MiB then a lot has
>  > >  been committed to the repository, but you're not going to be able to
>  > >  get around a big download any other way.
>  >
>  > You're wrong; ~52 MiB does not have to be committed to the repository.
>  >
>  > I'm not saying "commit", I'm saying
>  >
>  > "Assume A & B are binary git repos and delta_B-A is another binary file; I
>  > request building B' = A + delta_B-A, where SHA1(B') = SHA1(B) is verified
>  > to avoid corruption".
>  >
>  > Assume B is the more heavily repacked version of "A + minor commits of the
>  > day", as if B had spent 24 more hours optimizing the minimum spanning tree. Wow!!!
>  >
>
>
> I'm not sure that I understand where you are going with this.
>  Originally, you stated that if you clone a 775 MiB repository on day
>  one, and then you clone it again on day two when it was 777 MiB, then
>  you currently have to download 775 + 777 MiB of data, whereas you
>  could download a 52 MiB binary diff. I have no idea where that value
>  of 52 MiB comes from, and I've no idea how many objects were committed
>  between day one and day two. If we're going to talk about details,
>  then you need to provide more details about your scenario.

I didn't say that the "A & B binary git repos" are binary files; I said that
delta_B-A is a binary file.

I said ~15 hours ago: "Suppose the size cost of this binary delta is e.g.
around 52 MiB instead of 2 MiB due to numerous mismatches of binary parts ..."

A binary delta is different from the textual delta (between lines of text)
used in the git scheme (commits or changesets use textual deltas).
A textual delta can be compressed, resulting in a smaller binary object.
The git repository is a collection of such binary objects and some more.
You can't apply a textual delta to a git repository, only a binary delta.
You could apply a binary delta between two git-repacked repositories if there
were a program that generates a binary delta of two directories, but that's
not implemented yet.
The SHA1 verifier is useful to avoid corruption of the generated repository
(if it's corrupted, then the delta, or the whole repository, has to be
cloned again until it's non-corrupted).
An example of the same SHA1 for two directories can be implemented as the
SHA1 of the sorted SHA1s of contents, filenames and properties. Anything
altered, added or eliminated from them implies a different SHA1.
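
Such a directory-level SHA1 could be sketched like this (a hypothetical
scheme for illustration only; it hashes relative filenames and contents,
ignoring the "properties" mentioned above):

```python
import hashlib
import os

def dir_sha1(root: str) -> str:
    # SHA1 over the sorted (relative filename, content-SHA1) pairs, so any
    # file that is altered, added or eliminated changes the directory hash.
    entries = []
    for dirpath, _, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root)
            with open(path, "rb") as f:
                entries.append((rel, hashlib.sha1(f.read()).hexdigest()))
    h = hashlib.sha1()
    for rel, digest in sorted(entries):
        h.update(rel.encode() + b"\0" + digest.encode() + b"\n")
    return h.hexdigest()
```

Two directories with identical filenames and contents then hash identically
no matter what order the filesystem lists them in, which is what a receiver
would need in order to verify SHA1(B') = SHA1(B) after applying a delta.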

Don't you understand what I'm saying? I will give you a practical example.
1. zip -r -8 foo1.zip foo1   # in foo1 there are tons of information,
     as from a git repo
2. mv foo1 foo2 ; cp bar.txt foo2/
3. zip -r -9 foo2.zip foo2   # still a little more optimized (= more
     heavily repacked)
4. Apply a binary delta between foo1.zip & foo2.zip with a hypothetical
     program "deltaier" and you get delta_foo1_foo2.bin. The
     size(delta_foo1_foo2.bin) is not nearly ~( size(foo2.zip) - size(foo1.zip) ).
5. Apply a hexadecimal diff and you will understand why it gives the
     exemplary ~52 MiB instead of the ~2 MiB that I said.
6. You will see some identical parts in both foo1.zip and foo2.zip.
     Identical parts are good for smaller binary deltas. It's possible to get
     still smaller binary deltas when the identical parts are at random
     offsets or random locations, depending on how advanced the deltaier
     program is.
7. Same as above, but instead of two files, apply a binary delta between
     the two directories.
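
A miniature version of the steps above can be played out with zlib standing
in for zip and a naive longest-common-prefix scan standing in for the
hypothetical "deltaier" program (both are stand-ins chosen for illustration;
a real delta tool would find matches at arbitrary offsets):

```python
import hashlib
import zlib

base = bytes(range(256)) * 512        # foo1: tons of repo-like data (~128 KiB)
extra = b"bar.txt contents\n" * 64    # the small file copied into foo2
                                      # (placed first here, since a new file
                                      # can land anywhere in the archive)

a = zlib.compress(base, 8)            # foo1.zip analogue (level 8)
b = zlib.compress(extra + base, 9)    # foo2.zip analogue (level 9, repacked more)

# Naive "binary delta": keep the common prefix, resend everything after it.
prefix = 0
while prefix < min(len(a), len(b)) and a[prefix] == b[prefix]:
    prefix += 1
delta = b[prefix:]

# The compressed streams typically diverge almost immediately, so this naive
# delta is nearly as large as foo2.zip itself -- the "~52 MiB instead of
# ~2 MiB" effect in miniature.
b_prime = b[:prefix] + delta

# Integrity check in the spirit of SHA1(B') = SHA1(B).
assert hashlib.sha1(b_prime).digest() == hashlib.sha1(b).digest()
```

The point the example makes either way: recompressing perturbs the byte
stream, so a delta tool that only matches data at the same offsets sees few
identical parts, while a smarter one that matches at arbitrary offsets can
do much better.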

>  Having said that, here is my original point in some more detail. git
>  repositories are not binary blobs, they are object databases. Better
>  than this, they are databases of immutable objects. This means that to
>  get the difference between one database and another, you only need to
>  add the objects that are missing from the other database.

Databases of immutable objects <--- You're wrong, because you're confusing
things. There are mutable objects, such as the better deltas of the minimum
spanning tree.

The missing objects are not only the missing sources that you're thinking of;
they can be anything (blob, tree, commit, tag, etc.). The deltas of the
minimum spanning tree are also objects of the database, and they can be
erased or added when the spanning tree is altered (because the altered
spanning tree is smaller than the previous one) for a better repack. The best
repack is still an NP-problem, and solving this bigger NP-problem each day
means 24/365 (eternal) computing.

The git database is the top-level ".git/" directory, but it holds repacked
binary information and always has some size, normally measured in the MiBs
that I was citing above.

>  If the two
>  databases are actually a database and the same database a short time
>  interval later, then almost all the objects are going to be common and
>  the difference will be a small set of objects. Using git:// this set
>  of objects can be efficiently transferred as a pack file.

You're saying   repacked(A) + new objects     with the bandwidth cost of the
new objects, but I'm saying   rerepacked(A + new objects)   with the
bandwidth cost of the binary delta, where
    delta = repacked(A) - rerepacked(A + new objects)
and rerepacked(X) means spending more time repacking X again.

>  You may have
>  a corner case scenario where the following isn't true, but in my
>  experience an incremental pack file will be a more compact
>  representation of this difference than a binary difference of two
>  aggressively repacked git repositories as generated by a generic
>  binary difference engine.

Yes, it's simpler and more compact, but the eternal 24/365 repacking can make
it e.g. 30% smaller after a few weeks, while the incremental pack has gained
nothing.

It's a good idea for the weekly user to pick the binary delta and for the
daily developer to pick the incremental pack. Put both modes working in the
git server.

>  I'm sorry if I've misunderstood your last point. Perhaps you could
>  expand on the exact issue you are having if I have, as I'm not sure
>  that I've really answered your last message.

   The misunderstanding can disappear ;)

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Question about your git habits
  2008-02-23 17:10                       ` J.C. Pizarro
@ 2008-02-23 18:19                         ` J.C. Pizarro
  0 siblings, 0 replies; 20+ messages in thread
From: J.C. Pizarro @ 2008-02-23 18:19 UTC (permalink / raw)
  To: LKML, git

Google's Gmail made a mess of my last message: it wrapped my message of X
lines into (X+o) lines, mangling the original lines of the message.

    I don't see Google's motives for mangling the original lines
    of the messages that I had sent.

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2008-02-23 18:19 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-02-23  0:37 Question about your git habits Chase Venters
2008-02-23  1:36 ` J.C. Pizarro
2008-02-23  1:37 ` Jan Engelhardt
2008-02-23  1:44   ` Al Viro
2008-02-23  1:51     ` Junio C Hamano
2008-02-23  2:09       ` Al Viro
2008-02-23  2:23         ` J.C. Pizarro
2008-02-23  8:44           ` Alexey Dobriyan
     [not found]           ` <998d0e4a0802221847m431aa136xa217333b0517b962@mail.gmail.com>
     [not found]             ` <20080223113952.GA4936@hashpling.org>
2008-02-23 13:08               ` J.C. Pizarro
2008-02-23 13:17                 ` Charles Bailey
2008-02-23 13:36                   ` J.C. Pizarro
2008-02-23 14:01                     ` Charles Bailey
2008-02-23 17:10                       ` J.C. Pizarro
2008-02-23 18:19                         ` J.C. Pizarro
2008-02-23  4:10 ` Daniel Barkalow
2008-02-23  5:03   ` Jeff Garzik
2008-02-23  9:18   ` Mike Hommey
2008-02-23  4:39 ` Rene Herman
2008-02-23  8:56 ` Willy Tarreau
2008-02-23  9:10 ` Sam Ravnborg

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).