DeltaWindow.java 14KB

Implement delta generation during packing

PackWriter now produces new deltas if there is not a suitable delta
available for reuse from an existing pack file. This permits JGit to
send less data on the wire by sending a delta relative to an object
the other side already has, instead of sending the whole object.

The delta searching algorithm is similar in style to what C Git uses,
but apparently has some differences (see below for more on this).
Briefly, objects that should be considered for delta compression are
pushed onto a list. This list is then sorted by a rough similarity
score, which is derived from the path name the object was discovered
at in the repository during object counting. The list is then walked
in order.

At each position in the list, up to $WINDOW objects prior to it are
attempted as delta bases. Each object in the window is tried, and the
shortest delta instruction sequence selects the base object. Some
rough rules are used to prevent pathological behavior during this
matching phase, like skipping pairings of objects that are not
similar enough in size.

PackWriter intentionally excludes commits and annotated tags from
this new delta search phase. In the JGit repository only 28 out of
2600+ commits can be delta compressed by C Git. As the commit count
tends to be a fair percentage of the total number of objects in the
repository, and they generally do not delta compress well, skipping
over them can improve performance with little increase in the output
pack size.

Because this implementation was rebuilt from scratch based on my own
memory of how the packing algorithm has evolved over the years in
C Git, PackWriter, DeltaWindow, and DeltaEncoder don't use exactly
the same rules everywhere, and that leads JGit to produce different
(but logically equivalent) pack files.

  Repository | Pack Size (bytes)                | Packing Time
             | JGit     - CGit     = Difference | JGit    / CGit
  -----------+----------------------------------+------------------
  git        | 25094348 - 24322890 = +771458    | 59.434s / 59.133s
  jgit       |  5669515 -  5709046 = -  39531   |  6.654s /  6.806s
  linux-2.6  |     389M -     386M = +3M        | 20m02s  / 18m01s

For the above tests pack.threads was set to 1, window size=10,
delta depth=50, and delta and object reuse were disabled for both
implementations. Both implementations were reading from an already
fully packed repository on local disk. The running time reported is
after 1 warm-up run of the tested implementation.

PackWriter is writing 771 KB more data on git.git, 3M more on
linux-2.6, but is actually 39.5 KB smaller on jgit.git. Being larger
by less than 0.7% on linux-2.6 isn't bad, nor is taking an extra
2 minutes to pack. On the running time side, JGit is at a major
disadvantage because linux-2.6 doesn't fit into the default
WindowCache of 20M, while C Git is able to mmap the entire pack and
have it available instantly in physical memory (assuming hot cache).

C Git also has a feature where it caches deltas that were created
during the compression phase, and uses those cached deltas during
the writing phase. PackWriter does not implement this (yet), and
therefore must create every delta twice. This could easily account
for the increased running time we are seeing.

Change-Id: I6292edc66c2e95fbe45b519b65fdb3918068889c
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
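
The window search described in the commit message can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in DeltaWindow.java: the class DeltaWindowSketch, the Obj type, the file-name sort key, the factor-of-two size rule, and the prefix/suffix cost function are all hypothetical stand-ins. In particular, deltaCost() is a crude substitute for a real delta encoder and only counts the bytes left over after a shared prefix and suffix are copied from the base.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class DeltaWindowSketch {
    /** A candidate object: the path it was discovered at and its raw bytes. */
    static final class Obj {
        final String path;
        final byte[] data;

        Obj(String path, byte[] data) {
            this.path = path;
            this.data = data;
        }
    }

    /** Assumed similarity key: the file name with its directories stripped. */
    static String sortKey(Obj o) {
        return o.path.substring(o.path.lastIndexOf('/') + 1);
    }

    /** Sorts by the rough similarity key, larger objects first within a key. */
    static void sortBySimilarity(List<Obj> objs) {
        Comparator<Obj> byName = Comparator.comparing(DeltaWindowSketch::sortKey);
        objs.sort(byName.thenComparingInt(o -> -o.data.length));
    }

    /**
     * Stand-in for a delta encoder (an assumption, not JGit's DeltaEncoder):
     * the number of target bytes not covered by a common prefix or suffix.
     */
    static int deltaCost(byte[] base, byte[] target) {
        int max = Math.min(base.length, target.length);
        int prefix = 0;
        while (prefix < max && base[prefix] == target[prefix])
            prefix++;
        int suffix = 0;
        while (suffix < max - prefix
                && base[base.length - 1 - suffix] == target[target.length - 1 - suffix])
            suffix++;
        return target.length - prefix - suffix;
    }

    /**
     * Walks the sorted list. For each object, tries up to `window` earlier
     * objects as bases and keeps the cheapest one. Returns the index of the
     * chosen base for each position, or -1 if the object is stored whole.
     */
    static int[] search(List<Obj> sorted, int window) {
        int[] base = new int[sorted.size()];
        Arrays.fill(base, -1);
        for (int i = 0; i < sorted.size(); i++) {
            byte[] target = sorted.get(i).data;
            int best = target.length; // a delta must beat the whole object
            for (int j = Math.max(0, i - window); j < i; j++) {
                byte[] cand = sorted.get(j).data;
                // Rough rule (assumed threshold): skip pairs too unlike in size.
                if (cand.length > 2 * target.length || target.length > 2 * cand.length)
                    continue;
                int cost = deltaCost(cand, target);
                if (cost < best) {
                    best = cost;
                    base[i] = j;
                }
            }
        }
        return base;
    }

    public static void main(String[] args) {
        List<Obj> objs = new ArrayList<>();
        objs.add(new Obj("docs/README",
                "completely different text".getBytes(StandardCharsets.UTF_8)));
        objs.add(new Obj("src/a/Foo.java",
                "hello world, this is version one of the file\n".getBytes(StandardCharsets.UTF_8)));
        objs.add(new Obj("src/b/Foo.java",
                "hello world, this is version two of the file\n".getBytes(StandardCharsets.UTF_8)));
        sortBySimilarity(objs);
        // The two Foo.java blobs sort next to each other, so the second one
        // finds the first inside its window; README matches nothing.
        System.out.println(Arrays.toString(search(objs, 10))); // prints [-1, 0, -1]
    }
}
```

Sorting before the walk is what keeps the window small: likely bases end up adjacent, so only a handful of pairings need to be tried per object. Commits and annotated tags would simply never be added to the list.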
14 years ago
14 years ago
Implement delta generation during packing PackWriter now produces new deltas if there is not a suitable delta available for reuse from an existing pack file. This permits JGit to send less data on the wire by sending a delta relative to an object the other side already has, instead of sending the whole object. The delta searching algorithm is similar in style to what C Git uses, but apparently has some differences (see below for more on). Briefly, objects that should be considered for delta compression are pushed onto a list. This list is then sorted by a rough similarity score, which is derived from the path name the object was discovered at in the repository during object counting. The list is then walked in order. At each position in the list, up to $WINDOW objects prior to it are attempted as delta bases. Each object in the window is tried, and the shortest delta instruction sequence selects the base object. Some rough rules are used to prevent pathological behavior during this matching phase, like skipping pairings of objects that are not similar enough in size. PackWriter intentionally excludes commits and annotated tags from this new delta search phase. In the JGit repository only 28 out of 2600+ commits can be delta compressed by C Git. As the commit count tends to be a fair percentage of the total number of objects in the repository, and they generally do not delta compress well, skipping over them can improve performance with little increase in the output pack size. Because this implementation was rebuilt from scratch based on my own memory of how the packing algorithm has evolved over the years in C Git, PackWriter, DeltaWindow, and DeltaEncoder don't use exactly the same rules everywhere, and that leads JGit to produce different (but logically equivalent) pack files. 
Repository | Pack Size (bytes) | Packing Time | JGit - CGit = Difference | JGit / CGit -----------+----------------------------------+----------------- git | 25094348 - 24322890 = +771458 | 59.434s / 59.133s jgit | 5669515 - 5709046 = - 39531 | 6.654s / 6.806s linux-2.6 | 389M - 386M = +3M | 20m02s / 18m01s For the above tests pack.threads was set to 1, window size=10, delta depth=50, and delta and object reuse was disabled for both implementations. Both implementations were reading from an already fully packed repository on local disk. The running time reported is after 1 warm-up run of the tested implementation. PackWriter is writing 771 KiB more data on git.git, 3M more on linux-2.6, but is actually 39.5 KiB smaller on jgit.git. Being larger by less than 0.7% on linux-2.6 isn't bad, nor is taking an extra 2 minutes to pack. On the running time side, JGit is at a major disadvantage because linux-2.6 doesn't fit into the default WindowCache of 20M, while C Git is able to mmap the entire pack and have it available instantly in physical memory (assuming hot cache). CGit also has a feature where it caches deltas that were created during the compression phase, and uses those cached deltas during the writing phase. PackWriter does not implement this (yet), and therefore must create every delta twice. This could easily account for the increased running time we are seeing. Change-Id: I6292edc66c2e95fbe45b519b65fdb3918068889c Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
14 years ago
Implement delta generation during packing PackWriter now produces new deltas if there is not a suitable delta available for reuse from an existing pack file. This permits JGit to send less data on the wire by sending a delta relative to an object the other side already has, instead of sending the whole object. The delta searching algorithm is similar in style to what C Git uses, but apparently has some differences (see below for more on). Briefly, objects that should be considered for delta compression are pushed onto a list. This list is then sorted by a rough similarity score, which is derived from the path name the object was discovered at in the repository during object counting. The list is then walked in order. At each position in the list, up to $WINDOW objects prior to it are attempted as delta bases. Each object in the window is tried, and the shortest delta instruction sequence selects the base object. Some rough rules are used to prevent pathological behavior during this matching phase, like skipping pairings of objects that are not similar enough in size. PackWriter intentionally excludes commits and annotated tags from this new delta search phase. In the JGit repository only 28 out of 2600+ commits can be delta compressed by C Git. As the commit count tends to be a fair percentage of the total number of objects in the repository, and they generally do not delta compress well, skipping over them can improve performance with little increase in the output pack size. Because this implementation was rebuilt from scratch based on my own memory of how the packing algorithm has evolved over the years in C Git, PackWriter, DeltaWindow, and DeltaEncoder don't use exactly the same rules everywhere, and that leads JGit to produce different (but logically equivalent) pack files. 
Repository | Pack Size (bytes) | Packing Time | JGit - CGit = Difference | JGit / CGit -----------+----------------------------------+----------------- git | 25094348 - 24322890 = +771458 | 59.434s / 59.133s jgit | 5669515 - 5709046 = - 39531 | 6.654s / 6.806s linux-2.6 | 389M - 386M = +3M | 20m02s / 18m01s For the above tests pack.threads was set to 1, window size=10, delta depth=50, and delta and object reuse was disabled for both implementations. Both implementations were reading from an already fully packed repository on local disk. The running time reported is after 1 warm-up run of the tested implementation. PackWriter is writing 771 KiB more data on git.git, 3M more on linux-2.6, but is actually 39.5 KiB smaller on jgit.git. Being larger by less than 0.7% on linux-2.6 isn't bad, nor is taking an extra 2 minutes to pack. On the running time side, JGit is at a major disadvantage because linux-2.6 doesn't fit into the default WindowCache of 20M, while C Git is able to mmap the entire pack and have it available instantly in physical memory (assuming hot cache). CGit also has a feature where it caches deltas that were created during the compression phase, and uses those cached deltas during the writing phase. PackWriter does not implement this (yet), and therefore must create every delta twice. This could easily account for the increased running time we are seeing. Change-Id: I6292edc66c2e95fbe45b519b65fdb3918068889c Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
14 years ago
Fix missing deltas near type boundaries

Delta search was discarding discovered deltas if an object appeared near a type boundary in the delta search window. This caused JGit to produce larger pack files than other implementations of the packing algorithm.

Delta search works by pushing prior objects into a search window, an ordered list of objects to attempt to delta compress the next object against. (The window size is bounded, avoiding O(N^2) behavior.)

For implementation reasons multiple object types can appear in the input list and in the window. PackWriter commonly passes both trees and blobs in the input list handed to the DeltaWindow algorithm. The pack file format requires an object to delta compress only against an object of the same type, so the DeltaWindow algorithm must stop doing comparisons if a blob would be compared to a tree.

Because the input list is sorted by object type and the window holds the most recently considered prior objects, once a wrong type is discovered in the window the search algorithm stops and uses the current result.

Unfortunately the termination condition was discarding any found delta by setting deltaBase and deltaBuf to null when it was trying to break the window search.

When this bug occurs, the state of the DeltaWindow looks like this:

                                 current
                                    |
                                   \ /
    input list:  tree0 tree1 blob1 blob2

    window:  blob1 tree1 tree0
              / \
               |
            res.prev

As the loop iterates to the right across the window, it first finds that blob1 is a suitable delta base for blob2, and temporarily holds this in the bestDelta/deltaBuf fields. It then considers tree1, but tree1 has the wrong type (blob != tree), so the window loop must give up and fall through the remaining code.

Moving the condition up and discarding the window contents allows the bestDelta/deltaBuf to be kept, letting the final file delta compress blob2 against blob1.

The impact of this bug (and its fix) on real world repositories is likely minimal. The boundary from blob to tree happens approximately once in the search, as the input list is sorted by type. Only the first window's worth of blobs (e.g. 10 or 250) failed to produce a delta in the final file.

This bug fix does produce significantly different results for small test repositories created in the unit test suite, such as when a pack contains 6 objects (2 commits, 2 trees, 2 blobs). Packing test cases can now better sample different output pack file sizes depending on delta compression and object reuse flags in PackConfig.

Change-Id: Ibec09398d0305d4dbc0c66fce1daaf38eb71148f
7 years ago
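The corrected termination behavior described above can be illustrated with a minimal sketch. This is not the DeltaWindow code: the class, the Entry type, and the size-difference cost are hypothetical stand-ins chosen only to show that stopping at a type boundary should keep the best base found so far.

```java
public class TypeBoundarySketch {
    enum Type { TREE, BLOB }

    // Hypothetical stand-in for a window entry: a name, a type, and a size.
    static final class Entry {
        final String name;
        final Type type;
        final int size;

        Entry(String name, Type type, int size) {
            this.name = name;
            this.type = type;
            this.size = size;
        }
    }

    /**
     * Walks the window from the most recent entry to the oldest and returns
     * the best same-type base seen before the first type mismatch, or null
     * if no candidate of the right type was seen.
     */
    static Entry search(Entry target, Entry[] window) {
        Entry best = null;
        int bestCost = Integer.MAX_VALUE;
        for (Entry candidate : window) {
            if (candidate.type != target.type) {
                // Type boundary: stop searching but keep the result so far.
                // Resetting best to null here would reproduce the bug.
                break;
            }
            // Assumed cost: the size difference stands in for delta length.
            int cost = Math.abs(candidate.size - target.size);
            if (cost < bestCost) {
                bestCost = cost;
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Entry blob2 = new Entry("blob2", Type.BLOB, 120);
        Entry[] window = {
            new Entry("blob1", Type.BLOB, 100),
            new Entry("tree1", Type.TREE, 40),
            new Entry("tree0", Type.TREE, 38),
        };
        Entry base = search(blob2, window);
        System.out.println(base == null ? "no base" : base.name); // prints blob1
    }
}
```

The window in main mirrors the diagram: blob1 is examined first, and the mismatch on tree1 ends the loop without discarding blob1 as the chosen base.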
Correct distribution of allowed delta size along chain length

Nicolas Pitre discovered a very simple rule for selecting between two different delta base candidates:

  - if based on a whole object, must be <= 50% of target
  - if at end of a chain, must be <= 1/depth * 50% of target

The rule penalizes deltas near the end of the chain, requiring them to be very small in order to be kept by the packer.

This favors deltas that are based on a shorter chain, where the read-time unpack cost is much lower. Fewer bytes need to be consulted from the source pack file, and less copying is required in memory to rebuild the object.

Junio Hamano explained Nico's rule to me today, and this commit fixes DeltaWindow to implement it as described. When no base has been chosen the computation is simply the statements denoted above. However once a base with depth of 9 has been chosen (e.g. when pack.depth is limited to 10), a non-delta source may create a new delta that is up to 10x larger than the already selected base. This reflects the intent of Nico's size distribution rule no matter what order objects are visited in the DeltaWindow.

With this patch and my other patches applied, repacking JGit with:

  [pack]
    reuseObjects = false
    reuseDeltas = false
    depth = 50
    window = 250
    threads = 4
    compression = 9

  CGit (all)  5,711,735 bytes; real 0m13.942s user 0m47.722s [1]
  JGit heads  5,718,295 bytes; real 0m11.880s user 0m38.177s [2]
       rest       9,809 bytes

The improved JGit result for the head pack is only 6.4 KiB larger than CGit's resulting pack. This patch allowed JGit to find an additional 39.7 KiB worth of space savings. JGit now also often runs 2s faster than CGit, despite also creating bitmaps and pruning objects after the head pack creation.

[1] time git repack -a -d -F --window=250 --depth=50
[2] time java -Xmx128m -jar jgit debug-gc

Change-Id: I5caec31359bf7248cabdd2a3254c84d4ee3cd96b
11 years ago
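The size rule above can be written down as a small sketch. The two endpoints (50% of the target for a whole-object base, 1/depth of that at the end of a chain) come from the message; the linear scaling in between, along with the class and method names, is an assumption made here for illustration and is not taken from DeltaWindow.

```java
public class DeltaSizeLimitSketch {
    /**
     * Largest delta, in bytes, that would be accepted for a target of
     * targetSize when the candidate base already sits baseDepth levels
     * deep in a chain limited to maxDepth.
     *
     * Assumption: the limit shrinks linearly with the depth already used.
     */
    static long deltaSizeLimit(long targetSize, int baseDepth, int maxDepth) {
        long half = targetSize >>> 1;            // 50% of the target
        int remaining = maxDepth - baseDepth;    // chain levels still free
        return (half * remaining) / maxDepth;
    }

    public static void main(String[] args) {
        // Whole-object base (depth 0) with pack.depth = 10.
        System.out.println(deltaSizeLimit(1000, 0, 10)); // prints 500
        // Base at depth 9, the last usable level of the chain.
        System.out.println(deltaSizeLimit(1000, 9, 10)); // prints 50
    }
}
```

With these numbers a whole-object base may yield a delta up to 10x larger than one accepted at depth 9, which matches the ratio described in the message for pack.depth limited to 10.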
Correct distribution of allowed delta size along chain length Nicolas Pitre discovered a very simple rule for selecting between two different delta base candidates: - if based whole object, must be <= 50% of target - if at end of a chain, must be <= 1/depth * 50% of target The rule penalizes deltas near the end of the chain, requiring them to be very small in order to be kept by the packer. This favors deltas that are based on a shorter chain, where the read-time unpack cost is much lower. Fewer bytes need to be consulted from the source pack file, and less copying is required in memory to rebuild the object. Junio Hamano explained Nico's rule to me today, and this commit fixes DeltaWindow to implement it as described. When no base has been chosen the computation is simply the statements denoted above. However once a base with depth of 9 has been chosen (e.g. when pack.depth is limited to 10), a non-delta source may create a new delta that is up to 10x larger than the already selected base. This reflects the intent of Nico's size distribution rule no matter what order objects are visited in the DeltaWindow. With this patch and my other patches applied, repacking JGit with: [pack] reuseObjects = false reuseDeltas = false depth = 50 window = 250 threads = 4 compression = 9 CGit (all) 5,711,735 bytes; real 0m13.942s user 0m47.722s [1] JGit heads 5,718,295 bytes; real 0m11.880s user 0m38.177s [2] rest 9,809 bytes The improved JGit result for the head pack is only 6.4 KiB larger than CGit's resulting pack. This patch allowed JGit to find an additional 39.7 KiB worth of space savings. JGit now also often runs 2s faster than CGit, despite also creating bitmaps and pruning objects after the head pack creation. [1] time git repack -a -d -F --window=250 --depth=50 [2] time java -Xmx128m -jar jgit debug-gc Change-Id: I5caec31359bf7248cabdd2a3254c84d4ee3cd96b
11 years ago
Implement delta generation during packing PackWriter now produces new deltas if there is not a suitable delta available for reuse from an existing pack file. This permits JGit to send less data on the wire by sending a delta relative to an object the other side already has, instead of sending the whole object. The delta searching algorithm is similar in style to what C Git uses, but apparently has some differences (see below for more on). Briefly, objects that should be considered for delta compression are pushed onto a list. This list is then sorted by a rough similarity score, which is derived from the path name the object was discovered at in the repository during object counting. The list is then walked in order. At each position in the list, up to $WINDOW objects prior to it are attempted as delta bases. Each object in the window is tried, and the shortest delta instruction sequence selects the base object. Some rough rules are used to prevent pathological behavior during this matching phase, like skipping pairings of objects that are not similar enough in size. PackWriter intentionally excludes commits and annotated tags from this new delta search phase. In the JGit repository only 28 out of 2600+ commits can be delta compressed by C Git. As the commit count tends to be a fair percentage of the total number of objects in the repository, and they generally do not delta compress well, skipping over them can improve performance with little increase in the output pack size. Because this implementation was rebuilt from scratch based on my own memory of how the packing algorithm has evolved over the years in C Git, PackWriter, DeltaWindow, and DeltaEncoder don't use exactly the same rules everywhere, and that leads JGit to produce different (but logically equivalent) pack files. 
Repository | Pack Size (bytes) | Packing Time | JGit - CGit = Difference | JGit / CGit -----------+----------------------------------+----------------- git | 25094348 - 24322890 = +771458 | 59.434s / 59.133s jgit | 5669515 - 5709046 = - 39531 | 6.654s / 6.806s linux-2.6 | 389M - 386M = +3M | 20m02s / 18m01s For the above tests pack.threads was set to 1, window size=10, delta depth=50, and delta and object reuse was disabled for both implementations. Both implementations were reading from an already fully packed repository on local disk. The running time reported is after 1 warm-up run of the tested implementation. PackWriter is writing 771 KiB more data on git.git, 3M more on linux-2.6, but is actually 39.5 KiB smaller on jgit.git. Being larger by less than 0.7% on linux-2.6 isn't bad, nor is taking an extra 2 minutes to pack. On the running time side, JGit is at a major disadvantage because linux-2.6 doesn't fit into the default WindowCache of 20M, while C Git is able to mmap the entire pack and have it available instantly in physical memory (assuming hot cache). CGit also has a feature where it caches deltas that were created during the compression phase, and uses those cached deltas during the writing phase. PackWriter does not implement this (yet), and therefore must create every delta twice. This could easily account for the increased running time we are seeing. Change-Id: I6292edc66c2e95fbe45b519b65fdb3918068889c Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
14 years ago
Implement delta generation during packing PackWriter now produces new deltas if there is not a suitable delta available for reuse from an existing pack file. This permits JGit to send less data on the wire by sending a delta relative to an object the other side already has, instead of sending the whole object. The delta searching algorithm is similar in style to what C Git uses, but apparently has some differences (see below for more on). Briefly, objects that should be considered for delta compression are pushed onto a list. This list is then sorted by a rough similarity score, which is derived from the path name the object was discovered at in the repository during object counting. The list is then walked in order. At each position in the list, up to $WINDOW objects prior to it are attempted as delta bases. Each object in the window is tried, and the shortest delta instruction sequence selects the base object. Some rough rules are used to prevent pathological behavior during this matching phase, like skipping pairings of objects that are not similar enough in size. PackWriter intentionally excludes commits and annotated tags from this new delta search phase. In the JGit repository only 28 out of 2600+ commits can be delta compressed by C Git. As the commit count tends to be a fair percentage of the total number of objects in the repository, and they generally do not delta compress well, skipping over them can improve performance with little increase in the output pack size. Because this implementation was rebuilt from scratch based on my own memory of how the packing algorithm has evolved over the years in C Git, PackWriter, DeltaWindow, and DeltaEncoder don't use exactly the same rules everywhere, and that leads JGit to produce different (but logically equivalent) pack files. 
Repository | Pack Size (bytes) | Packing Time | JGit - CGit = Difference | JGit / CGit -----------+----------------------------------+----------------- git | 25094348 - 24322890 = +771458 | 59.434s / 59.133s jgit | 5669515 - 5709046 = - 39531 | 6.654s / 6.806s linux-2.6 | 389M - 386M = +3M | 20m02s / 18m01s For the above tests pack.threads was set to 1, window size=10, delta depth=50, and delta and object reuse was disabled for both implementations. Both implementations were reading from an already fully packed repository on local disk. The running time reported is after 1 warm-up run of the tested implementation. PackWriter is writing 771 KiB more data on git.git, 3M more on linux-2.6, but is actually 39.5 KiB smaller on jgit.git. Being larger by less than 0.7% on linux-2.6 isn't bad, nor is taking an extra 2 minutes to pack. On the running time side, JGit is at a major disadvantage because linux-2.6 doesn't fit into the default WindowCache of 20M, while C Git is able to mmap the entire pack and have it available instantly in physical memory (assuming hot cache). CGit also has a feature where it caches deltas that were created during the compression phase, and uses those cached deltas during the writing phase. PackWriter does not implement this (yet), and therefore must create every delta twice. This could easily account for the increased running time we are seeing. Change-Id: I6292edc66c2e95fbe45b519b65fdb3918068889c Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
14 years ago
Implement delta generation during packing PackWriter now produces new deltas if there is not a suitable delta available for reuse from an existing pack file. This permits JGit to send less data on the wire by sending a delta relative to an object the other side already has, instead of sending the whole object. The delta searching algorithm is similar in style to what C Git uses, but apparently has some differences (see below for more on). Briefly, objects that should be considered for delta compression are pushed onto a list. This list is then sorted by a rough similarity score, which is derived from the path name the object was discovered at in the repository during object counting. The list is then walked in order. At each position in the list, up to $WINDOW objects prior to it are attempted as delta bases. Each object in the window is tried, and the shortest delta instruction sequence selects the base object. Some rough rules are used to prevent pathological behavior during this matching phase, like skipping pairings of objects that are not similar enough in size. PackWriter intentionally excludes commits and annotated tags from this new delta search phase. In the JGit repository only 28 out of 2600+ commits can be delta compressed by C Git. As the commit count tends to be a fair percentage of the total number of objects in the repository, and they generally do not delta compress well, skipping over them can improve performance with little increase in the output pack size. Because this implementation was rebuilt from scratch based on my own memory of how the packing algorithm has evolved over the years in C Git, PackWriter, DeltaWindow, and DeltaEncoder don't use exactly the same rules everywhere, and that leads JGit to produce different (but logically equivalent) pack files. 
Repository | Pack Size (bytes) | Packing Time | JGit - CGit = Difference | JGit / CGit -----------+----------------------------------+----------------- git | 25094348 - 24322890 = +771458 | 59.434s / 59.133s jgit | 5669515 - 5709046 = - 39531 | 6.654s / 6.806s linux-2.6 | 389M - 386M = +3M | 20m02s / 18m01s For the above tests pack.threads was set to 1, window size=10, delta depth=50, and delta and object reuse was disabled for both implementations. Both implementations were reading from an already fully packed repository on local disk. The running time reported is after 1 warm-up run of the tested implementation. PackWriter is writing 771 KiB more data on git.git, 3M more on linux-2.6, but is actually 39.5 KiB smaller on jgit.git. Being larger by less than 0.7% on linux-2.6 isn't bad, nor is taking an extra 2 minutes to pack. On the running time side, JGit is at a major disadvantage because linux-2.6 doesn't fit into the default WindowCache of 20M, while C Git is able to mmap the entire pack and have it available instantly in physical memory (assuming hot cache). CGit also has a feature where it caches deltas that were created during the compression phase, and uses those cached deltas during the writing phase. PackWriter does not implement this (yet), and therefore must create every delta twice. This could easily account for the increased running time we are seeing. Change-Id: I6292edc66c2e95fbe45b519b65fdb3918068889c Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
14 years ago
Correct distribution of allowed delta size along chain length Nicolas Pitre discovered a very simple rule for selecting between two different delta base candidates: - if based whole object, must be <= 50% of target - if at end of a chain, must be <= 1/depth * 50% of target The rule penalizes deltas near the end of the chain, requiring them to be very small in order to be kept by the packer. This favors deltas that are based on a shorter chain, where the read-time unpack cost is much lower. Fewer bytes need to be consulted from the source pack file, and less copying is required in memory to rebuild the object. Junio Hamano explained Nico's rule to me today, and this commit fixes DeltaWindow to implement it as described. When no base has been chosen the computation is simply the statements denoted above. However once a base with depth of 9 has been chosen (e.g. when pack.depth is limited to 10), a non-delta source may create a new delta that is up to 10x larger than the already selected base. This reflects the intent of Nico's size distribution rule no matter what order objects are visited in the DeltaWindow. With this patch and my other patches applied, repacking JGit with: [pack] reuseObjects = false reuseDeltas = false depth = 50 window = 250 threads = 4 compression = 9 CGit (all) 5,711,735 bytes; real 0m13.942s user 0m47.722s [1] JGit heads 5,718,295 bytes; real 0m11.880s user 0m38.177s [2] rest 9,809 bytes The improved JGit result for the head pack is only 6.4 KiB larger than CGit's resulting pack. This patch allowed JGit to find an additional 39.7 KiB worth of space savings. JGit now also often runs 2s faster than CGit, despite also creating bitmaps and pruning objects after the head pack creation. [1] time git repack -a -d -F --window=250 --depth=50 [2] time java -Xmx128m -jar jgit debug-gc Change-Id: I5caec31359bf7248cabdd2a3254c84d4ee3cd96b
11 years ago
Implement delta generation during packing PackWriter now produces new deltas if there is not a suitable delta available for reuse from an existing pack file. This permits JGit to send less data on the wire by sending a delta relative to an object the other side already has, instead of sending the whole object. The delta searching algorithm is similar in style to what C Git uses, but apparently has some differences (see below for more on). Briefly, objects that should be considered for delta compression are pushed onto a list. This list is then sorted by a rough similarity score, which is derived from the path name the object was discovered at in the repository during object counting. The list is then walked in order. At each position in the list, up to $WINDOW objects prior to it are attempted as delta bases. Each object in the window is tried, and the shortest delta instruction sequence selects the base object. Some rough rules are used to prevent pathological behavior during this matching phase, like skipping pairings of objects that are not similar enough in size. PackWriter intentionally excludes commits and annotated tags from this new delta search phase. In the JGit repository only 28 out of 2600+ commits can be delta compressed by C Git. As the commit count tends to be a fair percentage of the total number of objects in the repository, and they generally do not delta compress well, skipping over them can improve performance with little increase in the output pack size. Because this implementation was rebuilt from scratch based on my own memory of how the packing algorithm has evolved over the years in C Git, PackWriter, DeltaWindow, and DeltaEncoder don't use exactly the same rules everywhere, and that leads JGit to produce different (but logically equivalent) pack files. 
Repository | Pack Size (bytes) | Packing Time | JGit - CGit = Difference | JGit / CGit -----------+----------------------------------+----------------- git | 25094348 - 24322890 = +771458 | 59.434s / 59.133s jgit | 5669515 - 5709046 = - 39531 | 6.654s / 6.806s linux-2.6 | 389M - 386M = +3M | 20m02s / 18m01s For the above tests pack.threads was set to 1, window size=10, delta depth=50, and delta and object reuse was disabled for both implementations. Both implementations were reading from an already fully packed repository on local disk. The running time reported is after 1 warm-up run of the tested implementation. PackWriter is writing 771 KiB more data on git.git, 3M more on linux-2.6, but is actually 39.5 KiB smaller on jgit.git. Being larger by less than 0.7% on linux-2.6 isn't bad, nor is taking an extra 2 minutes to pack. On the running time side, JGit is at a major disadvantage because linux-2.6 doesn't fit into the default WindowCache of 20M, while C Git is able to mmap the entire pack and have it available instantly in physical memory (assuming hot cache). CGit also has a feature where it caches deltas that were created during the compression phase, and uses those cached deltas during the writing phase. PackWriter does not implement this (yet), and therefore must create every delta twice. This could easily account for the increased running time we are seeing. Change-Id: I6292edc66c2e95fbe45b519b65fdb3918068889c Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
14 years ago
Implement delta generation during packing PackWriter now produces new deltas if there is not a suitable delta available for reuse from an existing pack file. This permits JGit to send less data on the wire by sending a delta relative to an object the other side already has, instead of sending the whole object. The delta searching algorithm is similar in style to what C Git uses, but apparently has some differences (see below for more on). Briefly, objects that should be considered for delta compression are pushed onto a list. This list is then sorted by a rough similarity score, which is derived from the path name the object was discovered at in the repository during object counting. The list is then walked in order. At each position in the list, up to $WINDOW objects prior to it are attempted as delta bases. Each object in the window is tried, and the shortest delta instruction sequence selects the base object. Some rough rules are used to prevent pathological behavior during this matching phase, like skipping pairings of objects that are not similar enough in size. PackWriter intentionally excludes commits and annotated tags from this new delta search phase. In the JGit repository only 28 out of 2600+ commits can be delta compressed by C Git. As the commit count tends to be a fair percentage of the total number of objects in the repository, and they generally do not delta compress well, skipping over them can improve performance with little increase in the output pack size. Because this implementation was rebuilt from scratch based on my own memory of how the packing algorithm has evolved over the years in C Git, PackWriter, DeltaWindow, and DeltaEncoder don't use exactly the same rules everywhere, and that leads JGit to produce different (but logically equivalent) pack files. 
Repository | Pack Size (bytes) | Packing Time | JGit - CGit = Difference | JGit / CGit -----------+----------------------------------+----------------- git | 25094348 - 24322890 = +771458 | 59.434s / 59.133s jgit | 5669515 - 5709046 = - 39531 | 6.654s / 6.806s linux-2.6 | 389M - 386M = +3M | 20m02s / 18m01s For the above tests pack.threads was set to 1, window size=10, delta depth=50, and delta and object reuse was disabled for both implementations. Both implementations were reading from an already fully packed repository on local disk. The running time reported is after 1 warm-up run of the tested implementation. PackWriter is writing 771 KiB more data on git.git, 3M more on linux-2.6, but is actually 39.5 KiB smaller on jgit.git. Being larger by less than 0.7% on linux-2.6 isn't bad, nor is taking an extra 2 minutes to pack. On the running time side, JGit is at a major disadvantage because linux-2.6 doesn't fit into the default WindowCache of 20M, while C Git is able to mmap the entire pack and have it available instantly in physical memory (assuming hot cache). CGit also has a feature where it caches deltas that were created during the compression phase, and uses those cached deltas during the writing phase. PackWriter does not implement this (yet), and therefore must create every delta twice. This could easily account for the increased running time we are seeing. Change-Id: I6292edc66c2e95fbe45b519b65fdb3918068889c Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
14 years ago
Correct distribution of allowed delta size along chain length Nicolas Pitre discovered a very simple rule for selecting between two different delta base candidates: - if based whole object, must be <= 50% of target - if at end of a chain, must be <= 1/depth * 50% of target The rule penalizes deltas near the end of the chain, requiring them to be very small in order to be kept by the packer. This favors deltas that are based on a shorter chain, where the read-time unpack cost is much lower. Fewer bytes need to be consulted from the source pack file, and less copying is required in memory to rebuild the object. Junio Hamano explained Nico's rule to me today, and this commit fixes DeltaWindow to implement it as described. When no base has been chosen the computation is simply the statements denoted above. However once a base with depth of 9 has been chosen (e.g. when pack.depth is limited to 10), a non-delta source may create a new delta that is up to 10x larger than the already selected base. This reflects the intent of Nico's size distribution rule no matter what order objects are visited in the DeltaWindow. With this patch and my other patches applied, repacking JGit with: [pack] reuseObjects = false reuseDeltas = false depth = 50 window = 250 threads = 4 compression = 9 CGit (all) 5,711,735 bytes; real 0m13.942s user 0m47.722s [1] JGit heads 5,718,295 bytes; real 0m11.880s user 0m38.177s [2] rest 9,809 bytes The improved JGit result for the head pack is only 6.4 KiB larger than CGit's resulting pack. This patch allowed JGit to find an additional 39.7 KiB worth of space savings. JGit now also often runs 2s faster than CGit, despite also creating bitmaps and pruning objects after the head pack creation. [1] time git repack -a -d -F --window=250 --depth=50 [2] time java -Xmx128m -jar jgit debug-gc Change-Id: I5caec31359bf7248cabdd2a3254c84d4ee3cd96b
11 years ago
Implement delta generation during packing PackWriter now produces new deltas if there is not a suitable delta available for reuse from an existing pack file. This permits JGit to send less data on the wire by sending a delta relative to an object the other side already has, instead of sending the whole object. The delta searching algorithm is similar in style to what C Git uses, but apparently has some differences (see below for more on). Briefly, objects that should be considered for delta compression are pushed onto a list. This list is then sorted by a rough similarity score, which is derived from the path name the object was discovered at in the repository during object counting. The list is then walked in order. At each position in the list, up to $WINDOW objects prior to it are attempted as delta bases. Each object in the window is tried, and the shortest delta instruction sequence selects the base object. Some rough rules are used to prevent pathological behavior during this matching phase, like skipping pairings of objects that are not similar enough in size. PackWriter intentionally excludes commits and annotated tags from this new delta search phase. In the JGit repository only 28 out of 2600+ commits can be delta compressed by C Git. As the commit count tends to be a fair percentage of the total number of objects in the repository, and they generally do not delta compress well, skipping over them can improve performance with little increase in the output pack size. Because this implementation was rebuilt from scratch based on my own memory of how the packing algorithm has evolved over the years in C Git, PackWriter, DeltaWindow, and DeltaEncoder don't use exactly the same rules everywhere, and that leads JGit to produce different (but logically equivalent) pack files. 
Repository | Pack Size (bytes) | Packing Time | JGit - CGit = Difference | JGit / CGit -----------+----------------------------------+----------------- git | 25094348 - 24322890 = +771458 | 59.434s / 59.133s jgit | 5669515 - 5709046 = - 39531 | 6.654s / 6.806s linux-2.6 | 389M - 386M = +3M | 20m02s / 18m01s For the above tests pack.threads was set to 1, window size=10, delta depth=50, and delta and object reuse was disabled for both implementations. Both implementations were reading from an already fully packed repository on local disk. The running time reported is after 1 warm-up run of the tested implementation. PackWriter is writing 771 KiB more data on git.git, 3M more on linux-2.6, but is actually 39.5 KiB smaller on jgit.git. Being larger by less than 0.7% on linux-2.6 isn't bad, nor is taking an extra 2 minutes to pack. On the running time side, JGit is at a major disadvantage because linux-2.6 doesn't fit into the default WindowCache of 20M, while C Git is able to mmap the entire pack and have it available instantly in physical memory (assuming hot cache). CGit also has a feature where it caches deltas that were created during the compression phase, and uses those cached deltas during the writing phase. PackWriter does not implement this (yet), and therefore must create every delta twice. This could easily account for the increased running time we are seeing. Change-Id: I6292edc66c2e95fbe45b519b65fdb3918068889c Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
14 years ago
Correct distribution of allowed delta size along chain length Nicolas Pitre discovered a very simple rule for selecting between two different delta base candidates: - if based whole object, must be <= 50% of target - if at end of a chain, must be <= 1/depth * 50% of target The rule penalizes deltas near the end of the chain, requiring them to be very small in order to be kept by the packer. This favors deltas that are based on a shorter chain, where the read-time unpack cost is much lower. Fewer bytes need to be consulted from the source pack file, and less copying is required in memory to rebuild the object. Junio Hamano explained Nico's rule to me today, and this commit fixes DeltaWindow to implement it as described. When no base has been chosen the computation is simply the statements denoted above. However once a base with depth of 9 has been chosen (e.g. when pack.depth is limited to 10), a non-delta source may create a new delta that is up to 10x larger than the already selected base. This reflects the intent of Nico's size distribution rule no matter what order objects are visited in the DeltaWindow. With this patch and my other patches applied, repacking JGit with: [pack] reuseObjects = false reuseDeltas = false depth = 50 window = 250 threads = 4 compression = 9 CGit (all) 5,711,735 bytes; real 0m13.942s user 0m47.722s [1] JGit heads 5,718,295 bytes; real 0m11.880s user 0m38.177s [2] rest 9,809 bytes The improved JGit result for the head pack is only 6.4 KiB larger than CGit's resulting pack. This patch allowed JGit to find an additional 39.7 KiB worth of space savings. JGit now also often runs 2s faster than CGit, despite also creating bitmaps and pruning objects after the head pack creation. [1] time git repack -a -d -F --window=250 --depth=50 [2] time java -Xmx128m -jar jgit debug-gc Change-Id: I5caec31359bf7248cabdd2a3254c84d4ee3cd96b
11 years ago
Correct distribution of allowed delta size along chain length

Nicolas Pitre discovered a very simple rule for selecting between two
different delta base candidates:

  - if based whole object, must be <= 50% of target
  - if at end of a chain, must be <= 1/depth * 50% of target

The rule penalizes deltas near the end of the chain, requiring them to
be very small in order to be kept by the packer. This favors deltas
that are based on a shorter chain, where the read-time unpack cost is
much lower. Fewer bytes need to be consulted from the source pack file,
and less copying is required in memory to rebuild the object.

Junio Hamano explained Nico's rule to me today, and this commit fixes
DeltaWindow to implement it as described. When no base has been chosen
the computation is simply the statements denoted above. However once a
base with depth of 9 has been chosen (e.g. when pack.depth is limited
to 10), a non-delta source may create a new delta that is up to 10x
larger than the already selected base. This reflects the intent of
Nico's size distribution rule no matter what order objects are visited
in the DeltaWindow.

With this patch and my other patches applied, repacking JGit with:

  [pack]
    reuseObjects = false
    reuseDeltas = false
    depth = 50
    window = 250
    threads = 4
    compression = 9

  CGit (all)  5,711,735 bytes; real 0m13.942s user 0m47.722s [1]
  JGit heads  5,718,295 bytes; real 0m11.880s user 0m38.177s [2]
       rest       9,809 bytes

The improved JGit result for the head pack is only 6.4 KiB larger than
CGit's resulting pack. This patch allowed JGit to find an additional
39.7 KiB worth of space savings. JGit now also often runs 2s faster
than CGit, despite also creating bitmaps and pruning objects after the
head pack creation.

[1] time git repack -a -d -F --window=250 --depth=50
[2] time java -Xmx128m -jar jgit debug-gc

Change-Id: I5caec31359bf7248cabdd2a3254c84d4ee3cd96b
11 years ago
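The size rule described in the commit message above can be sketched in a few lines of Java. This is only an illustration of the arithmetic as the message states it, not the actual DeltaWindow code: the class name `DeltaSizeLimitSketch`, both method names, and all parameter names are hypothetical, and the sketch assumes every depth passed in is strictly less than the maximum depth.

```java
// Hypothetical sketch of the delta size limit rule described in the
// commit message; names and structure are assumptions, not JGit's API.
public final class DeltaSizeLimitSketch {

    /**
     * Limit used while no base has been chosen yet.
     * A delta against a whole object (srcDepth = 0) may be up to 50% of
     * the target; the allowance shrinks linearly as the source sits
     * deeper in a chain. Assumes 0 <= srcDepth < maxDepth.
     */
    public static long initialLimit(long targetSize, int srcDepth, int maxDepth) {
        long half = targetSize / 2;
        return half * (maxDepth - srcDepth) / maxDepth;
    }

    /**
     * Limit used once a best base is already selected.
     * A new delta must be "better" than the current one after scaling
     * by the remaining depth of both candidates.
     * Assumes both depths are >= 0 and < maxDepth.
     */
    public static long limitAgainstBest(long bestDeltaSize, int bestBaseDepth,
            int srcDepth, int maxDepth) {
        return bestDeltaSize * (maxDepth - srcDepth) / (maxDepth - bestBaseDepth);
    }

    public static void main(String[] args) {
        int maxDepth = 10;
        // Whole-object source: half of a 1000 byte target.
        System.out.println(initialLimit(1000, 0, maxDepth));
        // Source at depth 9 of 10: one tenth of that half.
        System.out.println(initialLimit(1000, 9, maxDepth));
        // Current best base is at depth 9, new source is a whole object:
        // the new delta may be up to 10x the size of the current best.
        System.out.println(limitAgainstBest(40, 9, 0, maxDepth));
    }
}
```

With a maximum depth of 10, the second method reproduces the "up to 10x larger" case from the message: a whole-object source competing against a best base at depth 9 gets a limit of ten times the current best delta size.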
Implement delta generation during packing PackWriter now produces new deltas if there is not a suitable delta available for reuse from an existing pack file. This permits JGit to send less data on the wire by sending a delta relative to an object the other side already has, instead of sending the whole object. The delta searching algorithm is similar in style to what C Git uses, but apparently has some differences (see below for more on). Briefly, objects that should be considered for delta compression are pushed onto a list. This list is then sorted by a rough similarity score, which is derived from the path name the object was discovered at in the repository during object counting. The list is then walked in order. At each position in the list, up to $WINDOW objects prior to it are attempted as delta bases. Each object in the window is tried, and the shortest delta instruction sequence selects the base object. Some rough rules are used to prevent pathological behavior during this matching phase, like skipping pairings of objects that are not similar enough in size. PackWriter intentionally excludes commits and annotated tags from this new delta search phase. In the JGit repository only 28 out of 2600+ commits can be delta compressed by C Git. As the commit count tends to be a fair percentage of the total number of objects in the repository, and they generally do not delta compress well, skipping over them can improve performance with little increase in the output pack size. Because this implementation was rebuilt from scratch based on my own memory of how the packing algorithm has evolved over the years in C Git, PackWriter, DeltaWindow, and DeltaEncoder don't use exactly the same rules everywhere, and that leads JGit to produce different (but logically equivalent) pack files. 
Repository | Pack Size (bytes) | Packing Time | JGit - CGit = Difference | JGit / CGit -----------+----------------------------------+----------------- git | 25094348 - 24322890 = +771458 | 59.434s / 59.133s jgit | 5669515 - 5709046 = - 39531 | 6.654s / 6.806s linux-2.6 | 389M - 386M = +3M | 20m02s / 18m01s For the above tests pack.threads was set to 1, window size=10, delta depth=50, and delta and object reuse was disabled for both implementations. Both implementations were reading from an already fully packed repository on local disk. The running time reported is after 1 warm-up run of the tested implementation. PackWriter is writing 771 KiB more data on git.git, 3M more on linux-2.6, but is actually 39.5 KiB smaller on jgit.git. Being larger by less than 0.7% on linux-2.6 isn't bad, nor is taking an extra 2 minutes to pack. On the running time side, JGit is at a major disadvantage because linux-2.6 doesn't fit into the default WindowCache of 20M, while C Git is able to mmap the entire pack and have it available instantly in physical memory (assuming hot cache). CGit also has a feature where it caches deltas that were created during the compression phase, and uses those cached deltas during the writing phase. PackWriter does not implement this (yet), and therefore must create every delta twice. This could easily account for the increased running time we are seeing. Change-Id: I6292edc66c2e95fbe45b519b65fdb3918068889c Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
14 years ago
Correct distribution of allowed delta size along chain length Nicolas Pitre discovered a very simple rule for selecting between two different delta base candidates: - if based whole object, must be <= 50% of target - if at end of a chain, must be <= 1/depth * 50% of target The rule penalizes deltas near the end of the chain, requiring them to be very small in order to be kept by the packer. This favors deltas that are based on a shorter chain, where the read-time unpack cost is much lower. Fewer bytes need to be consulted from the source pack file, and less copying is required in memory to rebuild the object. Junio Hamano explained Nico's rule to me today, and this commit fixes DeltaWindow to implement it as described. When no base has been chosen the computation is simply the statements denoted above. However once a base with depth of 9 has been chosen (e.g. when pack.depth is limited to 10), a non-delta source may create a new delta that is up to 10x larger than the already selected base. This reflects the intent of Nico's size distribution rule no matter what order objects are visited in the DeltaWindow. With this patch and my other patches applied, repacking JGit with: [pack] reuseObjects = false reuseDeltas = false depth = 50 window = 250 threads = 4 compression = 9 CGit (all) 5,711,735 bytes; real 0m13.942s user 0m47.722s [1] JGit heads 5,718,295 bytes; real 0m11.880s user 0m38.177s [2] rest 9,809 bytes The improved JGit result for the head pack is only 6.4 KiB larger than CGit's resulting pack. This patch allowed JGit to find an additional 39.7 KiB worth of space savings. JGit now also often runs 2s faster than CGit, despite also creating bitmaps and pruning objects after the head pack creation. [1] time git repack -a -d -F --window=250 --depth=50 [2] time java -Xmx128m -jar jgit debug-gc Change-Id: I5caec31359bf7248cabdd2a3254c84d4ee3cd96b
11 years ago
Implement delta generation during packing PackWriter now produces new deltas if there is not a suitable delta available for reuse from an existing pack file. This permits JGit to send less data on the wire by sending a delta relative to an object the other side already has, instead of sending the whole object. The delta searching algorithm is similar in style to what C Git uses, but apparently has some differences (see below for more on). Briefly, objects that should be considered for delta compression are pushed onto a list. This list is then sorted by a rough similarity score, which is derived from the path name the object was discovered at in the repository during object counting. The list is then walked in order. At each position in the list, up to $WINDOW objects prior to it are attempted as delta bases. Each object in the window is tried, and the shortest delta instruction sequence selects the base object. Some rough rules are used to prevent pathological behavior during this matching phase, like skipping pairings of objects that are not similar enough in size. PackWriter intentionally excludes commits and annotated tags from this new delta search phase. In the JGit repository only 28 out of 2600+ commits can be delta compressed by C Git. As the commit count tends to be a fair percentage of the total number of objects in the repository, and they generally do not delta compress well, skipping over them can improve performance with little increase in the output pack size. Because this implementation was rebuilt from scratch based on my own memory of how the packing algorithm has evolved over the years in C Git, PackWriter, DeltaWindow, and DeltaEncoder don't use exactly the same rules everywhere, and that leads JGit to produce different (but logically equivalent) pack files. 
Repository | Pack Size (bytes) | Packing Time | JGit - CGit = Difference | JGit / CGit -----------+----------------------------------+----------------- git | 25094348 - 24322890 = +771458 | 59.434s / 59.133s jgit | 5669515 - 5709046 = - 39531 | 6.654s / 6.806s linux-2.6 | 389M - 386M = +3M | 20m02s / 18m01s For the above tests pack.threads was set to 1, window size=10, delta depth=50, and delta and object reuse was disabled for both implementations. Both implementations were reading from an already fully packed repository on local disk. The running time reported is after 1 warm-up run of the tested implementation. PackWriter is writing 771 KiB more data on git.git, 3M more on linux-2.6, but is actually 39.5 KiB smaller on jgit.git. Being larger by less than 0.7% on linux-2.6 isn't bad, nor is taking an extra 2 minutes to pack. On the running time side, JGit is at a major disadvantage because linux-2.6 doesn't fit into the default WindowCache of 20M, while C Git is able to mmap the entire pack and have it available instantly in physical memory (assuming hot cache). CGit also has a feature where it caches deltas that were created during the compression phase, and uses those cached deltas during the writing phase. PackWriter does not implement this (yet), and therefore must create every delta twice. This could easily account for the increased running time we are seeing. Change-Id: I6292edc66c2e95fbe45b519b65fdb3918068889c Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
14 years ago
Correct distribution of allowed delta size along chain length Nicolas Pitre discovered a very simple rule for selecting between two different delta base candidates: - if based whole object, must be <= 50% of target - if at end of a chain, must be <= 1/depth * 50% of target The rule penalizes deltas near the end of the chain, requiring them to be very small in order to be kept by the packer. This favors deltas that are based on a shorter chain, where the read-time unpack cost is much lower. Fewer bytes need to be consulted from the source pack file, and less copying is required in memory to rebuild the object. Junio Hamano explained Nico's rule to me today, and this commit fixes DeltaWindow to implement it as described. When no base has been chosen the computation is simply the statements denoted above. However once a base with depth of 9 has been chosen (e.g. when pack.depth is limited to 10), a non-delta source may create a new delta that is up to 10x larger than the already selected base. This reflects the intent of Nico's size distribution rule no matter what order objects are visited in the DeltaWindow. With this patch and my other patches applied, repacking JGit with: [pack] reuseObjects = false reuseDeltas = false depth = 50 window = 250 threads = 4 compression = 9 CGit (all) 5,711,735 bytes; real 0m13.942s user 0m47.722s [1] JGit heads 5,718,295 bytes; real 0m11.880s user 0m38.177s [2] rest 9,809 bytes The improved JGit result for the head pack is only 6.4 KiB larger than CGit's resulting pack. This patch allowed JGit to find an additional 39.7 KiB worth of space savings. JGit now also often runs 2s faster than CGit, despite also creating bitmaps and pruning objects after the head pack creation. [1] time git repack -a -d -F --window=250 --depth=50 [2] time java -Xmx128m -jar jgit debug-gc Change-Id: I5caec31359bf7248cabdd2a3254c84d4ee3cd96b
11 years ago
Implement delta generation during packing PackWriter now produces new deltas if there is not a suitable delta available for reuse from an existing pack file. This permits JGit to send less data on the wire by sending a delta relative to an object the other side already has, instead of sending the whole object. The delta searching algorithm is similar in style to what C Git uses, but apparently has some differences (see below for more on). Briefly, objects that should be considered for delta compression are pushed onto a list. This list is then sorted by a rough similarity score, which is derived from the path name the object was discovered at in the repository during object counting. The list is then walked in order. At each position in the list, up to $WINDOW objects prior to it are attempted as delta bases. Each object in the window is tried, and the shortest delta instruction sequence selects the base object. Some rough rules are used to prevent pathological behavior during this matching phase, like skipping pairings of objects that are not similar enough in size. PackWriter intentionally excludes commits and annotated tags from this new delta search phase. In the JGit repository only 28 out of 2600+ commits can be delta compressed by C Git. As the commit count tends to be a fair percentage of the total number of objects in the repository, and they generally do not delta compress well, skipping over them can improve performance with little increase in the output pack size. Because this implementation was rebuilt from scratch based on my own memory of how the packing algorithm has evolved over the years in C Git, PackWriter, DeltaWindow, and DeltaEncoder don't use exactly the same rules everywhere, and that leads JGit to produce different (but logically equivalent) pack files. 
Repository | Pack Size (bytes) | Packing Time | JGit - CGit = Difference | JGit / CGit -----------+----------------------------------+----------------- git | 25094348 - 24322890 = +771458 | 59.434s / 59.133s jgit | 5669515 - 5709046 = - 39531 | 6.654s / 6.806s linux-2.6 | 389M - 386M = +3M | 20m02s / 18m01s For the above tests pack.threads was set to 1, window size=10, delta depth=50, and delta and object reuse was disabled for both implementations. Both implementations were reading from an already fully packed repository on local disk. The running time reported is after 1 warm-up run of the tested implementation. PackWriter is writing 771 KiB more data on git.git, 3M more on linux-2.6, but is actually 39.5 KiB smaller on jgit.git. Being larger by less than 0.7% on linux-2.6 isn't bad, nor is taking an extra 2 minutes to pack. On the running time side, JGit is at a major disadvantage because linux-2.6 doesn't fit into the default WindowCache of 20M, while C Git is able to mmap the entire pack and have it available instantly in physical memory (assuming hot cache). CGit also has a feature where it caches deltas that were created during the compression phase, and uses those cached deltas during the writing phase. PackWriter does not implement this (yet), and therefore must create every delta twice. This could easily account for the increased running time we are seeing. Change-Id: I6292edc66c2e95fbe45b519b65fdb3918068889c Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
14 years ago
Correct distribution of allowed delta size along chain length Nicolas Pitre discovered a very simple rule for selecting between two different delta base candidates: - if based whole object, must be <= 50% of target - if at end of a chain, must be <= 1/depth * 50% of target The rule penalizes deltas near the end of the chain, requiring them to be very small in order to be kept by the packer. This favors deltas that are based on a shorter chain, where the read-time unpack cost is much lower. Fewer bytes need to be consulted from the source pack file, and less copying is required in memory to rebuild the object. Junio Hamano explained Nico's rule to me today, and this commit fixes DeltaWindow to implement it as described. When no base has been chosen the computation is simply the statements denoted above. However once a base with depth of 9 has been chosen (e.g. when pack.depth is limited to 10), a non-delta source may create a new delta that is up to 10x larger than the already selected base. This reflects the intent of Nico's size distribution rule no matter what order objects are visited in the DeltaWindow. With this patch and my other patches applied, repacking JGit with: [pack] reuseObjects = false reuseDeltas = false depth = 50 window = 250 threads = 4 compression = 9 CGit (all) 5,711,735 bytes; real 0m13.942s user 0m47.722s [1] JGit heads 5,718,295 bytes; real 0m11.880s user 0m38.177s [2] rest 9,809 bytes The improved JGit result for the head pack is only 6.4 KiB larger than CGit's resulting pack. This patch allowed JGit to find an additional 39.7 KiB worth of space savings. JGit now also often runs 2s faster than CGit, despite also creating bitmaps and pruning objects after the head pack creation. [1] time git repack -a -d -F --window=250 --depth=50 [2] time java -Xmx128m -jar jgit debug-gc Change-Id: I5caec31359bf7248cabdd2a3254c84d4ee3cd96b
11 years ago
Implement delta generation during packing PackWriter now produces new deltas if there is not a suitable delta available for reuse from an existing pack file. This permits JGit to send less data on the wire by sending a delta relative to an object the other side already has, instead of sending the whole object. The delta searching algorithm is similar in style to what C Git uses, but apparently has some differences (see below for more on). Briefly, objects that should be considered for delta compression are pushed onto a list. This list is then sorted by a rough similarity score, which is derived from the path name the object was discovered at in the repository during object counting. The list is then walked in order. At each position in the list, up to $WINDOW objects prior to it are attempted as delta bases. Each object in the window is tried, and the shortest delta instruction sequence selects the base object. Some rough rules are used to prevent pathological behavior during this matching phase, like skipping pairings of objects that are not similar enough in size. PackWriter intentionally excludes commits and annotated tags from this new delta search phase. In the JGit repository only 28 out of 2600+ commits can be delta compressed by C Git. As the commit count tends to be a fair percentage of the total number of objects in the repository, and they generally do not delta compress well, skipping over them can improve performance with little increase in the output pack size. Because this implementation was rebuilt from scratch based on my own memory of how the packing algorithm has evolved over the years in C Git, PackWriter, DeltaWindow, and DeltaEncoder don't use exactly the same rules everywhere, and that leads JGit to produce different (but logically equivalent) pack files. 
Repository | Pack Size (bytes) | Packing Time | JGit - CGit = Difference | JGit / CGit -----------+----------------------------------+----------------- git | 25094348 - 24322890 = +771458 | 59.434s / 59.133s jgit | 5669515 - 5709046 = - 39531 | 6.654s / 6.806s linux-2.6 | 389M - 386M = +3M | 20m02s / 18m01s For the above tests pack.threads was set to 1, window size=10, delta depth=50, and delta and object reuse was disabled for both implementations. Both implementations were reading from an already fully packed repository on local disk. The running time reported is after 1 warm-up run of the tested implementation. PackWriter is writing 771 KiB more data on git.git, 3M more on linux-2.6, but is actually 39.5 KiB smaller on jgit.git. Being larger by less than 0.7% on linux-2.6 isn't bad, nor is taking an extra 2 minutes to pack. On the running time side, JGit is at a major disadvantage because linux-2.6 doesn't fit into the default WindowCache of 20M, while C Git is able to mmap the entire pack and have it available instantly in physical memory (assuming hot cache). CGit also has a feature where it caches deltas that were created during the compression phase, and uses those cached deltas during the writing phase. PackWriter does not implement this (yet), and therefore must create every delta twice. This could easily account for the increased running time we are seeing. Change-Id: I6292edc66c2e95fbe45b519b65fdb3918068889c Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
14 years ago
Implement delta generation during packing PackWriter now produces new deltas if there is not a suitable delta available for reuse from an existing pack file. This permits JGit to send less data on the wire by sending a delta relative to an object the other side already has, instead of sending the whole object. The delta searching algorithm is similar in style to what C Git uses, but apparently has some differences (see below for more on). Briefly, objects that should be considered for delta compression are pushed onto a list. This list is then sorted by a rough similarity score, which is derived from the path name the object was discovered at in the repository during object counting. The list is then walked in order. At each position in the list, up to $WINDOW objects prior to it are attempted as delta bases. Each object in the window is tried, and the shortest delta instruction sequence selects the base object. Some rough rules are used to prevent pathological behavior during this matching phase, like skipping pairings of objects that are not similar enough in size. PackWriter intentionally excludes commits and annotated tags from this new delta search phase. In the JGit repository only 28 out of 2600+ commits can be delta compressed by C Git. As the commit count tends to be a fair percentage of the total number of objects in the repository, and they generally do not delta compress well, skipping over them can improve performance with little increase in the output pack size. Because this implementation was rebuilt from scratch based on my own memory of how the packing algorithm has evolved over the years in C Git, PackWriter, DeltaWindow, and DeltaEncoder don't use exactly the same rules everywhere, and that leads JGit to produce different (but logically equivalent) pack files. 
Repository | Pack Size (bytes) | Packing Time | JGit - CGit = Difference | JGit / CGit -----------+----------------------------------+----------------- git | 25094348 - 24322890 = +771458 | 59.434s / 59.133s jgit | 5669515 - 5709046 = - 39531 | 6.654s / 6.806s linux-2.6 | 389M - 386M = +3M | 20m02s / 18m01s For the above tests pack.threads was set to 1, window size=10, delta depth=50, and delta and object reuse was disabled for both implementations. Both implementations were reading from an already fully packed repository on local disk. The running time reported is after 1 warm-up run of the tested implementation. PackWriter is writing 771 KiB more data on git.git, 3M more on linux-2.6, but is actually 39.5 KiB smaller on jgit.git. Being larger by less than 0.7% on linux-2.6 isn't bad, nor is taking an extra 2 minutes to pack. On the running time side, JGit is at a major disadvantage because linux-2.6 doesn't fit into the default WindowCache of 20M, while C Git is able to mmap the entire pack and have it available instantly in physical memory (assuming hot cache). CGit also has a feature where it caches deltas that were created during the compression phase, and uses those cached deltas during the writing phase. PackWriter does not implement this (yet), and therefore must create every delta twice. This could easily account for the increased running time we are seeing. Change-Id: I6292edc66c2e95fbe45b519b65fdb3918068889c Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
14 years ago
Correct distribution of allowed delta size along chain length Nicolas Pitre discovered a very simple rule for selecting between two different delta base candidates: - if based whole object, must be <= 50% of target - if at end of a chain, must be <= 1/depth * 50% of target The rule penalizes deltas near the end of the chain, requiring them to be very small in order to be kept by the packer. This favors deltas that are based on a shorter chain, where the read-time unpack cost is much lower. Fewer bytes need to be consulted from the source pack file, and less copying is required in memory to rebuild the object. Junio Hamano explained Nico's rule to me today, and this commit fixes DeltaWindow to implement it as described. When no base has been chosen the computation is simply the statements denoted above. However once a base with depth of 9 has been chosen (e.g. when pack.depth is limited to 10), a non-delta source may create a new delta that is up to 10x larger than the already selected base. This reflects the intent of Nico's size distribution rule no matter what order objects are visited in the DeltaWindow. With this patch and my other patches applied, repacking JGit with: [pack] reuseObjects = false reuseDeltas = false depth = 50 window = 250 threads = 4 compression = 9 CGit (all) 5,711,735 bytes; real 0m13.942s user 0m47.722s [1] JGit heads 5,718,295 bytes; real 0m11.880s user 0m38.177s [2] rest 9,809 bytes The improved JGit result for the head pack is only 6.4 KiB larger than CGit's resulting pack. This patch allowed JGit to find an additional 39.7 KiB worth of space savings. JGit now also often runs 2s faster than CGit, despite also creating bitmaps and pruning objects after the head pack creation. [1] time git repack -a -d -F --window=250 --depth=50 [2] time java -Xmx128m -jar jgit debug-gc Change-Id: I5caec31359bf7248cabdd2a3254c84d4ee3cd96b
11 years ago
Correct distribution of allowed delta size along chain length

Nicolas Pitre discovered a very simple rule for selecting between two different delta base candidates:

  - if based on a whole object, must be <= 50% of target
  - if at end of a chain, must be <= 1/depth * 50% of target

The rule penalizes deltas near the end of the chain, requiring them to be very small in order to be kept by the packer. This favors deltas that are based on a shorter chain, where the read-time unpack cost is much lower. Fewer bytes need to be consulted from the source pack file, and less copying is required in memory to rebuild the object.

Junio Hamano explained Nico's rule to me today, and this commit fixes DeltaWindow to implement it as described.

When no base has been chosen the computation is simply the two statements listed above. However once a base with depth of 9 has been chosen (e.g. when pack.depth is limited to 10), a non-delta source may create a new delta that is up to 10x larger than the already selected base. This reflects the intent of Nico's size distribution rule no matter what order objects are visited in the DeltaWindow.

With this patch and my other patches applied, repacking JGit with:

  [pack]
    reuseObjects = false
    reuseDeltas = false
    depth = 50
    window = 250
    threads = 4
    compression = 9

  CGit (all)   5,711,735 bytes; real 0m13.942s user 0m47.722s [1]
  JGit heads   5,718,295 bytes; real 0m11.880s user 0m38.177s [2]
       rest        9,809 bytes

The improved JGit result for the head pack is only 6.4 KiB larger than CGit's resulting pack. This patch allowed JGit to find an additional 39.7 KiB worth of space savings. JGit now also often runs 2s faster than CGit, despite also creating bitmaps and pruning objects after the head pack creation.

[1] time git repack -a -d -F --window=250 --depth=50
[2] time java -Xmx128m -jar jgit debug-gc

Change-Id: I5caec31359bf7248cabdd2a3254c84d4ee3cd96b
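The size rule described in this commit message can be illustrated with a small Java sketch. This is not the code from DeltaWindow: the class name, method names, and parameters below are hypothetical, and the linear scaling between the two stated endpoints (50% for a whole-object source, 1/depth * 50% at the end of a chain) is an assumption made for illustration.

```java
/**
 * Illustrative sketch of the delta size limit rule. All names here are
 * hypothetical; this is not the DeltaWindow implementation.
 */
public final class DeltaSizeLimitSketch {
    private DeltaSizeLimitSketch() {
    }

    /**
     * Limit when no base has been chosen yet.
     * Assumes 0 <= srcDepth < maxDepth.
     * A whole-object source (srcDepth == 0) allows 50% of the target size;
     * a source one step from the depth limit allows 1/maxDepth of that.
     */
    public static long limitWithoutBase(long targetSize, int maxDepth, int srcDepth) {
        long half = targetSize / 2;
        return half * (maxDepth - srcDepth) / maxDepth;
    }

    /**
     * Limit once a best delta has already been selected.
     * Assumes srcDepth and bestBaseDepth are both below maxDepth.
     * The allowance is scaled by the remaining chain room of the new source
     * relative to the remaining chain room of the current best base.
     */
    public static long limitWithBase(long bestDeltaSize, int bestBaseDepth,
            int maxDepth, int srcDepth) {
        return bestDeltaSize * (maxDepth - srcDepth) / (maxDepth - bestBaseDepth);
    }

    public static void main(String[] args) {
        // 1000-byte target, pack.depth = 10, whole-object source.
        System.out.println(limitWithoutBase(1000, 10, 0)); // 500
        // Same target, source already at depth 9.
        System.out.println(limitWithoutBase(1000, 10, 9)); // 50
        // Best delta so far is 40 bytes on a depth-9 base; a whole-object
        // source may produce a delta up to 10x larger.
        System.out.println(limitWithBase(40, 9, 10, 0)); // 400
    }
}
```

Under these assumptions the last call reproduces the example in the message: with pack.depth limited to 10 and a depth-9 base already selected, a non-delta source is allowed a delta up to 10x the size of the one already chosen.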
11 years ago
Repository | Pack Size (bytes) | Packing Time | JGit - CGit = Difference | JGit / CGit -----------+----------------------------------+----------------- git | 25094348 - 24322890 = +771458 | 59.434s / 59.133s jgit | 5669515 - 5709046 = - 39531 | 6.654s / 6.806s linux-2.6 | 389M - 386M = +3M | 20m02s / 18m01s For the above tests pack.threads was set to 1, window size=10, delta depth=50, and delta and object reuse was disabled for both implementations. Both implementations were reading from an already fully packed repository on local disk. The running time reported is after 1 warm-up run of the tested implementation. PackWriter is writing 771 KiB more data on git.git, 3M more on linux-2.6, but is actually 39.5 KiB smaller on jgit.git. Being larger by less than 0.7% on linux-2.6 isn't bad, nor is taking an extra 2 minutes to pack. On the running time side, JGit is at a major disadvantage because linux-2.6 doesn't fit into the default WindowCache of 20M, while C Git is able to mmap the entire pack and have it available instantly in physical memory (assuming hot cache). CGit also has a feature where it caches deltas that were created during the compression phase, and uses those cached deltas during the writing phase. PackWriter does not implement this (yet), and therefore must create every delta twice. This could easily account for the increased running time we are seeing. Change-Id: I6292edc66c2e95fbe45b519b65fdb3918068889c Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
14 years ago
/*
 * Copyright (C) 2010, Google Inc.
 * and other copyright owners as documented in the project's IP log.
 *
 * This program and the accompanying materials are made available
 * under the terms of the Eclipse Distribution License v1.0 which
 * accompanies this distribution, is reproduced below, and is
 * available at http://www.eclipse.org/org/documents/edl-v10.php
 *
 * All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or
 * without modification, are permitted provided that the following
 * conditions are met:
 *
 * - Redistributions of source code must retain the above copyright
 *   notice, this list of conditions and the following disclaimer.
 *
 * - Redistributions in binary form must reproduce the above
 *   copyright notice, this list of conditions and the following
 *   disclaimer in the documentation and/or other materials provided
 *   with the distribution.
 *
 * - Neither the name of the Eclipse Foundation, Inc. nor the
 *   names of its contributors may be used to endorse or promote
 *   products derived from this software without specific prior
 *   written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND
 * CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES,
 * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
 * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
 * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
 * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
 * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
 * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
 * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
 * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
 * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
 * ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 */

package org.eclipse.jgit.internal.storage.pack;

import java.io.EOFException;
import java.io.IOException;
import java.io.OutputStream;
import java.util.zip.Deflater;

import org.eclipse.jgit.errors.IncorrectObjectTypeException;
import org.eclipse.jgit.errors.LargeObjectException;
import org.eclipse.jgit.errors.MissingObjectException;
import org.eclipse.jgit.lib.ObjectReader;
import org.eclipse.jgit.lib.ProgressMonitor;
import org.eclipse.jgit.storage.pack.PackConfig;
import org.eclipse.jgit.util.TemporaryBuffer;

final class DeltaWindow {
	private static final boolean NEXT_RES = false;
	private static final boolean NEXT_SRC = true;

	private final PackConfig config;
	private final DeltaCache deltaCache;
	private final ObjectReader reader;
	private final ProgressMonitor monitor;
	private final long bytesPerUnit;
	private long bytesProcessed;

	/** Maximum number of bytes to admit to the window at once. */
	private final long maxMemory;

	/** Maximum depth we should create for any delta chain. */
	private final int maxDepth;

	private final ObjectToPack[] toSearch;
	private int cur;
	private int end;

	/** Amount of memory we have loaded right now. */
	private long loaded;

	// The object we are currently considering needs a lot of state:

	/** Window entry of the object we are currently considering. */
	private DeltaWindowEntry res;

	/** If we have chosen a base, the window entry it was created from. */
	private DeltaWindowEntry bestBase;
	private int deltaLen;
	private Object deltaBuf;

	/** Used to compress cached deltas. */
	private Deflater deflater;

	DeltaWindow(PackConfig pc, DeltaCache dc, ObjectReader or,
			ProgressMonitor pm, long bpu,
			ObjectToPack[] in, int beginIndex, int endIndex) {
		config = pc;
		deltaCache = dc;
		reader = or;
		monitor = pm;
		bytesPerUnit = bpu;
		toSearch = in;
		cur = beginIndex;
		end = endIndex;

		maxMemory = Math.max(0, config.getDeltaSearchMemoryLimit());
		maxDepth = config.getMaxDeltaDepth();
		res = DeltaWindowEntry.createWindow(config.getDeltaSearchWindowSize());
	}

	synchronized DeltaTask.Slice remaining() {
		int e = end;
		int halfRemaining = (e - cur) >>> 1;
		if (0 == halfRemaining)
			return null;

		int split = e - halfRemaining;
		int h = toSearch[split].getPathHash();

		// Attempt to split on the next path after the 50% split point.
		for (int n = split + 1; n < e; n++) {
			if (h != toSearch[n].getPathHash())
				return new DeltaTask.Slice(n, e);
		}

		if (h != toSearch[cur].getPathHash()) {
			// Try to split on the path before the 50% split point.
			// Do not split the path currently being processed.
			for (int p = split - 1; cur < p; p--) {
				if (h != toSearch[p].getPathHash())
					return new DeltaTask.Slice(p + 1, e);
			}
		}
		return null;
	}

	synchronized boolean tryStealWork(DeltaTask.Slice s) {
		if (s.beginIndex <= cur || end <= s.beginIndex)
			return false;
		end = s.beginIndex;
		return true;
	}

	void search() throws IOException {
		try {
			for (;;) {
				ObjectToPack next;
				synchronized (this) {
					if (end <= cur)
						break;
					next = toSearch[cur++];
				}
				if (maxMemory != 0) {
					clear(res);
					final long need = estimateSize(next);
					DeltaWindowEntry n = res.next;
					for (; maxMemory < loaded + need && n != res; n = n.next)
						clear(n);
				}
				res.set(next);
				clearWindowOnTypeSwitch();

				if (res.object.isEdge() || res.object.doNotAttemptDelta()) {
					// We don't actually want to make a delta for
					// them, just need to push them into the window
					// so they can be read by other objects.
					keepInWindow();
				} else {
					// Search for a delta for the current window slot.
					if (bytesPerUnit <= (bytesProcessed += next.getWeight())) {
						int d = (int) (bytesProcessed / bytesPerUnit);
						monitor.update(d);
						bytesProcessed -= d * bytesPerUnit;
					}
					searchInWindow();
				}
			}
		} finally {
			if (deflater != null)
				deflater.end();
		}
	}

	private static long estimateSize(ObjectToPack ent) {
		return DeltaIndex.estimateIndexSize(ent.getWeight());
	}

	private static long estimateIndexSize(DeltaWindowEntry ent) {
		if (ent.buffer == null)
			return estimateSize(ent.object);

		int len = ent.buffer.length;
		return DeltaIndex.estimateIndexSize(len) - len;
	}

	private void clearWindowOnTypeSwitch() {
		DeltaWindowEntry p = res.prev;
		if (!p.empty() && res.type() != p.type()) {
			for (; p != res; p = p.prev) {
				clear(p);
			}
		}
	}

	private void clear(DeltaWindowEntry ent) {
		if (ent.index != null)
			loaded -= ent.index.getIndexSize();
		else if (ent.buffer != null)
			loaded -= ent.buffer.length;
		ent.set(null);
	}

	private void searchInWindow() throws IOException {
		// Loop through the window backwards, considering every entry.
		// This lets us look at the bigger objects that came before.
		for (DeltaWindowEntry src = res.prev; src != res; src = src.prev) {
			if (src.empty())
				break;
			if (delta(src) /* == NEXT_SRC */)
				continue;
			bestBase = null;
			deltaBuf = null;
			return;
		}

		// We couldn't find a suitable delta for this object, but it may
		// still be able to act as a base for another one.
		if (bestBase == null) {
			keepInWindow();
			return;
		}

		// Select this best matching delta as the base for the object.
		//
		ObjectToPack srcObj = bestBase.object;
		ObjectToPack resObj = res.object;
		if (srcObj.isEdge()) {
			// The source (the delta base) is an edge object outside of the
			// pack. It's part of the common base set that the peer already
			// has on hand, so we don't want to send it. We have to store
			// an ObjectId and *NOT* an ObjectToPack for the base to ensure
			// the base isn't included in the outgoing pack file.
			resObj.setDeltaBase(srcObj.copy());
		} else {
			// The base is part of the pack we are sending, so it should be
			// a direct pointer to the base.
			resObj.setDeltaBase(srcObj);
		}

		int depth = srcObj.getDeltaDepth() + 1;
		resObj.setDeltaDepth(depth);
		resObj.clearReuseAsIs();
		cacheDelta(srcObj, resObj);

		if (depth < maxDepth) {
			// Reorder the window so that the best base will be tested
			// first for the next object, and the current object will
			// be the second candidate to consider before any others.
			res.makeNext(bestBase);
			res = bestBase.next;
		}

		bestBase = null;
		deltaBuf = null;
	}

	private boolean delta(DeltaWindowEntry src)
			throws IOException {
		// If the sizes are radically different, this is a bad pairing.
		if (res.size() < src.size() >>> 4)
			return NEXT_SRC;

		int msz = deltaSizeLimit(src);
		if (msz <= 8) // Nearly impossible to fit useful delta.
			return NEXT_SRC;

		// If we have to insert a lot to make this work, find another.
		if (res.size() - src.size() > msz)
			return NEXT_SRC;

		DeltaIndex srcIndex;
		try {
			srcIndex = index(src);
		} catch (LargeObjectException tooBig) {
			// If the source is too big to work on, skip it.
			return NEXT_SRC;
		} catch (IOException notAvailable) {
			if (src.object.isEdge()) // Missing edges are OK.
				return NEXT_SRC;
			throw notAvailable;
		}

		byte[] resBuf;
		try {
			resBuf = buffer(res);
		} catch (LargeObjectException tooBig) {
			// If it's too big, move on to another item.
			return NEXT_RES;
		}

		try {
			OutputStream delta = msz <= (8 << 10)
				? new ArrayStream(msz)
				: new TemporaryBuffer.Heap(msz);
			if (srcIndex.encode(delta, resBuf, msz))
				selectDeltaBase(src, delta);
		} catch (IOException deltaTooBig) {
			// Unlikely, encoder should see limit and return false.
		}
		return NEXT_SRC;
	}

	private void selectDeltaBase(DeltaWindowEntry src, OutputStream delta) {
		bestBase = src;

		if (delta instanceof ArrayStream) {
			ArrayStream a = (ArrayStream) delta;
			deltaBuf = a.buf;
			deltaLen = a.cnt;
		} else {
			TemporaryBuffer.Heap b = (TemporaryBuffer.Heap) delta;
			deltaBuf = b;
			deltaLen = (int) b.length();
		}
	}

	private int deltaSizeLimit(DeltaWindowEntry src) {
		if (bestBase == null) {
			// Any delta should be no more than 50% of the original size
			// (for text files deflate of whole form should shrink 50%).
			int n = res.size() >>> 1;

			// Evenly distribute delta size limits over allowed depth.
			// If src is non-delta (depth = 0), delta <= 50% of original.
			// If src is almost at limit (9/10), delta <= 10% of original.
			return n * (maxDepth - src.depth()) / maxDepth;
		}

		// With a delta base chosen any new delta must be "better".
		// Retain the distribution described above.
		int d = bestBase.depth();
		int n = deltaLen;

		// If src is whole (depth=0) and base is near limit (depth=9/10)
		// any delta using src can be 10x larger and still be better.
		//
		// If src is near limit (depth=9/10) and base is whole (depth=0)
		// a new delta dependent on src must be 1/10th the size.
		return n * (maxDepth - src.depth()) / (maxDepth - d);
	}

	private void cacheDelta(ObjectToPack srcObj, ObjectToPack resObj) {
		if (deltaCache.canCache(deltaLen, srcObj, resObj)) {
			try {
				byte[] zbuf = new byte[deflateBound(deltaLen)];
				ZipStream zs = new ZipStream(deflater(), zbuf);
				if (deltaBuf instanceof byte[])
					zs.write((byte[]) deltaBuf, 0, deltaLen);
				else
					((TemporaryBuffer.Heap) deltaBuf).writeTo(zs, null);
				deltaBuf = null;
				int len = zs.finish();

				resObj.setCachedDelta(deltaCache.cache(zbuf, len, deltaLen));
				resObj.setCachedSize(deltaLen);
			} catch (IOException | OutOfMemoryError err) {
				deltaCache.credit(deltaLen);
			}
		}
	}

	private static int deflateBound(int insz) {
		return insz + ((insz + 7) >> 3) + ((insz + 63) >> 6) + 11;
	}

	private void keepInWindow() {
		res = res.next;
	}

	private DeltaIndex index(DeltaWindowEntry ent)
			throws MissingObjectException, IncorrectObjectTypeException,
			IOException, LargeObjectException {
		DeltaIndex idx = ent.index;
		if (idx == null) {
			checkLoadable(ent, estimateIndexSize(ent));

			try {
				idx = new DeltaIndex(buffer(ent));
			} catch (OutOfMemoryError noMemory) {
				LargeObjectException.OutOfMemory e;
				e = new LargeObjectException.OutOfMemory(noMemory);
				e.setObjectId(ent.object);
				throw e;
			}
			if (maxMemory != 0)
				loaded += idx.getIndexSize() - idx.getSourceSize();
			ent.index = idx;
		}
		return idx;
	}

	private byte[] buffer(DeltaWindowEntry ent) throws MissingObjectException,
			IncorrectObjectTypeException, IOException, LargeObjectException {
		byte[] buf = ent.buffer;
		if (buf == null) {
			checkLoadable(ent, ent.size());

			buf = PackWriter.buffer(config, reader, ent.object);
			if (maxMemory != 0)
				loaded += buf.length;
			ent.buffer = buf;
		}
		return buf;
	}

	private void checkLoadable(DeltaWindowEntry ent, long need) {
		if (maxMemory == 0)
			return;

		DeltaWindowEntry n = res.next;
		for (; maxMemory < loaded + need; n = n.next) {
			clear(n);
			if (n == ent)
				throw new LargeObjectException.ExceedsLimit(
						maxMemory, loaded + need);
		}
	}

	private Deflater deflater() {
		if (deflater == null)
			deflater = new Deflater(config.getCompressionLevel());
		else
			deflater.reset();
		return deflater;
	}

	static final class ZipStream extends OutputStream {
		private final Deflater deflater;

		private final byte[] zbuf;

		private int outPtr;

		ZipStream(Deflater deflater, byte[] zbuf) {
			this.deflater = deflater;
			this.zbuf = zbuf;
		}

		int finish() throws IOException {
			deflater.finish();
			for (;;) {
				if (outPtr == zbuf.length)
					throw new EOFException();

				int n = deflater.deflate(zbuf, outPtr, zbuf.length - outPtr);
				if (n == 0) {
					if (deflater.finished())
						return outPtr;
					throw new IOException();
				}
				outPtr += n;
			}
		}

		@Override
		public void write(byte[] b, int off, int len) throws IOException {
			deflater.setInput(b, off, len);
			for (;;) {
				if (outPtr == zbuf.length)
					throw new EOFException();

				int n = deflater.deflate(zbuf, outPtr, zbuf.length - outPtr);
				if (n == 0) {
					if (deflater.needsInput())
						break;
					throw new IOException();
				}
				outPtr += n;
			}
		}

		@Override
		public void write(int b) throws IOException {
			throw new UnsupportedOperationException();
		}
	}

	static final class ArrayStream extends OutputStream {
		final byte[] buf;

		int cnt;

		ArrayStream(int max) {
			buf = new byte[max];
		}

		@Override
		public void write(int b) throws IOException {
			if (cnt == buf.length)
				throw new IOException();
			buf[cnt++] = (byte) b;
		}

		@Override
		public void write(byte[] b, int off, int len) throws IOException {
			if (len > buf.length - cnt)
				throw new IOException();
			System.arraycopy(b, off, buf, cnt, len);
			cnt += len;
		}
	}
}
  444. }