my fork of dmp
Merge pull request #9 from AnubhabB/better-diff
Non-algorithmic improvements to `Diff` for massive performance gains!
Anubhab Bandyopadhyay 2025-01-03
parent 05a688a · parent 41166ab · commit 074fbd1
-rw-r--r--CHANGELOG.md6
-rw-r--r--Cargo.toml4
-rw-r--r--README.md39
-rw-r--r--examples/compat.rs2
-rw-r--r--examples/delta.rs2
-rw-r--r--examples/efficiency.rs3
-rw-r--r--src/dmp.rs779
-rw-r--r--src/lib.rs39
8 files changed, 493 insertions, 381 deletions
diff --git a/CHANGELOG.md b/CHANGELOG.md
index b53acc3..13253eb 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,5 +1,11 @@
# CHANGELOG.md
+## 0.4.0
+
+Performance:
+
+ - Non-algorithmic improvements to `Diff` implementation resulting in `~41%` improvements over previous [benchmarks](). [Here's]() a writeup of the experiments.
+
## 0.3.2
Fix:
diff --git a/Cargo.toml b/Cargo.toml
index 3cc35cb..81ed106 100644
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -1,11 +1,11 @@
[package]
name = "diff-match-patch-rs"
-version = "0.3.2"
+version = "0.4.0"
edition = "2021"
authors = ["Anubhab Bandyopadhyay"]
homepage = "https://docs.rs/diff-match-patch-rs"
repository = "https://github.com/AnubhabB/diff-match-patch-rs.git"
-description = "A high-performance port of Myer's diff algorithm to perform the operations required for synchronizing plain text."
+description = "The fastest implementation of Myer's diff algorithm to perform the operations required for synchronizing plain text."
readme = "README.md"
license = "MIT OR Apache-2.0"
keywords = ["diff", "match", "patch", "text-synchronization"]
diff --git a/README.md b/README.md
index 32b446a..c4e3316 100644
--- a/README.md
+++ b/README.md
@@ -4,7 +4,7 @@
[<img alt="crates.io" src="https://img.shields.io/crates/v/diff-match-patch-rs" height="20">](https://crates.io/crates/diff-match-patch-rs)
[<img alt="docs.rs" src="https://img.shields.io/badge/docs.rs-diff_match_patch_rs?style=for-the-badge&logo=docs.rs&labelColor=%23555555" height="20">](https://docs.rs/diff-match-patch-rs)
-A very **fast**, **accurate** and **wasm ready** port of [Diff Match Patch](https://github.com/dmsnell/diff-match-patch) in Rust. The diff implementation is based on [Myers' diff algorithm](https://neil.fraser.name/writing/diff/myers.pdf).
+**Fastest**, **accurate** and **wasm ready** implementation of [Diff Match Patch](https://github.com/dmsnell/diff-match-patch) in Rust based on [Myers' diff algorithm](https://neil.fraser.name/writing/diff/myers.pdf).
## Highlights of this crate
- Exposes two modes of operating with `diff-match-patch`, a `Efficient` mode and `Compat` mode. While the `Efficient` mode squeezes out the max performance the `Compat` mode ensures compatibility across other libraries or implementations (rust or otherwise). According to [Benchmarks](#benchmarks), our slower `Compat` mode is still faster than other implementations in rust.
@@ -17,11 +17,29 @@ A very **fast**, **accurate** and **wasm ready** port of [Diff Match Patch](http
- Added a `fuzzer` for sanity
- Exposes the same APIs as [Diff Match Patch](https://github.com/dmsnell/diff-match-patch) with minor changes to make it more idiomatic in Rust.
+## Benchmarks
+Benchmarks are maintained [diff-match-patch-bench repository](https://github.com/AnubhabB/diff-match-patch-rs-bench)
+
+| Lang. | Library | Diff Avg. | Patch Avg. | Bencher | Mode | Correct |
+|:-------:|:----------------------------------------------------------------------------------------:|:---------:|:----------:|:----------:|:-----------:|:-------:|
+| `rust` | [diff_match_patch v0.1.1](https://crates.io/crates/diff_match_patch)[^1] | 56.618 ms | 9.00 ms | Criterion | - | ✅ |
+| `rust` | [dmp v0.2.0](https://crates.io/crates/dmp) | 56.609 ms | 12.25 ms | Criterion | - | ✅ |
+| `rust` | [diff-match-patch-rs](https://github.com/AnubhabB/diff-match-patch-rs.git)<sup>our</sup> | 32.795 ms | 533.17 µs | Criterion | `Efficient`| ✅ |
+| `rust` | [diff-match-patch-rs](https://github.com/AnubhabB/diff-match-patch-rs.git)<sup>our</sup> | 33.222 ms | 993.90 µs | Criterion | `Compat` | ✅ |
+| `go` | [go-diff](https://github.com/sergi/go-diff)[^2] | 30.31 ms | 118.2 ms | go test | - | ❗ |
+| `node` | [diff-match-patch](https://www.npmjs.com/package/diff-match-patch)[^3] | 246.90 ms | 1.07 ms | tinybench | - | ❌ |
+| `python`| [diff-match-patch](https://pypi.org/project/diff-match-patch/) | 1.01 s | 0.25 ms | timeit | - | ✅ |
+
+
+[^1]: [diff_match_patch](https://crates.io/crates/diff_match_patch) Adds an extra clone to the iterator because the `patch_apply` method takes mutable refc. to `diffs`.
+[^2]: [go-diff](https://github.com/sergi/go-diff) is technically a correct implementation but it generates far more diffs than any other implementation I've tested. E.g. In our test data [here](https://github.com/AnubhabB/diff-match-patch-rs-bench/testdata) the go lib ends up generating `~3000` diffs while other implementations are generating `~2300` diffs. My guess is one of the cleanup passes are skipped by `go-diff`.
+[^3]: [diff-match-patch](https://www.npmjs.com/package/diff-match-patch) generated `patch text` and `delta` breaks on `unicode surrogates`.
+
## Usage Examples
```toml
[dependencies]
-diff-match-patch-rs = "0.3.2"
+diff-match-patch-rs = "0.4.0"
```
### `Effitient` mode
@@ -175,23 +193,6 @@ Please checkout the `examples` directory of the [source repo](https://github.com
<div class="warning">The `Effitient` and `Compat` modes are mutually exclusive and will not generate correct output if used interchangibly at source and destination</div>
-## Benchmarks
-Benchmarks are maintained [diff-match-patch-bench repository](https://github.com/AnubhabB/diff-match-patch-rs-bench)
-
-| Lang. | Library | Diff Avg. | Patch Avg. | Bencher | Mode | Correct |
-|:-------:|:----------------------------------------------------------------------------------------:|:---------:|:----------:|:----------:|:-----------:|:-------:|
-| `rust` | [diff_match_patch v0.1.1](https://crates.io/crates/diff_match_patch)[^2] | 68.108 ms | 10.596 ms | Criterion | - | ✅ |
-| `rust` | [dmp v0.2.0](https://crates.io/crates/dmp) | 69.019 ms | 14.654 ms | Criterion | - | ✅ |
-| `rust` | [diff-match-patch-rs](https://github.com/AnubhabB/diff-match-patch-rs.git)<sup>our</sup> | 64.66 ms | 631.13 µs | Criterion | `Efficient` | ✅ |
-| `rust` | [diff-match-patch-rs](https://github.com/AnubhabB/diff-match-patch-rs.git)<sup>our</sup> | 64.68 ms | 1.1703 ms | Criterion | `Compat` | ✅ |
-| `go` | [go-diff](https://github.com/sergi/go-diff) | 50.31 ms | 135.2 ms | go test | - | ✅ |
-| `node` | [diff-match-patch](https://www.npmjs.com/package/diff-match-patch)[^1] | 246.90 ms | 1.07 ms | tinybench | - | ❌ |
-| `python`| [diff-match-patch](https://pypi.org/project/diff-match-patch/) | 1.01 s | 0.25 ms | timeit | - | ✅ |
-
-
-[^1]: [diff-match-patch](https://www.npmjs.com/package/diff-match-patch) generated `patch text` and `delta` breaks on `unicode surrogates`.
-[^2]: Adds an extra clone to the iterator because the `patch_apply` method takes mutable refc. to `diffs`.
-
## Gotchas
**Diff incompatibility with `JavaScript` libs**:
diff --git a/examples/compat.rs b/examples/compat.rs
index 691e9f9..547ca21 100644
--- a/examples/compat.rs
+++ b/examples/compat.rs
@@ -2,7 +2,7 @@ use diff_match_patch_rs::{DiffMatchPatch, Efficient, Error, PatchInput};
/// An example flow of the effitient mode
/// This demo will cover creating a diff of two texts and then `patching` it back to get the original text
-
+///
// This is the source text
const TXT_OLD: &str = "I am the very model of a modern Major-General,
I've information vegetable, animal, and mineral,
diff --git a/examples/delta.rs b/examples/delta.rs
index 4368625..62192ab 100644
--- a/examples/delta.rs
+++ b/examples/delta.rs
@@ -2,7 +2,7 @@ use diff_match_patch_rs::{DiffMatchPatch, Efficient, Error, PatchInput};
/// An example flow of the effitient mode
/// This demo will cover creating a diff of two texts and then `patching` it back to get the original text
-
+///
// This is the source text
const TXT_OLD: &str = "I am the very model of a modern Major-General,
I've information vegetable, animal, and mineral,
diff --git a/examples/efficiency.rs b/examples/efficiency.rs
index dd26948..f88c844 100644
--- a/examples/efficiency.rs
+++ b/examples/efficiency.rs
@@ -9,7 +9,8 @@ use diff_match_patch_rs::{DiffMatchPatch, Efficient, Error, PatchInput};
/// `Efficiency mode` is not compatible with other libraries or implementations of `diff-match-patch`.
/// Use `efficiency` mode ONLY if you are using this `crate` across your stack
/// If you want a standardized implementation (working across libraries), use `Compat` mode istead
-
+///
+///
// This is the source text
const TXT_OLD: &str = "I am the very model of a modern Major-General,
I've information vegetable, animal, and mineral,
diff --git a/src/dmp.rs b/src/dmp.rs
index c4be11a..fef0ebb 100644
--- a/src/dmp.rs
+++ b/src/dmp.rs
@@ -56,25 +56,27 @@ impl<T: DType> Diff<T> {
Self(op, data.to_vec())
}
- /// helper functions to create ops
+ /// helper functions to create op::Delete
pub fn delete(data: &[T]) -> Self {
Self::new(Ops::Delete, data)
}
+ /// helper functions to create op::Insert
pub fn insert(data: &[T]) -> Self {
Self::new(Ops::Insert, data)
}
+ /// helper functions to create op::Equal
pub fn equal(data: &[T]) -> Self {
Self::new(Ops::Equal, data)
}
- // returns the operation of the current diff
+ /// returns the operation of the current diff
pub fn op(&self) -> Ops {
self.0
}
- // returns the bytes of the data
+ /// returns slice of the data
pub fn data(&self) -> &[T] {
&self.1[..]
}
@@ -263,26 +265,42 @@ impl DiffMatchPatch {
}
// Trim common prefix
- let common_prefix = Self::common_prefix(old_t, new_t, false);
- let common_suffix =
- Self::common_prefix(&old_t[common_prefix..], &new_t[common_prefix..], true);
-
- let mut diffs = self.compute(
- &old_t[common_prefix..old_t.len() - common_suffix],
- &new_t[common_prefix..new_t.len() - common_suffix],
- linemode,
- deadline,
- )?;
+ let (common_prefix, common_suffix) = Self::common_prefix_suffix(old_t, new_t);
+
+ let (old_uniq_end, new_uniq_end) =
+ (old_t.len() - common_suffix, new_t.len() - common_suffix);
+
+ // Some elaborate checks to avoid bound checks
+ // slice_start_index_len_fail
+ // slice_end_index_len_fail
+ // slice_index_order_fail
+ let mut diffs = if common_prefix <= old_t.len()
+ && common_prefix <= new_t.len()
+ && old_uniq_end <= old_t.len()
+ && new_uniq_end <= new_t.len()
+ && common_prefix <= old_uniq_end
+ && common_prefix <= new_uniq_end
+ {
+ self.compute(
+ &old_t[common_prefix..old_uniq_end],
+ &new_t[common_prefix..new_uniq_end],
+ linemode,
+ deadline,
+ )?
+ } else {
+ // This is part of the bound check optimization! Should never fail!
+ return Err(crate::errors::Error::InvalidInput);
+ };
// Restore the prefix and suffix.
- if common_prefix > 0 {
+ if common_prefix > 0 && common_prefix <= old_t.len() {
let mut d = vec![Diff::equal(&old_t[..common_prefix])];
d.append(&mut diffs);
diffs = d;
}
- if common_suffix > 0 {
- diffs.push(Diff::new(Ops::Equal, &new_t[new_t.len() - common_suffix..]));
+ if common_suffix > 0 && new_t.len() >= common_suffix && new_uniq_end < new_t.len() {
+ diffs.push(Diff::new(Ops::Equal, &new_t[new_uniq_end..]));
}
Self::cleanup_merge(&mut diffs);
@@ -314,11 +332,7 @@ impl DiffMatchPatch {
};
// found a subsequence which contains the short text
- if let Some(idx) = long
- .windows(short.len())
- .step_by(1)
- .position(|k| k == short)
- {
+ if let Some(idx) = long.windows(short.len()).position(|k| k == short) {
// Shorter text is inside the longer text (speedup).
let op = if old_gt_new { Ops::Delete } else { Ops::Insert };
let diffs = vec![
@@ -383,16 +397,14 @@ impl DiffMatchPatch {
};
// pointless - two small for this algo
- if long.len() < 4 || short.len() * 2 < long.len() {
+ if long.len() < 4 || (short.len() << 1) < long.len() {
return None;
}
// First check if the second quarter is the seed for a half-match.
- // let hm1 = Self::diff_half_match_i(long, short, (long.len() as f32 / 4.).ceil() as usize);
- let hm1 = Self::half_match_i(long, short, long.len() / 4);
+ let hm1 = Self::half_match_i(long, short, (long.len() + 3) >> 2);
// Check again based on the third quarter.
- // let hm2 = Self::diff_half_match_i(long, short, (long.len() as f32 / 2.).ceil() as usize);
- let hm2 = Self::half_match_i(long, short, long.len() / 2);
+ let hm2 = Self::half_match_i(long, short, (long.len() + 1) >> 1);
if hm1.is_none() && hm2.is_none() {
return None;
@@ -458,7 +470,7 @@ impl DiffMatchPatch {
// Add a dummy entry at the end.
diffs.push(Diff::equal(&[]));
- let mut difflen = diffs.len();
+
let mut pointer = 0_usize;
// count of bytes inserted
@@ -470,7 +482,7 @@ impl DiffMatchPatch {
// same for delete
let mut delete_data = Vec::new();
- while pointer < difflen {
+ while pointer < diffs.len() {
match diffs[pointer].op() {
Ops::Insert => {
insert_n += 1;
@@ -487,7 +499,9 @@ impl DiffMatchPatch {
let idxstart = pointer - delete_n - insert_n;
let idxend = idxstart + delete_n + insert_n;
- diffs.drain(idxstart..idxend);
+ if idxstart <= idxend {
+ diffs.drain(idxstart..idxend);
+ }
pointer = idxstart;
@@ -499,7 +513,6 @@ impl DiffMatchPatch {
});
pointer += subdifflen;
- difflen = diffs.len();
}
// resetting counters
insert_n = 0;
@@ -540,21 +553,32 @@ impl DiffMatchPatch {
}
// Trim common prefix
- let common_prefix = Self::common_prefix(old, new, false);
- let common_suffix = Self::common_prefix(&old[common_prefix..], &new[common_prefix..], true);
-
- let mut diffs = self.compute_lines(
- &old[common_prefix..old.len() - common_suffix],
- &new[common_prefix..new.len() - common_suffix],
- deadline,
- )?;
+ let (common_prefix, common_suffix) = Self::common_prefix_suffix(old, new);
+
+ let (old_uniq_end, new_uniq_end) = (old.len() - common_suffix, new.len() - common_suffix);
+ let mut diffs = if common_prefix <= old.len()
+ && common_prefix <= new.len()
+ && old_uniq_end <= old.len()
+ && new_uniq_end <= new.len()
+ && common_prefix <= old_uniq_end
+ && common_prefix <= new_uniq_end
+ {
+ self.compute_lines(
+ &old[common_prefix..old_uniq_end],
+ &new[common_prefix..new_uniq_end],
+ deadline,
+ )?
+ } else {
+ // This is part of the bound check optimization! Should never fail!
+ return Err(crate::errors::Error::InvalidInput);
+ };
- if common_prefix > 0 {
+ if common_prefix > 0 && common_prefix <= old.len() {
diffs.insert(0, Diff::equal(&old[..common_prefix]));
}
- if common_suffix > 0 {
- diffs.push(Diff::new(Ops::Equal, &new[new.len() - common_suffix..]));
+ if common_suffix > 0 && new.len() >= common_suffix && new_uniq_end < new.len() {
+ diffs.push(Diff::new(Ops::Equal, &new[new_uniq_end..]));
}
Self::cleanup_merge(&mut diffs);
@@ -585,11 +609,7 @@ impl DiffMatchPatch {
};
// found a subsequence which contains the short text
- if let Some(idx) = long
- .windows(short.len())
- .step_by(1)
- .position(|k| k == short)
- {
+ if let Some(idx) = long.windows(short.len()).position(|k| k == short) {
// Shorter text is inside the longer text (speedup).
let op = if old_gt_new { Ops::Delete } else { Ops::Insert };
let diffs = vec![
@@ -648,34 +668,39 @@ impl DiffMatchPatch {
new: &'a [T],
deadline: Option<Time>,
) -> Result<Vec<Diff<T>>, crate::errors::Error> {
- // let text1_length = old.len() as isize;
- // let text2_length = new.len() as isize;
let old_len = old.len() as isize;
let new_len = new.len() as isize;
- let max_d = (old_len + new_len + 1) / 2;
- let v_offset = max_d;
- let v_len = 2 * max_d;
+ let v_offset = (old_len + new_len + 1) >> 1; // same as `max_d`, (/ 2)
+ let v_len = v_offset << 1; // (* 2)
+
+ let mut v = vec![-1_isize; (v_len << 1) as usize];
+ let (v1, v2) = v.split_at_mut(v_len as usize);
- let mut v1 = vec![-1_isize; v_len as usize];
- let mut v2 = vec![-1_isize; v_len as usize];
+ {
+ let v_trg = v_offset as usize + 1;
+ if v_trg < v1.len() {
+ v1[v_trg] = 0;
+ }
- v1[v_offset as usize + 1] = 0;
- v2[v_offset as usize + 1] = 0;
+ if v_trg < v2.len() {
+ v2[v_trg] = 0;
+ }
+ }
let delta = old_len - new_len;
// If the total number of characters is odd, then the front path will collide
// with the reverse path.
- let front = delta % 2 != 0;
+ let front = (delta & 1) != 0;
// Offsets for start and end of k loop.
// Prevents mapping of space beyond the grid.
- let mut k1start: isize = 0;
- let mut k1end: isize = 0;
- let mut k2start: isize = 0;
- let mut k2end: isize = 0;
+ let mut k1start = 0;
+ let mut k1end = 0;
+ let mut k2start = 0;
+ let mut k2end = 0;
- for d in 0..max_d {
+ for edit_dist in 0..v_offset {
// Bail out if deadline is reached.
if let Some(tout) = deadline {
#[cfg(target_arch = "wasm32")]
@@ -689,29 +714,43 @@ impl DiffMatchPatch {
}
// Walk the front path one step.
- let mut k1 = -d + k1start;
- while k1 < d + 1 - k1end {
+ let mut k1 = k1start - edit_dist;
+ while k1 < edit_dist + 1 - k1end {
let (x1, y1) = {
let k1_offset = v_offset + k1;
- let mut x1 = if k1 == -d
- || (k1 != d && v1[k1_offset as usize - 1] < v1[k1_offset as usize + 1])
- {
- v1[k1_offset as usize + 1]
+
+ let v1_prev = {
+ let prev_idx = (k1_offset - 1) as usize;
+ if prev_idx < v1.len() {
+ v1[prev_idx]
+ } else {
+ -1
+ }
+ };
+ let v1_next = {
+ let next_idx = (k1_offset + 1) as usize;
+ if next_idx < v1.len() {
+ v1[next_idx]
+ } else {
+ -1
+ }
+ };
+
+ let mut x1 = if k1 == -edit_dist || (k1 != edit_dist && v1_prev < v1_next) {
+ v1_next
} else {
- v1[k1_offset as usize - 1] + 1
+ v1_prev + 1
};
let mut y1 = x1 - k1;
- while x1 < old_len && y1 < new_len {
- let i1 = if x1 < 0 { old_len + x1 } else { x1 };
- let i2 = if y1 < 0 { new_len + y1 } else { y1 };
- if old[i1 as usize] != new[i2 as usize] {
- break;
- }
- x1 += 1;
- y1 += 1;
+ let (xi, yi) = Self::bisect_fwd_path_i(old, new, x1 as usize, y1 as usize);
+
+ y1 = yi as isize;
+ x1 = xi as isize;
+
+ if (0..v1.len() as isize).contains(&k1_offset) {
+ v1[k1_offset as usize] = x1;
}
- v1[k1_offset as usize] = x1;
(x1, y1)
};
@@ -723,10 +762,13 @@ impl DiffMatchPatch {
// Ran off the bottom of the graph.
k1start += 2;
} else if front {
- let k2_offset = v_offset + delta - k1;
- if k2_offset >= 0 && k2_offset < v_len && v2[k2_offset as usize] != -1 {
+ let k2_offset = (v_offset + delta - k1) as usize;
+ if (0..v_len as usize).contains(&k2_offset)
+ && k2_offset < v2.len()
+ && v2[k2_offset] != -1
+ {
// Mirror x2 onto top-left coordinate system.
- let x2 = old_len - v2[k2_offset as usize];
+ let x2 = old_len - v2[k2_offset];
if x1 >= x2 {
// Overlap detected.
return T::bisect_split(
@@ -744,38 +786,38 @@ impl DiffMatchPatch {
}
// Walk the reverse path one step.
- let mut k2 = -d + k2start;
- while k2 < d + 1 - k2end {
+ let mut k2 = k2start - edit_dist;
+ while k2 < edit_dist + 1 - k2end {
let (mut x2, y2) = {
- let k2_offset = v_offset + k2;
+ let k2_offset = (v_offset + k2) as usize;
+ let prev_idx = k2_offset - 1;
+ let next_idx = k2_offset + 1;
- let mut x2 = if k2 == -d
- || (k2 != d && v2[k2_offset as usize - 1] < v2[k2_offset as usize + 1])
- {
- v2[k2_offset as usize + 1]
+ let v2_prev = if prev_idx < v2.len() {
+ v2[prev_idx]
+ } else {
+ -1
+ };
+ let v2_next = if next_idx < v2.len() {
+ v2[next_idx]
} else {
- v2[k2_offset as usize - 1] + 1
+ -1
+ };
+
+ let mut x2 = if k2 == -edit_dist || (k2 != edit_dist && v2_prev < v2_next) {
+ v2_next
+ } else {
+ v2_prev + 1
};
let mut y2 = x2 - k2;
- while x2 < old_len && y2 < new_len {
- let i1 = if old_len - x2 > 0 {
- old_len - x2 - 1
- } else {
- x2 + 1
- };
- let i2 = if new_len - y2 > 0 {
- new_len - y2 - 1
- } else {
- y2 + 1
- };
- if old[i1 as usize] != new[i2 as usize] {
- break;
- }
- x2 += 1;
- y2 += 1;
+ let (xi, yi) = Self::bisect_rev_path_i(old, new, x2 as usize, y2 as usize);
+ y2 = yi as isize;
+ x2 = xi as isize;
+
+ if k2_offset < v2.len() {
+ v2[k2_offset] = x2;
}
- v2[k2_offset as usize] = x2;
(x2, y2)
};
@@ -787,10 +829,11 @@ impl DiffMatchPatch {
// Ran off the top of the graph.
k2start += 2;
} else if !front {
- let k1_offset = v_offset + delta - k2;
- if k1_offset >= 0 && k1_offset < v_len && v1[k1_offset as usize] != -1 {
- let x1 = v1[k1_offset as usize];
- let y1 = v_offset + x1 - k1_offset;
+ let k1_offset = (v_offset + delta - k2) as usize;
+ if k1_offset < v_len as usize && k1_offset < v1.len() && v1[k1_offset] != -1 {
+ // if (0..v_len).contains(&k1_offset) && k1_offset < v1.len() as isize && v1[k1_offset as usize] != -1 {
+ let x1 = v1[k1_offset];
+ let y1 = v_offset + x1 - k1_offset as isize;
// Mirror x2 onto top-left coordinate system.
x2 = old_len - x2;
if x1 >= x2 {
@@ -813,19 +856,74 @@ impl DiffMatchPatch {
Ok(vec![Diff::delete(old), Diff::insert(new)])
}
+ fn bisect_fwd_path_i<'a, T: DType>(
+ old: &'a [T],
+ new: &'a [T],
+ x1: usize,
+ y1: usize,
+ ) -> (usize, usize) {
+ let mut x1 = x1;
+ let mut y1 = y1;
+
+ while x1 < old.len() && y1 < new.len() {
+ if old[x1] != new[y1] {
+ break;
+ }
+
+ x1 += 1;
+ y1 += 1;
+ }
+
+ (x1, y1)
+ }
+
+ fn bisect_rev_path_i<'a, T: DType>(
+ old: &'a [T],
+ new: &'a [T],
+ x2: usize,
+ y2: usize,
+ ) -> (usize, usize) {
+ let mut x2 = x2;
+ let mut y2 = y2;
+
+ while x2 < old.len() && y2 < new.len() {
+ let i1 = if old.len() - x2 > 0 {
+ old.len() - x2 - 1
+ } else {
+ x2 + 1
+ };
+ let i2 = if new.len() - y2 > 0 {
+ new.len() - y2 - 1
+ } else {
+ y2 + 1
+ };
+ if old[i1] != new[i2] {
+ break;
+ }
+ x2 += 1;
+ y2 += 1;
+ }
+
+ (x2, y2)
+ }
+
// Does a substring of shorttext exist within longtext such that the substring
// is at least half the length of longtext?
- //idx Start index of quarter length substring within longtext.
- #[inline]
+ // idx Start index of quarter length substring within longtext.
fn half_match_i<'a, T: DType>(
long: &'a [T],
short: &'a [T],
idx: usize,
) -> Option<HalfMatch<'a, T>> {
// Start with a 1/4 length substring at position i as a seed.
-
- let seed = &long[idx..idx + long.len() / 4];
- let seedleen = seed.len();
+ let seed = {
+ let end = idx + (long.len() >> 2);
+ if idx < long.len() && end < long.len() {
+ &long[idx..end]
+ } else {
+ return None;
+ }
+ };
let mut j = 0;
let mut best_common: &[T] = &[];
@@ -834,30 +932,50 @@ impl DiffMatchPatch {
let mut best_short_a: &[T] = &[];
let mut best_short_b: &[T] = &[];
- while let Some(pos) = &short[j..]
- .windows(seedleen)
- .step_by(1)
- .position(|p| p == seed)
- {
- j += *pos;
+ while j < short.len() {
+ if let Some(pos) = &short[j..].windows(seed.len()).position(|p| p == seed) {
+ j += *pos;
+
+ let (prefix_len, suffix_len) = if idx < long.len() && j < short.len() {
+ (
+ Self::common_prefix(&long[idx..], &short[j..], false),
+ Self::common_prefix(&long[..idx], &short[..j], true),
+ )
+ } else {
+ // this bound check shouldn't fail .. since it has, no point going further
+ return None;
+ };
- let prefix_len = Self::common_prefix(&long[idx..], &short[j..], false);
- let suffix_len = Self::common_prefix(&long[..idx], &short[..j], true);
+ if best_common.len() < suffix_len + prefix_len {
+ let j_min_suf = j - suffix_len;
+ let j_pls_prf = j + prefix_len;
+ let i_min_suf = idx - suffix_len;
+ let i_pls_prf = idx + prefix_len;
+
+ // Optimizing for bound checks
+ if j_min_suf < j_pls_prf
+ && j_min_suf <= short.len()
+ && j_pls_prf <= short.len()
+ && i_min_suf <= long.len()
+ && i_pls_prf <= long.len()
+ {
+ best_common = &short[j_min_suf..j_pls_prf];
- if best_common.len() < suffix_len + prefix_len {
- best_common = &short[j - suffix_len..j + prefix_len];
+ best_long_a = &long[..i_min_suf];
+ best_long_b = &long[i_pls_prf..];
- best_long_a = &long[..idx - suffix_len];
- best_long_b = &long[idx + prefix_len..];
+ best_short_a = &short[..j_min_suf];
+ best_short_b = &short[j_pls_prf..];
+ }
+ }
- best_short_a = &short[..j - suffix_len];
- best_short_b = &short[j + prefix_len..];
+ j += 1;
+ } else {
+ break;
}
-
- j += 1;
}
- if best_common.len() * 2 >= long.len() {
+ if (best_common.len() << 1) >= long.len() {
Some(HalfMatch {
prefix_long: best_long_a,
suffix_long: best_long_b,
@@ -874,8 +992,6 @@ impl DiffMatchPatch {
// We are doing a binary search here, and I've observed similar performance as noted by https://neil.fraser.name/news/2007/10/09/
// Some benchmark code can be found in benches/prefix.rs
// Reverse prefix is suffix
- // TODO: investigate this further
- #[inline]
fn common_prefix<T: DType>(lhs: &[T], rhs: &[T], reverse: bool) -> usize {
if lhs.is_empty()
|| rhs.is_empty()
@@ -901,42 +1017,50 @@ impl DiffMatchPatch {
)
};
- if lhs[lhsrange] == rhs[rhsrange] {
+ if lhsrange.end <= lhs.len()
+ && rhsrange.end <= rhs.len()
+ && lhs[lhsrange] == rhs[rhsrange]
+ {
pointmin = pointmid;
pointstart = pointmin;
} else {
pointmax = pointmid;
}
- pointmid = (pointmax - pointmin) / 2 + pointmin;
+ pointmid = ((pointmax - pointmin) >> 1) + pointmin; // /2
}
pointmid
}
- #[inline]
+ fn common_prefix_suffix<T: DType>(lhs: &[T], rhs: &[T]) -> (usize, usize) {
+ let common_prefix = Self::common_prefix(lhs, rhs, false);
+ let common_suffix = {
+ if common_prefix < lhs.len() && common_prefix < rhs.len() {
+ Self::common_prefix(&lhs[common_prefix..], &rhs[common_prefix..], true)
+ } else {
+ 0
+ }
+ };
+
+ (common_prefix, common_suffix)
+ }
+
fn common_overlap<T: DType>(lhs: &[T], rhs: &[T]) -> usize {
if lhs.is_empty() || rhs.is_empty() {
return 0;
}
- let minlen = lhs.len().min(rhs.len());
-
// A working set with longer string truncated
- let l = if lhs.len() > rhs.len() {
- &lhs[lhs.len() - rhs.len()..]
+ let (l, r) = if lhs.len() > rhs.len() {
+ (&lhs[lhs.len() - rhs.len()..], rhs)
} else {
- lhs
- };
- let r = if lhs.len() < rhs.len() {
- &rhs[..lhs.len()]
- } else {
- rhs
+ (lhs, &rhs[..lhs.len()])
};
// Quick check for the worst case.
if l == r {
- return minlen;
+ return l.len();
}
// Start by looking for a single character match
@@ -945,28 +1069,18 @@ impl DiffMatchPatch {
let mut len = 1;
let mut best = 0;
- loop {
- let pattern = &l[minlen - len..];
- let found = if let Some(found) = r
- .windows(pattern.len())
- .step_by(1)
- .position(|p| p == pattern)
- {
- found
- } else {
- return best;
- };
-
+ while let Some(found) = r.windows(len).position(|p| p == &l[l.len() - len..]) {
len += found;
- if found == 0 || l[minlen - len..] == r[..len] {
+ if found == 0 || l[l.len() - len..] == r[..len] {
best = len;
len += 1;
}
}
+
+ best
}
// Reduce the number of edits by eliminating semantically trivial equalities
- #[inline]
fn cleanup_semantic<T: DType>(diffs: &mut Vec<Diff<T>>) {
let mut changes = false;
@@ -983,9 +1097,7 @@ impl DiffMatchPatch {
let mut insert_len_post = 0_usize;
let mut delete_len_post = 0_usize;
- let mut difflen = diffs.len();
-
- while pointer < difflen {
+ while pointer < diffs.len() {
let mut diff_mod = false;
if diffs[pointer].op() == Ops::Equal {
equalities.push(pointer);
@@ -1018,8 +1130,6 @@ impl DiffMatchPatch {
diffs.insert(last, Diff::delete(last_eq));
// Change the other copy to insert
diffs[last + 1].0 = Ops::Insert;
- // change diff length
- difflen = diffs.len();
// Throw away the equality we just deleted.
equalities.pop();
@@ -1057,8 +1167,6 @@ impl DiffMatchPatch {
Self::cleanup_semantic_lossless(diffs);
- difflen = diffs.len();
-
// Find any overlaps between deletions and insertions.
// e.g: <del>abcxxx</del><ins>xxxdef</ins>
// -> <del>abc</del>xxx<ins>def</ins>
@@ -1066,13 +1174,13 @@ impl DiffMatchPatch {
// -> <ins>def</ins>xxx<del>abc</del>
// Only extract an overlap if it is as big as the edit ahead or behind it.
pointer = 1;
- while difflen > 0 && pointer < difflen {
+ while !diffs.is_empty() && pointer < diffs.len() {
if diffs[pointer - 1].op() == Ops::Delete && diffs[pointer].op() == Ops::Insert {
let delete = diffs[pointer - 1].data().to_vec();
let insert = diffs[pointer].data().to_vec();
- let delete_thres = delete.len() / 2 + delete.len() % 2;
- let insert_thres = insert.len() / 2 + insert.len() % 2;
+ let delete_thres = (delete.len() >> 1) + delete.len() % 2;
+ let insert_thres = (insert.len() >> 1) + insert.len() % 2;
let overlap_1 = Self::common_overlap(&delete[..], &insert[..]);
let overlap_2 = Self::common_overlap(&insert[..], &delete[..]);
@@ -1084,7 +1192,7 @@ impl DiffMatchPatch {
diffs.insert(pointer, Diff::equal(&insert[..overlap_1]));
diffs[pointer - 1].1 = delete[..delete.len() - overlap_1].to_vec();
diffs[pointer + 1].1 = insert[overlap_1..].to_vec();
- difflen = diffs.len();
+
pointer += 1;
} else if overlap_2 >= delete_thres || overlap_2 >= insert_thres {
// Reverse overlap
@@ -1093,7 +1201,6 @@ impl DiffMatchPatch {
diffs[pointer - 1] = Diff::insert(&insert[..insert.len() - overlap_2]);
diffs[pointer + 1] = Diff::delete(&delete[overlap_2..]);
- difflen = diffs.len();
pointer += 1;
}
pointer += 1;
@@ -1104,19 +1211,13 @@ impl DiffMatchPatch {
// Look for single edits surrounded on both sides by equalities
// e.g: The c<ins>at c</ins>ame. -> The <ins>cat </ins>came.
- #[inline]
fn cleanup_semantic_lossless<T: DType>(diffs: &mut Vec<Diff<T>>) {
let mut pointer = 1_usize;
- let mut difflen = diffs.len();
// Intentionally ignore the first and last element (don't need checking).
- while difflen > 0 && pointer < difflen - 1 {
+ while !diffs.is_empty() && (1..diffs.len() - 1).contains(&pointer) {
// an edit surrounded by equalities
if diffs[pointer - 1].op() == Ops::Equal && diffs[pointer + 1].op() == Ops::Equal {
- // let mut equality_prev = diffs[pointer - 1].data().to_vec();
- // let mut edit = diffs[pointer].data().to_vec();
- // let mut equality_next = diffs[pointer + 1].data().to_vec();
-
// Shift the edit as far left as possible
let (mut equality_prev, mut edit, mut equality_next) = {
let commonlen =
@@ -1151,7 +1252,7 @@ impl DiffMatchPatch {
let mut best_score = Self::cleanup_semantic_score(&equality_prev[..], &edit[..])
+ Self::cleanup_semantic_score(&edit[..], &equality_next[..]);
- while !edit.is_empty() && !equality_next.is_empty() && edit[0] == equality_next[0] {
+ while !edit.is_empty() && edit.first() == equality_next.first() {
equality_prev.push(edit[0]);
edit.remove(0);
edit.push(equality_next[0]);
@@ -1177,7 +1278,6 @@ impl DiffMatchPatch {
} else {
diffs.remove(pointer - 1);
pointer -= 1;
- difflen = diffs.len();
}
diffs[pointer].1.clone_from(&best_edit);
@@ -1187,7 +1287,6 @@ impl DiffMatchPatch {
} else {
diffs.remove(pointer + 1);
pointer -= 1;
- difflen = diffs.len();
}
}
}
@@ -1199,7 +1298,6 @@ impl DiffMatchPatch {
// Given two strings, compute a score representing whether the internal
// boundary falls on logical boundaries
// Scores range from 6 (best) to 0 (worst)
- #[inline]
fn cleanup_semantic_score<T: DType>(one: &[T], two: &[T]) -> u8 {
let (char1, char2) = if let (Some(&char1), Some(&char2)) = (one.last(), two.first()) {
if let (Some(c1), Some(c2)) = (char1.as_char(), char2.as_char()) {
@@ -1248,12 +1346,19 @@ impl DiffMatchPatch {
// Reorder and merge like edit sections. Merge equalities.
// Any edit section can move as long as it doesn't cross an equality.
- #[inline]
fn cleanup_merge<T: DType>(diffs: &mut Vec<Diff<T>>) {
+ Self::cleanup_merge_first(diffs);
+
+ if Self::cleanup_merge_second(diffs) {
+ Self::cleanup_merge(diffs);
+ }
+ }
+
+ fn cleanup_merge_first<T: DType>(diffs: &mut Vec<Diff<T>>) {
// Push a dummy diff ... this triggers the equality as a last step
diffs.push(Diff::equal(&[]));
- let mut difflen = diffs.len();
+ // let mut difflen = diffs.len();
let mut pointer = 0_usize;
@@ -1263,7 +1368,7 @@ impl DiffMatchPatch {
let mut insert_data = vec![];
let mut delete_data = vec![];
- while pointer < difflen {
+ while pointer < diffs.len() {
match diffs[pointer].op() {
Ops::Insert => {
insert_n += 1;
@@ -1276,74 +1381,14 @@ impl DiffMatchPatch {
pointer += 1;
}
Ops::Equal => {
- // Upon reaching an equality, check for prior redundancies.
- if delete_n + insert_n > 1 {
- if delete_n != 0 && insert_n != 0 {
- // Factor out any common prefixies.
- let commonlen =
- Self::common_prefix(&insert_data[..], &delete_data[..], false);
- if commonlen != 0 {
- let tmpidx = pointer - delete_n - insert_n;
- if tmpidx > 0 && diffs[tmpidx - 1].op() == Ops::Equal {
- diffs[tmpidx - 1].1 =
- [&diffs[tmpidx - 1].1[..], &insert_data[..commonlen]]
- .concat();
- } else {
- diffs.insert(0, Diff::equal(&insert_data[..commonlen]));
- pointer += 1;
- }
- insert_data = insert_data[commonlen..].to_vec();
- delete_data = delete_data[commonlen..].to_vec();
- }
-
- // Factor out any common suffixies.
- let commonlen =
- Self::common_prefix(&insert_data[..], &delete_data[..], true);
- if commonlen > 0 {
- diffs[pointer].1 = [
- &insert_data[insert_data.len() - commonlen..],
- diffs[pointer].data(),
- ]
- .concat();
- insert_data = insert_data[..insert_data.len() - commonlen].to_vec();
- delete_data = delete_data[..delete_data.len() - commonlen].to_vec();
- }
- }
-
- // Delete the offending records and add the merged ones.
- pointer -= delete_n + insert_n;
-
- // Reversing because index will not change
- (pointer..pointer + delete_n + insert_n)
- .rev()
- .for_each(|i| {
- diffs.remove(i);
- });
- difflen = diffs.len();
-
- if !delete_data.is_empty() {
- diffs.insert(pointer, Diff::delete(&delete_data));
- pointer += 1;
- difflen = diffs.len();
- }
-
- if !insert_data.is_empty() {
- diffs.insert(pointer, Diff::insert(&insert_data));
- pointer += 1;
- difflen = diffs.len();
- }
-
- pointer += 1;
- } else if pointer != 0 && diffs[pointer - 1].op() == Ops::Equal {
- // Merge this equality with the previous one.;
- diffs[pointer - 1].1 =
- [&diffs[pointer - 1].1[..], diffs[pointer].data()].concat();
- diffs.remove(pointer);
-
- difflen = diffs.len();
- } else {
- pointer += 1;
- }
+ Self::cleanup_merge_proc_equal(
+ diffs,
+ insert_n,
+ delete_n,
+ insert_data,
+ delete_data,
+ &mut pointer,
+ );
insert_n = 0;
delete_n = 0;
@@ -1353,80 +1398,149 @@ impl DiffMatchPatch {
}
}
+ // println!("{pointer} {}", diffs.len());
+
if let Some(dl) = diffs.last() {
if dl.data().is_empty() {
diffs.pop();
}
}
+ }
- difflen = diffs.len();
-
- // Second pass: look for single edits surrounded on both sides by equalities
- // which can be shifted sideways to eliminate an equality.
- // e.g: A<ins>BA</ins>C -> <ins>AB</ins>AC
- pointer = 1;
+ // Second pass: look for single edits surrounded on both sides by equalities
+ // which can be shifted sideways to eliminate an equality.
+ // e.g: A<ins>BA</ins>C -> <ins>AB</ins>AC
+ fn cleanup_merge_second<T: DType>(diffs: &mut Vec<Diff<T>>) -> bool {
let mut changes = false;
+ let mut pointer = 1;
- while difflen > 0 && pointer < difflen - 1 {
- if let (Some(diff_prev), Some(diff), Some(diff_next)) = (
- diffs.get(pointer - 1),
- diffs.get(pointer),
- diffs.get(pointer + 1),
- ) {
- // This is a single edit surrounded by equalities.
- if diff_prev.op() == Ops::Equal && diff_next.op() == Ops::Equal {
- let substr_idx = if diff.size() >= diff_prev.size() {
- diff.size() - diff_prev.size()
- } else {
- 0
- };
- if &diff.data()[substr_idx..] == diff_prev.data() {
- // Shift the edit over the previous equality.
- let new_current_data = [
- diff_prev.data(),
- &diff.data()[..diff.size() - diff_prev.size()],
- ]
- .concat();
- let new_next_data = [diff_prev.data(), diff_next.data()].concat();
-
- diffs[pointer].1 = new_current_data;
- diffs[pointer + 1].1 = new_next_data;
- diffs.remove(pointer - 1);
- difflen = diffs.len();
+ while !diffs.is_empty() && (1..diffs.len() - 1).contains(&pointer) {
+ let (diff_prev, diff, diff_next) =
+ (&diffs[pointer - 1], &diffs[pointer], &diffs[pointer + 1]);
- changes = true;
- } else if &diff.data()[..if diff_next.size() <= diff.size() {
- diff_next.size()
- } else {
- diff.size()
- }] == diff_next.data()
- {
- // Shift the edit over the next equality.
- diffs[pointer - 1].1 =
- [&diffs[pointer - 1].1[..], diffs[pointer + 1].data()].concat();
- diffs[pointer].1 = [
- &diffs[pointer].data()[diffs[pointer + 1].size()..],
- diffs[pointer + 1].data(),
- ]
- .concat();
- diffs.remove(pointer + 1);
- difflen = diffs.len();
+ // This is a single edit surrounded by equalities.
+ if diff_prev.op() == Ops::Equal && diff_next.op() == Ops::Equal {
+ let substr_idx = if diff.size() >= diff_prev.size() {
+ diff.size() - diff_prev.size()
+ } else {
+ 0
+ };
+ if &diff.data()[substr_idx..] == diff_prev.data() {
+ // Shift the edit over the previous equality.
+ let new_current_data = [
+ diff_prev.data(),
+ &diff.data()[..diff.size() - diff_prev.size()],
+ ]
+ .concat();
+ let new_next_data = [diff_prev.data(), diff_next.data()].concat();
- changes = true;
- }
+ diffs[pointer].1 = new_current_data;
+ diffs[pointer + 1].1 = new_next_data;
+ diffs.remove(pointer - 1);
+
+ changes = true;
+ } else if &diff.data()[..if diff_next.size() <= diff.size() {
+ diff_next.size()
+ } else {
+ diff.size()
+ }] == diff_next.data()
+ {
+ // Shift the edit over the next equality.
+ diffs[pointer - 1].1 =
+ [&diffs[pointer - 1].1[..], diffs[pointer + 1].data()].concat();
+ diffs[pointer].1 = [
+ &diffs[pointer].data()[diffs[pointer + 1].size()..],
+ diffs[pointer + 1].data(),
+ ]
+ .concat();
+ diffs.remove(pointer + 1);
+
+ changes = true;
}
}
pointer += 1;
}
- if changes {
- Self::cleanup_merge(diffs);
+ changes
+ }
+
+ // Handling `Equality` in the merge process
+ fn cleanup_merge_proc_equal<T: DType>(
+ diffs: &mut Vec<Diff<T>>,
+ insert_n: usize,
+ delete_n: usize,
+ insert_data: Vec<T>,
+ delete_data: Vec<T>,
+ pointer: &mut usize,
+ ) {
+ let mut insert_data = insert_data;
+ let mut delete_data = delete_data;
+
+ // Upon reaching an equality, check for prior redundancies.
+ if delete_n + insert_n > 1 {
+ if delete_n != 0 && insert_n != 0 && !insert_data.is_empty() && !delete_data.is_empty()
+ {
+ // Factor out any common prefixies.
+ let commonlen = Self::common_prefix(&insert_data[..], &delete_data[..], false);
+ if commonlen != 0 && commonlen < insert_data.len() && commonlen < delete_data.len()
+ {
+ let tmpidx = *pointer - delete_n - insert_n - 1;
+ if (0..diffs.len()).contains(&tmpidx) && diffs[tmpidx].op() == Ops::Equal {
+ diffs[tmpidx].1 =
+ [&diffs[tmpidx].1[..], &insert_data[..commonlen]].concat();
+ } else {
+ diffs.insert(0, Diff::equal(&insert_data[..commonlen]));
+ *pointer += 1;
+ }
+ insert_data = insert_data[commonlen..].to_vec();
+ delete_data = delete_data[commonlen..].to_vec();
+ }
+
+ // Factor out any common suffixies.
+ let commonlen = Self::common_prefix(&insert_data[..], &delete_data[..], true);
+ if commonlen > 0 {
+ diffs[*pointer].1 = [
+ &insert_data[insert_data.len() - commonlen..],
+ diffs[*pointer].data(),
+ ]
+ .concat();
+ insert_data = insert_data[..insert_data.len() - commonlen].to_vec();
+ delete_data = delete_data[..delete_data.len() - commonlen].to_vec();
+ }
+ }
+
+ // Delete the offending records and add the merged ones.
+ *pointer -= delete_n + insert_n;
+
+ // Reversing because index will not change
+ (*pointer..*pointer + delete_n + insert_n)
+ .rev()
+ .for_each(|i| {
+ diffs.remove(i);
+ });
+
+ if !delete_data.is_empty() {
+ diffs.insert(*pointer, Diff::delete(&delete_data));
+ *pointer += 1;
+ }
+
+ if !insert_data.is_empty() {
+ diffs.insert(*pointer, Diff::insert(&insert_data));
+ *pointer += 1;
+ }
+
+ *pointer += 1;
+ } else if pointer != &0 && diffs[*pointer - 1].op() == Ops::Equal {
+ // Merge this equality with the previous one.;
+ diffs[*pointer - 1].1 = [&diffs[*pointer - 1].1[..], diffs[*pointer].data()].concat();
+ diffs.remove(*pointer);
+ } else {
+ *pointer += 1;
}
}
// Reduce the number of edits by eliminating operationally trivial equalities.
- #[inline]
fn cleanup_efficiency<T: DType>(&self, diffs: &mut Vec<Diff<T>>) {
if diffs.is_empty() {
return;
@@ -1484,15 +1598,18 @@ impl DiffMatchPatch {
if let Some(le) = &mut last_eq {
if (pre_ins && pre_del && post_ins && post_del)
- || (le.len() < edit_cost / 2
+ || (le.len() < (edit_cost >> 1)
&& pre_ins as u8 + pre_del as u8 + post_ins as u8 + post_del as u8 == 3)
{
// Duplicate record.
- let item = equalities.pop().unwrap();
- // change the second copy (after the insert in next line) to Insert
- diffs[item].0 = Ops::Insert;
- // add an item
- diffs.insert(item, Diff::delete(le));
+ if let Some(item) = equalities.pop() {
+ if (0..diffs.len()).contains(&item) {
+ // change the second copy (after the insert in next line) to Insert
+ diffs[item].0 = Ops::Insert;
+ // add an item
+ diffs.insert(item, Diff::delete(le));
+ }
+ }
last_eq = None;
@@ -1514,8 +1631,7 @@ impl DiffMatchPatch {
changes = true;
continue;
};
- // pointer = equalitiesLength > 0 ?
- // equalities[equalitiesLength - 1] : -1;
+
post_ins = false;
post_del = false;
}
@@ -1755,13 +1871,11 @@ impl DiffMatchPatch {
let mut linehash: HashMap<&'a [T], usize> = HashMap::new();
// Allocate 2/3rds of the UTF16::MAX (65535) value space for text1, the rest for text2.
- // let mut maxlines = 5;
let mut maxlines = 40000;
let chars_old = Self::lines_to_chars_internal(old, &mut lines, &mut linehash, maxlines);
// This basically represents the U16::MAX value
maxlines = 65535;
- // maxlines = 7;
let chars_new = Self::lines_to_chars_internal(new, &mut lines, &mut linehash, maxlines);
LineToChars {
@@ -1771,7 +1885,6 @@ impl DiffMatchPatch {
}
}
- #[inline]
fn lines_to_chars_internal<'a, T: DType>(
text: &'a [T],
array: &mut Vec<&'a [T]>,
@@ -1780,7 +1893,6 @@ impl DiffMatchPatch {
) -> Vec<usize> {
let take = maxlines - array.len();
- // let mut lines = ;
let mut charlist = Vec::with_capacity(take);
let mut broke = false;
@@ -1806,7 +1918,7 @@ impl DiffMatchPatch {
});
// We broke at max lines, so we'll account for the remaining text
- if broke {
+ if broke && cursor <= text.len() {
let line = &text[cursor..];
let e = hash.entry(line).or_insert(array.len());
// Fresh insert
@@ -1820,25 +1932,20 @@ impl DiffMatchPatch {
charlist
}
- #[inline]
fn chars_to_lines<T: DType>(diffs: &[Diff<usize>], lines: &[&[T]]) -> Vec<Diff<T>> {
diffs
.iter()
.map(|d| {
- let t = d
- .data()
- .iter()
- .map(|&d| {
- *lines.get(d).unwrap() // investigate
- })
- .collect::<Vec<_>>()
- .concat();
+ let mut line = vec![];
+ d.data().iter().for_each(|d| {
+ if let Some(l) = lines.get(*d) {
+ line.append(&mut l.to_vec())
+ }
+ });
- Diff::new(d.op(), &t)
+ Diff::new(d.op(), &line)
})
.collect::<Vec<_>>()
-
- // res
}
}
@@ -1881,7 +1988,6 @@ impl DiffMatchPatch {
// Is there a nearby exact match? (speedup)
if let Some(best_loc) = text
.windows(pattern.len())
- .step_by(1)
.skip(loc)
.position(|p| p == pattern)
.map(|pos| pos + loc)
@@ -1894,7 +2000,6 @@ impl DiffMatchPatch {
// best_loc = text.lastIndexOf(pattern, loc + pattern.length);
if let Some(best_loc_rev) = text
.windows(pattern.len())
- .step_by(1)
.skip(loc)
.rev()
.position(|p| p == pattern)
@@ -2293,7 +2398,7 @@ impl DiffMatchPatch {
.rev()
.position(|p| p == pattern)
.map(|p| text.len() - p - pattern.len())
- && pattern.len() < self.match_max_bits() - patch_margin * 2)
+ && pattern.len() < self.match_max_bits() - (patch_margin << 1))
{
padding += patch_margin;
diff --git a/src/lib.rs b/src/lib.rs
index 06eb3dc..c5c1cdf 100644
--- a/src/lib.rs
+++ b/src/lib.rs
@@ -5,8 +5,7 @@
//! [<img alt="docs.rs" src="https://img.shields.io/badge/docs.rs-diff_match_patch_rs?style=for-the-badge&logo=docs.rs&labelColor=%23555555" height="20">](https://docs.rs/diff-match-patch-rs)
//!
//!
-//! A very **fast**, **accurate** and **wasm ready** port of [Diff Match Patch](https://github.com/dmsnell/diff-match-patch) in Rust. The
-//! diff implementation is based on [Myers' diff algorithm](https://neil.fraser.name/writing/diff/myers.pdf).
+//! **Fastest**, **accurate** and **wasm ready** implementation of [Diff Match Patch](https://github.com/dmsnell/diff-match-patch) in Rust based on [Myers' diff algorithm](https://neil.fraser.name/writing/diff/myers.pdf).
//!
//! ## Highlights of this crate
//! - Exposes two modes of operating with `diff-match-patch`, a `Efficient` mode and `Compat` mode. While the `Efficient` mode squeezes out the max performance the `Compat` mode ensures compatibility across other libraries or implementations (rust or otherwise). According to [Benchmarks](#benchmarks), our slower `Compat` mode is still faster than other implementations in rust.
@@ -19,11 +18,28 @@
//! - Added a `fuzzer` for sanity
//! - Exposes the same APIs as [Diff Match Patch](https://github.com/dmsnell/diff-match-patch) with minor changes to make it more idiomatic in Rust.
//!
+//! ## Benchmarks
+//! Benchmarks are maintained [diff-match-patch-bench repository](https://github.com/AnubhabB/diff-match-patch-rs-bench)
+//!
+//! | Lang. | Library | Diff Avg. | Patch Avg. | Bencher | Mode | Correct |
+//! |:-------:|:----------------------------------------------------------------------------------------:|:---------:|:----------:|:----------:|:-----------:|:-------:|
+//! | `rust` | [diff_match_patch v0.1.1](https://crates.io/crates/diff_match_patch)[^1] | 56.618 ms | 9.00 ms | Criterion | - | ✅ |
+//! | `rust` | [dmp v0.2.0](https://crates.io/crates/dmp) | 56.609 ms | 12.25 ms | Criterion | - | ✅ |
+//! | `rust` | [diff-match-patch-rs](https://github.com/AnubhabB/diff-match-patch-rs.git)<sup>our</sup> | 32.795 ms | 533.17 µs | Criterion | `Efficient`| ✅ |
+//! | `rust` | [diff-match-patch-rs](https://github.com/AnubhabB/diff-match-patch-rs.git)<sup>our</sup> | 33.222 ms | 993.90 µs | Criterion | `Compat` | ✅ |
+//! | `go` | [go-diff](https://github.com/sergi/go-diff)[^2] | 30.31 ms | 118.2 ms | go test | - | ❗ |
+//! | `node` | [diff-match-patch](https://www.npmjs.com/package/diff-match-patch)[^3] | 246.90 ms | 1.07 ms | tinybench | - | ❌ |
+//! | `python`| [diff-match-patch](https://pypi.org/project/diff-match-patch/) | 1.01 s | 0.25 ms | timeit | - | ✅ |
+//!
+//! [^1]: [diff_match_patch](https://crates.io/crates/diff_match_patch) Adds an extra clone to the iterator because the `patch_apply` method takes mutable refc. to `diffs`.
+//! [^2]: [go-diff](https://github.com/sergi/go-diff) is technically a correct implementation but it generates far more diffs than any other implementation I've tested. E.g. In our test data [here](https://github.com/AnubhabB/diff-match-patch-rs-bench/testdata) the go lib ends up generating `~3000` diffs while other implementations are generating `~2300` diffs. My guess is one of the cleanup passes are skipped by `go-diff`.That's probably the reason behind unreasonably high `patch` timings.
+//! [^3]: [diff-match-patch](https://www.npmjs.com/package/diff-match-patch) generated `patch text` and `delta` breaks on `unicode surrogates`.
+//!
//! ## Usage Examples
//!
//! ```toml
//! [dependencies]
-//! diff-match-patch-rs = "0.3.2"
+//! diff-match-patch-rs = "0.4.0"
//! ```
//!
//! ### `Effitient` mode
@@ -176,23 +192,6 @@
//!
//! <div class="warning">The `Effitient` and `Compat` modes are mutually exclusive and will not generate correct output if used interchangibly at source and destination</div>
//!
-//! ## Benchmarks
-//! Benchmarks are maintained [diff-match-patch-bench repository](https://github.com/AnubhabB/diff-match-patch-rs-bench)
-//!
-//! | Lang. | Library | Diff Avg. | Patch Avg. | Bencher | Mode | Correct |
-//! |:-------:|:----------------------------------------------------------------------------------------:|:---------:|:----------:|:----------:|:-----------:|:-------:|
-//! | `rust` | [diff_match_patch v0.1.1](https://crates.io/crates/diff_match_patch)[^2] | 68.108 ms | 10.596 ms | Criterion | - | ✅ |
-//! | `rust` | [dmp v0.2.0](https://crates.io/crates/dmp) | 69.019 ms | 14.654 ms | Criterion | - | ✅ |
-//! | `rust` | [diff-match-patch-rs](https://github.com/AnubhabB/diff-match-patch-rs.git)<sup>our</sup> | 64.66 ms | 631.13 µs | Criterion | `Efficient` | ✅ |
-//! | `rust` | [diff-match-patch-rs](https://github.com/AnubhabB/diff-match-patch-rs.git)<sup>our</sup> | 64.68 ms | 1.1703 ms | Criterion | `Compat` | ✅ |
-//! | `go` | [go-diff](https://github.com/sergi/go-diff) | 50.31 ms | 135.2 ms | go test | - | ✅ |
-//! | `node` | [diff-match-patch](https://www.npmjs.com/package/diff-match-patch)[^1] | 246.90 ms | 1.07 ms | tinybench | - | ❌ |
-//! | `python`| [diff-match-patch](https://pypi.org/project/diff-match-patch/) | 1.01 s | 0.25 ms | timeit | - | ✅ |
-//!
-//! [^1]: [diff-match-patch](https://www.npmjs.com/package/diff-match-patch) generated `patch text` and `delta` breaks on `unicode surrogates`.
-//! [^2]: Adds an extra clone to the iterator because the `patch_apply` method takes mutable refc. to `diffs`.
-//!
-//!
//! ## Gotchas
//! **Diff incompatibility with `JavaScript` libs**:
//!