Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add new GREL function to calculate the edit distance #6613

Open
zyadtaha opened this issue May 20, 2024 · 3 comments · May be fixed by #6628
Open

Add new GREL function to calculate the edit distance #6613

zyadtaha opened this issue May 20, 2024 · 3 comments · May be fixed by #6628
Labels
grel The default expression language, GREL, could be improved in many ways! Status: Pending Review Indicates that the issue or pull request is awaiting review by project maintainers or collaborators Type: Feature Request Identifies requests for new features or enhancements. These involve proposing new improvements.

Comments

@zyadtaha
Copy link
Contributor

It would be easier if we could calculate number of edits required to make one value perfectly match another using GREL.

Proposed solution

Provide a new builtin GREL function, called editDistance(), that takes 2 strings and return the the minimum number of single-character edits (insertions, deletions or substitutions) required to change one string into the other

For example:
editDistance("New York", "newyork") -> 3
editDistance(“M. Makeba”, “Miriam Makeba”) -> 5

Alternatives considered

Using Jython

@zyadtaha zyadtaha added Type: Feature Request Identifies requests for new features or enhancements. These involve proposing new improvements. Status: Pending Review Indicates that the issue or pull request is awaiting review by project maintainers or collaborators grel The default expression language, GREL, could be improved in many ways! labels May 20, 2024
@thadguidry
Copy link
Member

  1. Also, do we code our own, and optimize and use only a single 1-D array?
    https://www.geeksforgeeks.org/edit-distance-dp-5/

  2. or use Apache Commons Text and then which algorithm for edit distance?

Which algorithm would be best if folks want to compute edit distances between two VERY VERY large strings (one version of a books' contents to another version of a books' content) ?

The reason there are many algorithms is that there is no 1 size fits all.
We might have to offer 2 or 3 editDistance functions, or name the edit distance functions appropriately for the underlying algorithm and document the functions well.

@dino2580
Copy link
Contributor

#6627 (comment) Can you suggest some improvements

@thadguidry
Copy link
Member

thadguidry commented May 24, 2024

@dino2580 Yes. Don't roll our own edit distance implementation. Instead, use the standard highly optimized Apache Commons Text library that already has intelligence for using 2D or 1D cost tables for edit distance using the Levenshtein distance metric. Code details in https://github.com/apache/commons-text/blob/master/src/main/java/org/apache/commons/text/similarity/LevenshteinDistance.java

You might not know this, but "edit distance" can be calculated differently based on if the algorithm allows removal, insertion, substitution, or transposition of a character in a string. We "could" use the Jaro-Winkler algorithm for returning the edit distance... but it would be more standard practice for a generic default editDistance() GREL function to use the Levenshtein distance algorithm (and one implemented like in Apache Commons Text that is already coded and highly optimized for string lengths and decreased memory usage)

Anyways, you can simply create the GREL function editDistance() to use the library's LevenshteinDistance with the threshold being an optional argument supplied in the GREL syntax.

editDistance(String s1, String s2, Integer threshold (optional))

returns the Levenshtein distance between two Strings.

The LevenshteinDistance already has an apply method that you use, and will use either unlimitedCompare or limitedCompare (2D or 1D cost tables, respectively) as you can see in the above Code details link.
Here's the API itself that you would use in the editDistance function: https://commons.apache.org/proper/commons-text/apidocs/org/apache/commons/text/similarity/LevenshteinDistance.html

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
grel The default expression language, GREL, could be improved in many ways! Status: Pending Review Indicates that the issue or pull request is awaiting review by project maintainers or collaborators Type: Feature Request Identifies requests for new features or enhancements. These involve proposing new improvements.
Projects
None yet
3 participants