Add new GREL function to calculate the edit distance #6613
Comments
Which algorithm would be best if folks want to compute edit distances between two VERY large strings (for example, the contents of one version of a book against another version)? The reason there are so many algorithms is that no one size fits all.
#6627 (comment) Can you suggest some improvements?
@dino2580 Yes. Don't roll our own edit distance implementation. Instead, use the standard, highly optimized Apache Commons Text library, which already has the logic for using 2D or 1D cost tables when computing edit distance with the Levenshtein metric. Code details are in https://github.com/apache/commons-text/blob/master/src/main/java/org/apache/commons/text/similarity/LevenshteinDistance.java You might not know this, but "edit distance" can be calculated differently depending on whether the algorithm allows removal, insertion, substitution, or transposition of a character in a string. We "could" use the Jaro-Winkler algorithm for returning the edit distance... but it would be more standard practice for a generic default Anyways, you can simply create the GREL function
The LevenshteinDistance already has an |
It would be easier if we could calculate the number of edits required to make one value perfectly match another using GREL.
Proposed solution
Provide a new built-in GREL function, called editDistance(), that takes 2 strings and returns the minimum number of single-character edits (insertions, deletions or substitutions) required to change one string into the other.
For example:
editDistance("New York", "newyork")
->3
editDistance("M. Makeba", "Miriam Makeba")
->5
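For reference, here is a minimal sketch of the metric described above (insertions, deletions and substitutions only, no transpositions), written in plain Python since Jython is the alternative being considered. The function name edit_distance is only illustrative and is not an existing GREL or OpenRefine function. It keeps just two rows of the cost table at a time, which is one way to read the "1D cost table" idea mentioned in the comments; it is not the Apache Commons Text implementation.

```python
def edit_distance(a, b):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions or substitutions needed to turn a into b.
    Hypothetical helper, for illustration only."""
    # Make b the shorter string so each row is as small as possible.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] = distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb, or match
            ))
        previous = current
    return previous[-1]


print(edit_distance("New York", "newyork"))          # 3
print(edit_distance("M. Makeba", "Miriam Makeba"))   # 5
```

The comparison is case-sensitive, which is why "New York" vs "newyork" costs 3 (two case substitutions plus the removed space). Memory use grows with the shorter string only, but running time is still proportional to the product of the two lengths, so this sketch alone does not answer the question about book-sized inputs.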
Alternatives considered
Using Jython