diff --git a/README.md b/README.md
index 51985c5..2025adb 100644
--- a/README.md
+++ b/README.md
@@ -10,204 +10,233 @@ Aho-Corasick
 Dependency
 ----------
+
 Include this dependency in your POM. Be sure to check for the latest version in Maven Central.
+
 ```xml
-  <dependency>
-      <groupId>org.ahocorasick</groupId>
-      <artifactId>ahocorasick</artifactId>
-      <version>0.4.0</version>
-  </dependency>
+<dependency>
+  <groupId>org.ahocorasick</groupId>
+  <artifactId>ahocorasick</artifactId>
+  <version>0.5.0</version>
+</dependency>
 ```

 Introduction
 ------------
-Nowadays most free-text searching is based on Lucene-like approaches, where the search text is parsed into its
-various components. For every keyword a lookup is done to see where it occurs. When looking for a couple of keywords
-this approach is great. But what about it if you are not looking for just a couple of keywords, but a 100,000 of
-them? Like, for example, checking against a dictionary?
-This is where the Aho-Corasick algorithm shines. Instead of chopping up the search text, it uses all the keywords
-to build up a construct called a [Trie](http://en.wikipedia.org/wiki/Trie). There are three crucial components
-to Aho-Corasick:
+Most free-text searching is based on Lucene-like approaches, where the
+search text is parsed into its various components. For every keyword a
+lookup is done to see where it occurs. When looking for a couple of keywords
+this approach is great, but when searching for 100,000 words, the approach
+is quite slow (for example, checking against a dictionary).
+
+The Aho-Corasick algorithm shines when looking for multiple words.
+Rather than chop up the search text, it uses all the keywords to build
+a [Trie](http://en.wikipedia.org/wiki/Trie) construct. The crucial
+Aho-Corasick components include:
+
 * goto
 * fail
 * output

-Every character encountered is presented to a state object within the *goto* structure. If there is a matching state,
-that will be elevated to the new current state.
+Every character encountered is presented to a state object within the
+*goto* structure. If there is a matching state, that will be elevated to
+the new current state.

-However, if there is no matching state, the algorithm will signal a *fail* and fall back to states with less depth
-(ie, a match less long) and proceed from there, until it found a matching state, or it has reached the root state.
+However, if there is no matching state, the algorithm will signal a
+*fail* and fall back to states with less depth (i.e., a shorter match)
+and proceed from there, until it finds a matching state or reaches
+the root state.

-Whenever a state is reached that matches an entire keyword, it is emitted to an *output* set which can be read after
-the entire scan has completed.
+Whenever a state is reached that matches an entire keyword, it is
+emitted to an *output* set which can be read after the entire scan
+has completed.

-The beauty of the algorithm is that it is O(n). No matter how many keywords you have, or how big the search text is,
-the performance will decline in a linear way.
+The algorithm is O(n) in the length of the search text. No matter how many
+keywords are given, search time grows linearly with the text (plus matches).

-Some examples you could use the Aho-Corasick algorithm for:
-* looking for certain words in texts in order to URL link or emphasize them
-* adding semantics to plain text
-* checking against a dictionary to see if syntactic errors were made
+The Aho-Corasick algorithm can help:

-This library is the Java implementation of the afore-mentioned Aho-Corasick algorithm for efficient string matching.
-The algorithm is explained in great detail in the white paper written by
-Aho and Corasick: http://cr.yp.to/bib/1975/aho.pdf
+* find words in texts to link or emphasize them;
+* add semantics to plain text; or
+* check against a dictionary to see if syntactic errors were made.
+
+See the [white paper](http://cr.yp.to/bib/1975/aho.pdf) by Aho and
+Corasick for algorithmic details.

 Usage
 -----
-Setting up the Trie is a piece of cake:
+Set up the Trie using a builder as follows:
+
 ```java
-    Trie trie = Trie.builder()
+Trie trie = Trie.builder()
+    .addKeyword("hers")
+    .addKeyword("his")
+    .addKeyword("she")
+    .addKeyword("he")
+    .build();
+Collection<Emit> emits = trie.parseText("ushers");
+```
+
+The collection will contain `Emit` objects that match:
+
+* "she" starting at position 1, ending at position 3
+* "he" starting at position 2, ending at position 3
+* "hers" starting at position 2, ending at position 5
+
+In situations where overlapping instances are not desired, retain
+the longest and left-most matches by calling `ignoreOverlaps()`:
+
+```java
+Trie trie = Trie.builder()
+    .ignoreOverlaps()
+    .addKeyword("hot")
+    .addKeyword("hot chocolate")
+    .build();
+Collection<Emit> emits = trie.parseText("hot chocolate");
+```
+
+The `ignoreOverlaps()` method tells the Trie to remove all overlapping
+matches. For this it relies on the following conflict resolution rules:
+
+1. longer matches prevail over shorter matches; and
+1. left-most prevails over right-most.
+
+Only one result is returned:
+
+* "hot chocolate" starting at position 0, ending at position 12
+
+To check for whole words exclusively, call `onlyWholeWords()` as follows:
+
+```java
+Trie trie = Trie.builder()
+    .onlyWholeWords()
+    .addKeyword("sugar")
+    .build();
+Collection<Emit> emits = trie.parseText("sugarcane sugar canesugar");
+```
+
+Only one match is found; whereas, without calling `onlyWholeWords()` three
+matches are found. The sugarcane/canesugar words are discarded because
+they are partial matches.
+
+Some text is `WrItTeN` in mixed case, which makes it hard to identify.
+Instruct the Trie to convert the search text to lowercase to ease the
+matching process. The lower-casing applies to keywords as well.
+
+```java
+Trie trie = Trie.builder()
+    .ignoreCase()
+    .addKeyword("casing")
+    .build();
+Collection<Emit> emits = trie.parseText("CaSiNg");
+```
+
+Normally, this match would not be found. By calling `ignoreCase()`,
+the entire search text is made lowercase before matching begins.
+Therefore it will find exactly one match.
+
+It is also possible to just ask whether the text matches any of
+the keywords, or just to return the first match it finds.
+
+```java
+Trie trie = Trie.builder().ignoreOverlaps()
+    .addKeyword("ab")
+    .addKeyword("cba")
+    .addKeyword("ababc")
+    .build();
+Emit firstMatch = trie.firstMatch("ababcbab");
+```
+
+The value for `firstMatch` will be "ababc" from position 0. The
+`containsMatch()` method checks whether `firstMatch` found a match and
+returns `true` if that is the case.
+
+For a barebones Aho-Corasick algorithm with a custom emit handler use:
+
+```java
+Trie trie = Trie.builder()
         .addKeyword("hers")
         .addKeyword("his")
         .addKeyword("she")
         .addKeyword("he")
         .build();
-    Collection<Emit> emits = trie.parseText("ushers");
-```
-You can now read the set. In this case it will find the following:
-* "she" starting at position 1, ending at position 3
-* "he" starting at position 2, ending at position 3
-* "hers" starting at position 2, ending at position 5
+final List<Emit> emits = new ArrayList<>();
+EmitHandler emitHandler = new EmitHandler() {

-In normal situations you probably want to remove overlapping instances, retaining the longest and left-most
-matches.
-
-```java
-    Trie trie = Trie.builder()
-        .ignoreOverlaps()
-        .addKeyword("hot")
-        .addKeyword("hot chocolate")
-        .build();
-    Collection<Emit> emits = trie.parseText("hot chocolate");
-```
-
-The ignoreOverlaps method tells the Trie to remove all overlapping matches. For this it relies on the following
-conflict resolution rules: 1) longer matches prevail over shorter matches, 2) left-most prevails over right-most.
-There is only one result now:
-* "hot chocolate" starting at position 0, ending at position 12
-
-If you want the algorithm to only check for whole words, you can tell the Trie to do so:
-
-```java
-    Trie trie = Trie.builder()
-        .onlyWholeWords()
-        .addKeyword("sugar")
-        .build();
-    Collection<Emit> emits = trie.parseText("sugarcane sugarcane sugar canesugar");
-```
-
-In this case, it will only find one match, whereas it would normally find four. The sugarcane/canesugar words
-are discarded because they are partial matches.
-
-Some text is WrItTeN in a combination of lowercase and uppercase and therefore hard to identify. You can instruct
-the Trie to lowercase the entire searchtext to ease the matching process. The lower-casing extends to keywords as well.
-
-```java
-    Trie trie = Trie.builder()
-        .ignoreCase()
-        .addKeyword("casing")
-        .build();
-    Collection<Emit> emits = trie.parseText("CaSiNg");
-```
-
-Normally, this match would not be found. With the ignoreCase settings the entire search text is lowercased
-before the matching begins. Therefore it will find exactly one match. Since you still have control of the original
-search text and you will know exactly where the match was, you can still utilize the original casing.
-
-It is also possible to just ask whether the text matches any of the keywords, or just to return the first match it
-finds.
-
-```java
-    Trie trie = Trie.builder().ignoreOverlaps()
-        .addKeyword("ab")
-        .addKeyword("cba")
-        .addKeyword("ababc")
-        .build();
-    Emit firstMatch = trie.firstMatch("ababcbab");
-```
-
-The firstMatch will now be "ababc" found at position 0. containsMatch just checks if there is a firstMatch and
-returns true if that is the case.
-
-If you just want the barebones Aho-Corasick algorithm (ie, no dealing with case insensitivity, overlaps and whole
- words) and you prefer to add your own handler to the mix, that is also possible.
-
-```java
-    Trie trie = Trie.builder()
-        .addKeyword("hers")
-        .addKeyword("his")
-        .addKeyword("she")
-        .addKeyword("he")
-        .build();
-
-    final List<Emit> emits = new ArrayList<>();
-    EmitHandler emitHandler = new EmitHandler() {
-
-        @Override
-        public void emit(Emit emit) {
-            emits.add(emit);
-        }
-    };
-```
-
-In many cases you may want to do useful stuff with both the non-matching and the matching text. In this case, you
-might be better served by using the Trie.tokenize(). It allows you to loop over the entire text and deal with
-matches as soon as you encounter them. Let's look at an example where we want to highlight words from HGttG in HTML:
-
-```java
-    String speech = "The Answer to the Great Question... Of Life, " +
-        "the Universe and Everything... Is... Forty-two,' said " +
-        "Deep Thought, with infinite majesty and calm.";
-    Trie trie = Trie.builder().ignoreOverlaps().onlyWholeWords().ignoreCase()
-        .addKeyword("great question")
-        .addKeyword("forty-two")
-        .addKeyword("deep thought")
-        .build();
-    Collection<Token> tokens = trie.tokenize(speech);
-    StringBuffer html = new StringBuffer();
-    html.append("<html><body><p>");
-    for (Token token : tokens) {
-        if (token.isMatch()) {
-            html.append("<i>");
-        }
-        html.append(token.getFragment());
-        if (token.isMatch()) {
-            html.append("</i>");
-        }
+    @Override
+    public void emit(Emit emit) {
+        emits.add(emit);
     }
-    html.append("</p></body></html>");
-    System.out.println(html);
+};
+```
+
+In many cases you may want to perform tasks with both the non-matching
+and the matching text. Such implementations may be better served by using
+`Trie.tokenize()`. The `tokenize()` method allows looping over the
+corpus to deal with matches as soon as they are encountered. Here's an
+example that outputs keywords as italicized HTML elements:
+
+```java
+String speech = "The Answer to the Great Question... Of Life, " +
+    "the Universe and Everything... Is... Forty-two,' said " +
+    "Deep Thought, with infinite majesty and calm.";
+
+Trie trie = Trie.builder().ignoreOverlaps().onlyWholeWords().ignoreCase()
+    .addKeyword("great question")
+    .addKeyword("forty-two")
+    .addKeyword("deep thought")
+    .build();
+
+Collection<Token> tokens = trie.tokenize(speech);
+StringBuilder html = new StringBuilder();
+html.append("<html><body><p>");
+
+for (Token token : tokens) {
+    if (token.isMatch()) {
+        html.append("<i>");
+    }
+    html.append(token.getFragment());
+    if (token.isMatch()) {
+        html.append("</i>");
+    }
+}
+
+html.append("</p></body></html>");
+System.out.println(html);
 ```

 You can also emit custom outputs. This might for example be useful to implement a trivial named entity recognizer. In this case use a PayloadTrie instead of a Trie:

 ```java
-    class Word {
-        private final String gender;
-        public Word(String gender) {
-            this.gender = gender;
-        }
+class Word {
+    private final String gender;
+    public Word(String gender) {
+        this.gender = gender;
     }
-
-    PayloadTrie<Word> trie = PayloadTrie.<Word>builder()
-        .addKeyword("hers", new Word("f")
-        .addKeyword("his", new Word("m"))
-        .addKeyword("she", new Word("f"))
-        .addKeyword("he", new Word("m"))
-        .build();
-    Collection<PayloadEmit<Word>> emits = trie.parseText("ushers");
+}
+
+PayloadTrie<Word> trie = PayloadTrie.<Word>builder()
+    .addKeyword("hers", new Word("f"))
+    .addKeyword("his", new Word("m"))
+    .addKeyword("she", new Word("f"))
+    .addKeyword("he", new Word("m"))
+    .addKeyword("nonbinary", new Word("nb"))
+    .addKeyword("transgender", new Word("tg"))
+    .build();
+Collection<PayloadEmit<Word>> emits = trie.parseText("ushers");
 ```

 Releases
 --------
-Information on the aho-corasick [releases](https://github.com/robert-bor/aho-corasick/releases).
+
+See [releases](https://github.com/robert-bor/aho-corasick/releases) for details.

 License
 -------
+
 Licensed under the Apache License, Version 2.0 (the "License");
 you may not use this file except in compliance with the License.
 You may obtain a copy of the License at
@@ -219,3 +248,4 @@ License
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
+