Simplified README text, examples, and corrected typos

Dave Jarvis 2019-09-19 17:21:44 -07:00
parent 9f80565b53
commit 864df8140f

342
README.md

@ -10,204 +10,233 @@ Aho-Corasick
Dependency
----------
Include this dependency in your POM. Be sure to check for the latest version in Maven Central.
```xml
<dependency>
<groupId>org.ahocorasick</groupId>
<artifactId>ahocorasick</artifactId>
<version>0.4.0</version>
</dependency>
<dependency>
<groupId>org.ahocorasick</groupId>
<artifactId>ahocorasick</artifactId>
<version>0.5.0</version>
</dependency>
```
Introduction
------------
Nowadays most free-text searching is based on Lucene-like approaches, where the search text is parsed into its
various components. For every keyword a lookup is done to see where it occurs. When looking for a couple of keywords
this approach is great. But what about it if you are not looking for just a couple of keywords, but a 100,000 of
them? Like, for example, checking against a dictionary?
This is where the Aho-Corasick algorithm shines. Instead of chopping up the search text, it uses all the keywords
to build up a construct called a [Trie](http://en.wikipedia.org/wiki/Trie). There are three crucial components
to Aho-Corasick:
Most free-text searching is based on Lucene-like approaches, where the
search text is parsed into its various components. For every keyword a
lookup is done to see where it occurs. When looking for a couple of keywords
this approach is great, but when searching for 100,000 words, the approach
is quite slow (for example, checking against a dictionary).
The Aho-Corasick algorithm shines when looking for multiple words.
Rather than chop up the search text, it uses all the keywords to build
a [Trie](http://en.wikipedia.org/wiki/Trie) construct. The crucial
Aho-Corasick components include:
* goto
* fail
* output
Every character encountered is presented to a state object within the *goto* structure. If there is a matching state,
that will be elevated to the new current state.
Every character encountered is presented to a state object within the
*goto* structure. If there is a matching state, that will be elevated to
the new current state.
However, if there is no matching state, the algorithm will signal a *fail* and fall back to states with less depth
(ie, a match less long) and proceed from there, until it found a matching state, or it has reached the root state.
However, if there is no matching state, the algorithm will signal a
*fail* and fall back to states with less depth (i.e., a shorter match)
and proceed from there, until it finds a matching state or reaches
the root state.
Whenever a state is reached that matches an entire keyword, it is emitted to an *output* set which can be read after
the entire scan has completed.
Whenever a state is reached that matches an entire keyword, it is
emitted to an *output* set which can be read after the entire scan
has completed.
The beauty of the algorithm is that it is O(n). No matter how many keywords you have, or how big the search text is,
the performance will decline in a linear way.
Scanning is O(n) in the length of the search text, plus the number of
matches emitted. No matter how many keywords are given, the scan time
grows only linearly with the size of the search text.
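The *goto*, *fail*, and *output* components described above can be sketched in plain Java. This is a minimal illustration that does not use this library; the class `AhoCorasickSketch`, its `State` type, and the `keyword@start-end` result format are all hypothetical names chosen for the example, and the library's own implementation may differ.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Hypothetical, self-contained sketch; not the library's implementation.
class AhoCorasickSketch {
    /** One trie node holding the three components named in the text. */
    static final class State {
        final Map<Character, State> next = new HashMap<>(); // goto edges
        State fail;                                          // fail link
        final List<String> output = new ArrayList<>();       // keywords ending here
    }

    /** Returns matches as "keyword@start-end" (inclusive, zero-based positions). */
    static List<String> search(List<String> keywords, String text) {
        State root = new State();

        // goto: insert every keyword into the trie.
        for (String keyword : keywords) {
            State state = root;
            for (char c : keyword.toCharArray()) {
                state = state.next.computeIfAbsent(c, k -> new State());
            }
            state.output.add(keyword);
        }

        // fail: breadth-first, so shallower states are finished first.
        Queue<State> queue = new ArrayDeque<>();
        for (State child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            State state = queue.remove();
            for (Map.Entry<Character, State> edge : state.next.entrySet()) {
                State child = edge.getValue();
                State fallback = state.fail;
                while (fallback != root && !fallback.next.containsKey(edge.getKey())) {
                    fallback = fallback.fail;
                }
                child.fail = fallback.next.getOrDefault(edge.getKey(), root);
                // output: inherit keywords that end at the fail target.
                child.output.addAll(child.fail.output);
                queue.add(child);
            }
        }

        // Scan: one pass over the text, following goto or fail per character.
        List<String> matches = new ArrayList<>();
        State state = root;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            while (state != root && !state.next.containsKey(c)) {
                state = state.fail;
            }
            state = state.next.getOrDefault(c, root);
            for (String keyword : state.output) {
                matches.add(keyword + "@" + (i - keyword.length() + 1) + "-" + i);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(search(List.of("hers", "his", "she", "he"), "ushers"));
    }
}
```

The text is visited once from left to right; the fail links are what let the scan continue without re-reading earlier characters.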
Some examples you could use the Aho-Corasick algorithm for:
* looking for certain words in texts in order to URL link or emphasize them
* adding semantics to plain text
* checking against a dictionary to see if syntactic errors were made
The Aho-Corasick algorithm can help:
This library is the Java implementation of the afore-mentioned Aho-Corasick algorithm for efficient string matching.
The algorithm is explained in great detail in the white paper written by
Aho and Corasick: http://cr.yp.to/bib/1975/aho.pdf
* find words in texts to link or emphasize them;
* add semantics to plain text; or
* check against a dictionary to see if syntactic errors were made.
See the [white paper](http://cr.yp.to/bib/1975/aho.pdf) by Aho and
Corasick for algorithmic details.
Usage
-----
Setting up the Trie is a piece of cake:
Set up the Trie using a builder as follows:
```java
Trie trie = Trie.builder()
Trie trie = Trie.builder()
.addKeyword("hers")
.addKeyword("his")
.addKeyword("she")
.addKeyword("he")
.build();
Collection<Emit> emits = trie.parseText("ushers");
```
The collection will contain `Emit` objects that match:
* "she" starting at position 1, ending at position 3
* "he" starting at position 2, ending at position 3
* "hers" starting at position 2, ending at position 5
In situations where overlapping instances are not desired, retain
the longest and left-most matches by calling `ignoreOverlaps()`:
```java
Trie trie = Trie.builder()
.ignoreOverlaps()
.addKeyword("hot")
.addKeyword("hot chocolate")
.build();
Collection<Emit> emits = trie.parseText("hot chocolate");
```
The `ignoreOverlaps()` method tells the Trie to remove all overlapping
matches. For this it relies on the following conflict resolution rules:
1. longer matches prevail over shorter matches; and
1. left-most prevails over right-most.
Only one result is returned:
* "hot chocolate" starting at position 0, ending at position 12
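The two conflict resolution rules can be illustrated with a small greedy filter in plain Java. This is only a sketch under stated assumptions: `OverlapSketch`, `Match`, and `removeOverlaps` are hypothetical names, and the library may resolve overlaps with a different data structure.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of "longest, then left-most" overlap removal.
class OverlapSketch {
    /** A match with inclusive, zero-based start and end positions. */
    static final class Match {
        final String keyword;
        final int start;
        final int end;

        Match(String keyword, int start, int end) {
            this.keyword = keyword;
            this.start = start;
            this.end = end;
        }

        int length() {
            return end - start + 1;
        }

        boolean overlaps(Match other) {
            return start <= other.end && other.start <= end;
        }
    }

    static List<Match> removeOverlaps(List<Match> matches) {
        List<Match> candidates = new ArrayList<>(matches);
        // Rule 1: longer matches first. Rule 2: ties go to the left-most match.
        candidates.sort((a, b) -> a.length() != b.length()
                ? Integer.compare(b.length(), a.length())
                : Integer.compare(a.start, b.start));

        List<Match> kept = new ArrayList<>();
        for (Match candidate : candidates) {
            boolean clash = false;
            for (Match winner : kept) {
                if (candidate.overlaps(winner)) {
                    clash = true;
                    break;
                }
            }
            if (!clash) {
                kept.add(candidate);
            }
        }
        // Return the survivors in text order.
        kept.sort((a, b) -> Integer.compare(a.start, b.start));
        return kept;
    }

    public static void main(String[] args) {
        List<Match> kept = removeOverlaps(List.of(
                new Match("hot", 0, 2),
                new Match("hot chocolate", 0, 12)));
        for (Match match : kept) {
            System.out.println(match.keyword + " " + match.start + "-" + match.end);
        }
    }
}
```

The nested loop keeps the sketch short; it is quadratic in the number of matches, which is fine for illustration but not for large result sets.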
To check for whole words exclusively, call `onlyWholeWords()` as follows:
```java
Trie trie = Trie.builder()
.onlyWholeWords()
.addKeyword("sugar")
.build();
Collection<Emit> emits = trie.parseText("sugarcane sugar canesugar");
```
Only one match is found, whereas without calling `onlyWholeWords()` three
matches are found. The matches inside sugarcane and canesugar are discarded
because they are partial matches.
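A whole-word check can be sketched as a test on the characters that surround a match. The example below is plain Java and does not call the library; `WholeWordSketch` and its methods are hypothetical, and treating "not alphabetic" as a word boundary is an assumption made for illustration.

```java
// Hypothetical sketch: a match counts as a whole word when the characters
// immediately before and after it are not alphabetic (an assumption).
class WholeWordSketch {
    static boolean isWholeWord(String text, int start, int end) {
        boolean leftOk = start == 0
                || !Character.isAlphabetic(text.charAt(start - 1));
        boolean rightOk = end == text.length() - 1
                || !Character.isAlphabetic(text.charAt(end + 1));
        return leftOk && rightOk;
    }

    /** Counts every occurrence of keyword, whole word or not. */
    static int countAll(String text, String keyword) {
        int count = 0;
        for (int at = text.indexOf(keyword); at >= 0; at = text.indexOf(keyword, at + 1)) {
            count++;
        }
        return count;
    }

    /** Counts only the occurrences that pass the whole-word check. */
    static int countWholeWords(String text, String keyword) {
        int count = 0;
        for (int at = text.indexOf(keyword); at >= 0; at = text.indexOf(keyword, at + 1)) {
            if (isWholeWord(text, at, at + keyword.length() - 1)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "sugarcane sugar canesugar";
        System.out.println(countAll(text, "sugar"));
        System.out.println(countWholeWords(text, "sugar"));
    }
}
```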
Some text is `WrItTeN` in mixed case, which makes it hard to identify.
Instruct the Trie to convert the search text to lowercase to ease the
matching process. The lower-casing applies to keywords as well.
```java
Trie trie = Trie.builder()
.ignoreCase()
.addKeyword("casing")
.build();
Collection<Emit> emits = trie.parseText("CaSiNg");
```
Normally, this match would not be found. By calling `ignoreCase()`,
the entire search text is made lowercase before matching begins.
Therefore it will find exactly one match.
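Lowercasing before matching can be sketched without the library as follows. `IgnoreCaseSketch` and `findOriginalCasing` are hypothetical names, and the sketch assumes that lowercasing does not change the length of the text, which holds for ASCII input but not for every Unicode character.

```java
import java.util.Locale;

// Hypothetical sketch: match on lowercased copies, then use the match
// position to read the original casing back from the untouched text.
class IgnoreCaseSketch {
    static String findOriginalCasing(String text, String keyword) {
        String loweredText = text.toLowerCase(Locale.ROOT);
        String loweredKeyword = keyword.toLowerCase(Locale.ROOT);
        int start = loweredText.indexOf(loweredKeyword);
        if (start < 0) {
            return null; // no match, even when case is ignored
        }
        // Assumes positions in the lowered text line up with the original.
        return text.substring(start, start + loweredKeyword.length());
    }

    public static void main(String[] args) {
        System.out.println(findOriginalCasing("Mixed CaSiNg here", "casing"));
    }
}
```

Because the match positions refer to the original text, the original casing remains available after a case-insensitive match.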
It is also possible to just ask whether the text matches any of
the keywords, or just to return the first match it finds.
```java
Trie trie = Trie.builder().ignoreOverlaps()
.addKeyword("ab")
.addKeyword("cba")
.addKeyword("ababc")
.build();
Emit firstMatch = trie.firstMatch("ababcbab");
```
The value for `firstMatch` will be "ababc" from position 0. The
`containsMatch()` method checks whether `firstMatch` found a match and
returns `true` if that is the case.
For a barebones Aho-Corasick algorithm with a custom emit handler use:
```java
Trie trie = Trie.builder()
.addKeyword("hers")
.addKeyword("his")
.addKeyword("she")
.addKeyword("he")
.build();
Collection<Emit> emits = trie.parseText("ushers");
```
You can now read the set. In this case it will find the following:
* "she" starting at position 1, ending at position 3
* "he" starting at position 2, ending at position 3
* "hers" starting at position 2, ending at position 5
final List<Emit> emits = new ArrayList<>();
EmitHandler emitHandler = new EmitHandler() {
In normal situations you probably want to remove overlapping instances, retaining the longest and left-most
matches.
```java
Trie trie = Trie.builder()
.ignoreOverlaps()
.addKeyword("hot")
.addKeyword("hot chocolate")
.build();
Collection<Emit> emits = trie.parseText("hot chocolate");
```
The ignoreOverlaps method tells the Trie to remove all overlapping matches. For this it relies on the following
conflict resolution rules: 1) longer matches prevail over shorter matches, 2) left-most prevails over right-most.
There is only one result now:
* "hot chocolate" starting at position 0, ending at position 12
If you want the algorithm to only check for whole words, you can tell the Trie to do so:
```java
Trie trie = Trie.builder()
.onlyWholeWords()
.addKeyword("sugar")
.build();
Collection<Emit> emits = trie.parseText("sugarcane sugarcane sugar canesugar");
```
In this case, it will only find one match, whereas it would normally find four. The sugarcane/canesugar words
are discarded because they are partial matches.
Some text is WrItTeN in a combination of lowercase and uppercase and therefore hard to identify. You can instruct
the Trie to lowercase the entire searchtext to ease the matching process. The lower-casing extends to keywords as well.
```java
Trie trie = Trie.builder()
.ignoreCase()
.addKeyword("casing")
.build();
Collection<Emit> emits = trie.parseText("CaSiNg");
```
Normally, this match would not be found. With the ignoreCase settings the entire search text is lowercased
before the matching begins. Therefore it will find exactly one match. Since you still have control of the original
search text and you will know exactly where the match was, you can still utilize the original casing.
It is also possible to just ask whether the text matches any of the keywords, or just to return the first match it
finds.
```java
Trie trie = Trie.builder().ignoreOverlaps()
.addKeyword("ab")
.addKeyword("cba")
.addKeyword("ababc")
.build();
Emit firstMatch = trie.firstMatch("ababcbab");
```
The firstMatch will now be "ababc" found at position 0. containsMatch just checks if there is a firstMatch and
returns true if that is the case.
If you just want the barebones Aho-Corasick algorithm (ie, no dealing with case insensitivity, overlaps and whole
words) and you prefer to add your own handler to the mix, that is also possible.
```java
Trie trie = Trie.builder()
.addKeyword("hers")
.addKeyword("his")
.addKeyword("she")
.addKeyword("he")
.build();
final List<Emit> emits = new ArrayList<>();
EmitHandler emitHandler = new EmitHandler() {
@Override
public void emit(Emit emit) {
emits.add(emit);
}
};
```
In many cases you may want to do useful stuff with both the non-matching and the matching text. In this case, you
might be better served by using the Trie.tokenize(). It allows you to loop over the entire text and deal with
matches as soon as you encounter them. Let's look at an example where we want to highlight words from HGttG in HTML:
```java
String speech = "The Answer to the Great Question... Of Life, " +
"the Universe and Everything... Is... Forty-two,' said " +
"Deep Thought, with infinite majesty and calm.";
Trie trie = Trie.builder().ignoreOverlaps().onlyWholeWords().ignoreCase()
.addKeyword("great question")
.addKeyword("forty-two")
.addKeyword("deep thought")
.build();
Collection<Token> tokens = trie.tokenize(speech);
StringBuffer html = new StringBuffer();
html.append("<html><body><p>");
for (Token token : tokens) {
if (token.isMatch()) {
html.append("<i>");
}
html.append(token.getFragment());
if (token.isMatch()) {
html.append("</i>");
}
@Override
public void emit(Emit emit) {
emits.add(emit);
}
html.append("</p></body></html>");
System.out.println(html);
};
```
In many cases you may want to perform tasks with both the non-matching
and the matching text. Such implementations may be better served by using
`Trie.tokenize()`. The `tokenize()` method allows looping over the
corpus to deal with matches as soon as they are encountered. Here's an
example that outputs key words as italicized HTML elements:
```java
String speech = "The Answer to the Great Question... Of Life, " +
"the Universe and Everything... Is... Forty-two,' said " +
"Deep Thought, with infinite majesty and calm.";
Trie trie = Trie.builder().ignoreOverlaps().onlyWholeWords().ignoreCase()
.addKeyword("great question")
.addKeyword("forty-two")
.addKeyword("deep thought")
.build();
Collection<Token> tokens = trie.tokenize(speech);
StringBuilder html = new StringBuilder();
html.append("<html><body><p>");
for (Token token : tokens) {
if (token.isMatch()) {
html.append("<i>");
}
html.append(token.getFragment());
if (token.isMatch()) {
html.append("</i>");
}
}
html.append("</p></body></html>");
System.out.println(html);
```
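The idea behind tokenizing, splitting a text into alternating matching and non-matching fragments, can be sketched in plain Java. This does not use the library's `Token` type; `TokenizeSketch`, `Fragment`, and the `int[] {start, end}` span format are hypothetical, and the spans are assumed to be sorted and non-overlapping.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of splitting text around already-known match spans.
class TokenizeSketch {
    static final class Fragment {
        final String text;
        final boolean match;

        Fragment(String text, boolean match) {
            this.text = text;
            this.match = match;
        }
    }

    /**
     * Each span is {start, end} with inclusive, zero-based positions.
     * Spans are assumed to be sorted by start and to not overlap.
     */
    static List<Fragment> tokenize(String text, List<int[]> spans) {
        List<Fragment> fragments = new ArrayList<>();
        int cursor = 0;
        for (int[] span : spans) {
            if (span[0] > cursor) {
                // Text between the previous match and this one.
                fragments.add(new Fragment(text.substring(cursor, span[0]), false));
            }
            fragments.add(new Fragment(text.substring(span[0], span[1] + 1), true));
            cursor = span[1] + 1;
        }
        if (cursor < text.length()) {
            // Trailing text after the last match.
            fragments.add(new Fragment(text.substring(cursor), false));
        }
        return fragments;
    }

    public static void main(String[] args) {
        StringBuilder html = new StringBuilder();
        for (Fragment fragment : tokenize("a hot drink", List.of(new int[] {2, 4}))) {
            html.append(fragment.match ? "<i>" + fragment.text + "</i>" : fragment.text);
        }
        System.out.println(html);
    }
}
```

Requiring non-overlapping spans is what makes a single left-to-right pass sufficient, which is presumably why the README example combines tokenizing with `ignoreOverlaps()`.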
You can also emit custom outputs. This might for example be useful to implement a trivial named entity
recognizer. In this case use a PayloadTrie instead of a Trie:
```java
class Word {
private final String gender;
public Word(String gender) {
this.gender = gender;
}
class Word {
private final String gender;
public Word(String gender) {
this.gender = gender;
}
PayloadTrie<Word> trie = PayloadTrie.<Word>builder()
.addKeyword("hers", new Word("f"))
.addKeyword("his", new Word("m"))
.addKeyword("she", new Word("f"))
.addKeyword("he", new Word("m"))
.build();
Collection<PayloadEmit<Word>> emits = trie.parseText("ushers");
}
PayloadTrie<Word> trie = PayloadTrie.<Word>builder()
.addKeyword("hers", new Word("f"))
.addKeyword("his", new Word("m"))
.addKeyword("she", new Word("f"))
.addKeyword("he", new Word("m"))
.addKeyword("nonbinary", new Word("nb"))
.addKeyword("transgender", new Word("tg"))
.build();
Collection<PayloadEmit<Word>> emits = trie.parseText("ushers");
```
Releases
--------
Information on the aho-corasick [releases](https://github.com/robert-bor/aho-corasick/releases).
See [releases](https://github.com/robert-bor/aho-corasick/releases) for details.
License
-------
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
@ -219,3 +248,4 @@ License
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.