Simplified README text, examples, and corrected typos
parent
9f80565b53
commit
864df8140f
342
README.md
@@ -10,204 +10,233 @@ Aho-Corasick
|
||||
|
||||
Dependency
|
||||
----------
|
||||
|
||||
Include this dependency in your POM. Be sure to check for the latest version in Maven Central.
|
||||
|
||||
```xml
|
||||
<dependency>
|
||||
<groupId>org.ahocorasick</groupId>
|
||||
<artifactId>ahocorasick</artifactId>
|
||||
<version>0.4.0</version>
|
||||
</dependency>
|
||||
<dependency>
|
||||
<groupId>org.ahocorasick</groupId>
|
||||
<artifactId>ahocorasick</artifactId>
|
||||
<version>0.5.0</version>
|
||||
</dependency>
|
||||
```
|
||||
|
||||
Introduction
|
||||
------------
|
||||
Nowadays most free-text searching is based on Lucene-like approaches, where the search text is parsed into its
|
||||
various components. For every keyword a lookup is done to see where it occurs. When looking for a couple of keywords
|
||||
this approach is great. But what about it if you are not looking for just a couple of keywords, but a 100,000 of
|
||||
them? Like, for example, checking against a dictionary?
|
||||
|
||||
This is where the Aho-Corasick algorithm shines. Instead of chopping up the search text, it uses all the keywords
|
||||
to build up a construct called a [Trie](http://en.wikipedia.org/wiki/Trie). There are three crucial components
|
||||
to Aho-Corasick:
|
||||
Most free-text searching is based on Lucene-like approaches, where the
|
||||
search text is parsed into its various components. For every keyword a
|
||||
lookup is done to see where it occurs. When looking for a couple of keywords
|
||||
this approach is great, but when searching for 100,000 words, the approach
|
||||
is quite slow (for example, checking against a dictionary).
|
||||
|
||||
The Aho-Corasick algorithm shines when looking for multiple words.
|
||||
Rather than chop up the search text, it uses all the keywords to build
|
||||
a [Trie](http://en.wikipedia.org/wiki/Trie) construct. The crucial
|
||||
Aho-Corasick components include:
|
||||
|
||||
* goto
|
||||
* fail
|
||||
* output
|
||||
|
||||
Every character encountered is presented to a state object within the *goto* structure. If there is a matching state,
|
||||
that will be elevated to the new current state.
|
||||
Every character encountered is presented to a state object within the
|
||||
*goto* structure. If there is a matching state, that will be elevated to
|
||||
the new current state.
|
||||
|
||||
However, if there is no matching state, the algorithm will signal a *fail* and fall back to states with less depth
|
||||
(ie, a match less long) and proceed from there, until it found a matching state, or it has reached the root state.
|
||||
However, if there is no matching state, the algorithm will signal a
|
||||
*fail* and fall back to states with less depth (i.e., a shorter match)
|
||||
and proceed from there until it finds a matching state or reaches
|
||||
the root state.
|
||||
|
||||
Whenever a state is reached that matches an entire keyword, it is emitted to an *output* set which can be read after
|
||||
the entire scan has completed.
|
||||
Whenever a state is reached that matches an entire keyword, it is
|
||||
emitted to an *output* set which can be read after the entire scan
|
||||
has completed.
|
||||
|
||||
The beauty of the algorithm is that it is O(n). No matter how many keywords you have, or how big the search text is,
|
||||
the performance will decline in a linear way.
|
||||
Matching is O(n) in the length of the search text, plus the number of
matches found. No matter how many keywords are given, or how large
|
||||
the search text is, the performance will decline linearly.
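
To make the roles of *goto*, *fail*, and *output* concrete, here is a minimal, self-contained sketch of the technique. It is a teaching aid only: the class name `AhoCorasickSketch`, the `State` type, and the `keyword@start-end` result format are assumptions made for illustration and are not part of this library's API.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Teaching sketch only: names are hypothetical, and this is not
// the library's implementation.
class AhoCorasickSketch {

    static final class State {
        final Map<Character, State> next = new HashMap<>(); // goto transitions
        State fail;                                          // fail link
        final List<String> output = new ArrayList<>();       // keywords ending here
    }

    private final State root = new State();

    AhoCorasickSketch(List<String> keywords) {
        // Build the goto structure: one path per keyword (keywords assumed non-empty).
        for (String keyword : keywords) {
            State state = root;
            for (char c : keyword.toCharArray()) {
                state = state.next.computeIfAbsent(c, k -> new State());
            }
            state.output.add(keyword);
        }
        // Breadth-first pass: shallower states get their fail links first.
        Queue<State> queue = new ArrayDeque<>();
        for (State child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            State state = queue.remove();
            for (Map.Entry<Character, State> entry : state.next.entrySet()) {
                State child = entry.getValue();
                State fallback = state.fail;
                while (fallback != root && !fallback.next.containsKey(entry.getKey())) {
                    fallback = fallback.fail;
                }
                State target = fallback.next.get(entry.getKey());
                child.fail = (target != null) ? target : root;
                // Keywords that end at the fail target also end here.
                child.output.addAll(child.fail.output);
                queue.add(child);
            }
        }
    }

    // Scans the text once and reports each match as keyword@start-end (inclusive).
    List<String> search(String text) {
        List<String> matches = new ArrayList<>();
        State state = root;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            while (state != root && !state.next.containsKey(c)) {
                state = state.fail; // no goto transition: fall back to a shorter match
            }
            State target = state.next.get(c);
            state = (target != null) ? target : root;
            for (String keyword : state.output) {
                matches.add(keyword + "@" + (i - keyword.length() + 1) + "-" + i);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        AhoCorasickSketch sketch =
                new AhoCorasickSketch(Arrays.asList("hers", "his", "she", "he"));
        System.out.println(sketch.search("ushers"));
        // prints [she@1-3, he@2-3, hers@2-5]
    }
}
```

The text is scanned once, character by character; the keyword set only affects the size of the structure built up front.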
|
||||
|
||||
Some examples you could use the Aho-Corasick algorithm for:
|
||||
* looking for certain words in texts in order to URL link or emphasize them
|
||||
* adding semantics to plain text
|
||||
* checking against a dictionary to see if syntactic errors were made
|
||||
The Aho-Corasick algorithm can help:
|
||||
|
||||
This library is the Java implementation of the afore-mentioned Aho-Corasick algorithm for efficient string matching.
|
||||
The algorithm is explained in great detail in the white paper written by
|
||||
Aho and Corasick: http://cr.yp.to/bib/1975/aho.pdf
|
||||
* find words in texts to link or emphasize them;
|
||||
* add semantics to plain text; or
|
||||
* check against a dictionary to see if syntactic errors were made.
|
||||
|
||||
See the [white paper](http://cr.yp.to/bib/1975/aho.pdf) by Aho and
|
||||
Corasick for algorithmic details.
|
||||
|
||||
Usage
|
||||
-----
|
||||
Setting up the Trie is a piece of cake:
|
||||
Set up the Trie using a builder as follows:
|
||||
|
||||
```java
|
||||
Trie trie = Trie.builder()
|
||||
Trie trie = Trie.builder()
|
||||
.addKeyword("hers")
|
||||
.addKeyword("his")
|
||||
.addKeyword("she")
|
||||
.addKeyword("he")
|
||||
.build();
|
||||
Collection<Emit> emits = trie.parseText("ushers");
|
||||
```
|
||||
|
||||
The collection will contain `Emit` objects that match:
|
||||
|
||||
* "she" starting at position 1, ending at position 3
|
||||
* "he" starting at position 2, ending at position 3
|
||||
* "hers" starting at position 2, ending at position 5
|
||||
|
||||
In situations where overlapping instances are not desired, retain
|
||||
the longest and left-most matches by calling `ignoreOverlaps()`:
|
||||
|
||||
```java
|
||||
Trie trie = Trie.builder()
|
||||
.ignoreOverlaps()
|
||||
.addKeyword("hot")
|
||||
.addKeyword("hot chocolate")
|
||||
.build();
|
||||
Collection<Emit> emits = trie.parseText("hot chocolate");
|
||||
```
|
||||
|
||||
The `ignoreOverlaps()` method tells the Trie to remove all overlapping
|
||||
matches. For this it relies on the following conflict resolution rules:
|
||||
|
||||
1. longer matches prevail over shorter matches; and
|
||||
1. left-most prevails over right-most.
|
||||
|
||||
Only one result is returned:
|
||||
|
||||
* "hot chocolate" starting at position 0, ending at position 12
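
The two conflict resolution rules can be sketched independently of the library. The following is an illustration under stated assumptions, not the library's code: `OverlapSketch`, `Match`, and `removeOverlaps` are hypothetical names, and positions are inclusive as in the results listed above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Teaching sketch only: names are hypothetical, not the library's API.
class OverlapSketch {

    static final class Match {
        final int start;
        final int end; // inclusive
        final String keyword;

        Match(int start, int end, String keyword) {
            this.start = start;
            this.end = end;
            this.keyword = keyword;
        }

        int length() {
            return end - start + 1;
        }

        boolean overlaps(Match other) {
            return start <= other.end && other.start <= end;
        }

        @Override
        public String toString() {
            return keyword + "@" + start + "-" + end;
        }
    }

    static List<Match> removeOverlaps(List<Match> matches) {
        List<Match> byPriority = new ArrayList<>(matches);
        // Rule 1: longer matches first. Rule 2: on equal length, left-most first.
        byPriority.sort((a, b) -> a.length() != b.length()
                ? Integer.compare(b.length(), a.length())
                : Integer.compare(a.start, b.start));
        List<Match> kept = new ArrayList<>();
        for (Match candidate : byPriority) {
            // Keep a candidate only if it overlaps nothing of higher priority.
            if (kept.stream().noneMatch(candidate::overlaps)) {
                kept.add(candidate);
            }
        }
        // Report the survivors in text order.
        kept.sort((a, b) -> Integer.compare(a.start, b.start));
        return kept;
    }

    public static void main(String[] args) {
        List<Match> found = Arrays.asList(
                new Match(0, 2, "hot"),
                new Match(0, 12, "hot chocolate"));
        System.out.println(removeOverlaps(found));
        // prints [hot chocolate@0-12]
    }
}
```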
|
||||
|
||||
To check for whole words exclusively, call `onlyWholeWords()` as follows:
|
||||
|
||||
```java
|
||||
Trie trie = Trie.builder()
|
||||
.onlyWholeWords()
|
||||
.addKeyword("sugar")
|
||||
.build();
|
||||
Collection<Emit> emits = trie.parseText("sugarcane sugar canesugar");
|
||||
```
|
||||
|
||||
Only one match is found; whereas, without calling `onlyWholeWords()` three
|
||||
matches are found. The sugarcane/canesugar words are discarded because
|
||||
they are partial matches.
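
One way to picture the whole-word filter is as a boundary check on each raw match. The sketch below assumes that a match counts as a whole word when the characters immediately before and after it are not alphabetic; that criterion, and the names `WholeWordSketch`, `isWholeWord`, and `wholeWordStarts`, are assumptions for illustration rather than the library's documented behavior.

```java
import java.util.ArrayList;
import java.util.List;

// Teaching sketch only: names and the boundary rule are assumptions.
class WholeWordSketch {

    // True when the match at [start, end] (inclusive) is not glued to a letter.
    static boolean isWholeWord(String text, int start, int end) {
        boolean leftOk = start == 0
                || !Character.isAlphabetic(text.charAt(start - 1));
        boolean rightOk = end + 1 == text.length()
                || !Character.isAlphabetic(text.charAt(end + 1));
        return leftOk && rightOk;
    }

    // Start positions of every occurrence of keyword that is a whole word.
    static List<Integer> wholeWordStarts(String text, String keyword) {
        List<Integer> starts = new ArrayList<>();
        if (keyword.isEmpty()) {
            return starts; // nothing sensible to match
        }
        for (int at = text.indexOf(keyword); at >= 0; at = text.indexOf(keyword, at + 1)) {
            if (isWholeWord(text, at, at + keyword.length() - 1)) {
                starts.add(at);
            }
        }
        return starts;
    }

    public static void main(String[] args) {
        System.out.println(wholeWordStarts("sugarcane sugar canesugar", "sugar"));
        // prints [10]
    }
}
```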
|
||||
|
||||
Some text is `WrItTeN` in mixed case, which makes it hard to identify.
|
||||
Instruct the Trie to convert the search text to lowercase to ease the
|
||||
matching process. The lower-casing applies to keywords as well.
|
||||
|
||||
```java
|
||||
Trie trie = Trie.builder()
|
||||
.ignoreCase()
|
||||
.addKeyword("casing")
|
||||
.build();
|
||||
Collection<Emit> emits = trie.parseText("CaSiNg");
|
||||
```
|
||||
|
||||
Normally, this match would not be found. By calling `ignoreCase()`,
|
||||
the entire search text is made lowercase before matching begins.
|
||||
Therefore it will find exactly one match.
|
||||
|
||||
It is also possible to just ask whether the text matches any of
|
||||
the keywords, or just to return the first match it finds.
|
||||
|
||||
```java
|
||||
Trie trie = Trie.builder().ignoreOverlaps()
|
||||
.addKeyword("ab")
|
||||
.addKeyword("cba")
|
||||
.addKeyword("ababc")
|
||||
.build();
|
||||
Emit firstMatch = trie.firstMatch("ababcbab");
|
||||
```
|
||||
|
||||
The value for `firstMatch` will be "ababc" from position 0. The
|
||||
`containsMatch()` method checks whether `firstMatch` found a match and
|
||||
returns `true` if that is the case.
|
||||
|
||||
For a barebones Aho-Corasick algorithm with a custom emit handler use:
|
||||
|
||||
```java
|
||||
Trie trie = Trie.builder()
|
||||
.addKeyword("hers")
|
||||
.addKeyword("his")
|
||||
.addKeyword("she")
|
||||
.addKeyword("he")
|
||||
.build();
|
||||
Collection<Emit> emits = trie.parseText("ushers");
|
||||
```
|
||||
|
||||
You can now read the set. In this case it will find the following:
|
||||
* "she" starting at position 1, ending at position 3
|
||||
* "he" starting at position 2, ending at position 3
|
||||
* "hers" starting at position 2, ending at position 5
|
||||
final List<Emit> emits = new ArrayList<>();
|
||||
EmitHandler emitHandler = new EmitHandler() {
|
||||
|
||||
In normal situations you probably want to remove overlapping instances, retaining the longest and left-most
|
||||
matches.
|
||||
|
||||
```java
|
||||
Trie trie = Trie.builder()
|
||||
.ignoreOverlaps()
|
||||
.addKeyword("hot")
|
||||
.addKeyword("hot chocolate")
|
||||
.build();
|
||||
Collection<Emit> emits = trie.parseText("hot chocolate");
|
||||
```
|
||||
|
||||
The ignoreOverlaps method tells the Trie to remove all overlapping matches. For this it relies on the following
|
||||
conflict resolution rules: 1) longer matches prevail over shorter matches, 2) left-most prevails over right-most.
|
||||
There is only one result now:
|
||||
* "hot chocolate" starting at position 0, ending at position 12
|
||||
|
||||
If you want the algorithm to only check for whole words, you can tell the Trie to do so:
|
||||
|
||||
```java
|
||||
Trie trie = Trie.builder()
|
||||
.onlyWholeWords()
|
||||
.addKeyword("sugar")
|
||||
.build();
|
||||
Collection<Emit> emits = trie.parseText("sugarcane sugarcane sugar canesugar");
|
||||
```
|
||||
|
||||
In this case, it will only find one match, whereas it would normally find four. The sugarcane/canesugar words
|
||||
are discarded because they are partial matches.
|
||||
|
||||
Some text is WrItTeN in a combination of lowercase and uppercase and therefore hard to identify. You can instruct
|
||||
the Trie to lowercase the entire searchtext to ease the matching process. The lower-casing extends to keywords as well.
|
||||
|
||||
```java
|
||||
Trie trie = Trie.builder()
|
||||
.ignoreCase()
|
||||
.addKeyword("casing")
|
||||
.build();
|
||||
Collection<Emit> emits = trie.parseText("CaSiNg");
|
||||
```
|
||||
|
||||
Normally, this match would not be found. With the ignoreCase settings the entire search text is lowercased
|
||||
before the matching begins. Therefore it will find exactly one match. Since you still have control of the original
|
||||
search text and you will know exactly where the match was, you can still utilize the original casing.
|
||||
|
||||
It is also possible to just ask whether the text matches any of the keywords, or just to return the first match it
|
||||
finds.
|
||||
|
||||
```java
|
||||
Trie trie = Trie.builder().ignoreOverlaps()
|
||||
.addKeyword("ab")
|
||||
.addKeyword("cba")
|
||||
.addKeyword("ababc")
|
||||
.build();
|
||||
Emit firstMatch = trie.firstMatch("ababcbab");
|
||||
```
|
||||
|
||||
The firstMatch will now be "ababc" found at position 0. containsMatch just checks if there is a firstMatch and
|
||||
returns true if that is the case.
|
||||
|
||||
If you just want the barebones Aho-Corasick algorithm (ie, no dealing with case insensitivity, overlaps and whole
|
||||
words) and you prefer to add your own handler to the mix, that is also possible.
|
||||
|
||||
```java
|
||||
Trie trie = Trie.builder()
|
||||
.addKeyword("hers")
|
||||
.addKeyword("his")
|
||||
.addKeyword("she")
|
||||
.addKeyword("he")
|
||||
.build();
|
||||
|
||||
final List<Emit> emits = new ArrayList<>();
|
||||
EmitHandler emitHandler = new EmitHandler() {
|
||||
|
||||
@Override
|
||||
public void emit(Emit emit) {
|
||||
emits.add(emit);
|
||||
}
|
||||
};
|
||||
```
|
||||
|
||||
In many cases you may want to do useful stuff with both the non-matching and the matching text. In this case, you
|
||||
might be better served by using the Trie.tokenize(). It allows you to loop over the entire text and deal with
|
||||
matches as soon as you encounter them. Let's look at an example where we want to highlight words from HGttG in HTML:
|
||||
|
||||
```java
|
||||
String speech = "The Answer to the Great Question... Of Life, " +
|
||||
"the Universe and Everything... Is... Forty-two,' said " +
|
||||
"Deep Thought, with infinite majesty and calm.";
|
||||
Trie trie = Trie.builder().ignoreOverlaps().onlyWholeWords().ignoreCase()
|
||||
.addKeyword("great question")
|
||||
.addKeyword("forty-two")
|
||||
.addKeyword("deep thought")
|
||||
.build();
|
||||
Collection<Token> tokens = trie.tokenize(speech);
|
||||
StringBuffer html = new StringBuffer();
|
||||
html.append("<html><body><p>");
|
||||
for (Token token : tokens) {
|
||||
if (token.isMatch()) {
|
||||
html.append("<i>");
|
||||
}
|
||||
html.append(token.getFragment());
|
||||
if (token.isMatch()) {
|
||||
html.append("</i>");
|
||||
}
|
||||
@Override
|
||||
public void emit(Emit emit) {
|
||||
emits.add(emit);
|
||||
}
|
||||
html.append("</p></body></html>");
|
||||
System.out.println(html);
|
||||
};
|
||||
```
|
||||
|
||||
In many cases you may want to perform tasks with both the non-matching
|
||||
and the matching text. Such implementations may be better served by using
|
||||
`Trie.tokenize()`. The `tokenize()` method allows looping over the
|
||||
corpus to deal with matches as soon as they are encountered. Here's an
|
||||
example that outputs keywords as italicized HTML elements:
|
||||
|
||||
```java
|
||||
String speech = "The Answer to the Great Question... Of Life, " +
|
||||
"the Universe and Everything... Is... Forty-two,' said " +
|
||||
"Deep Thought, with infinite majesty and calm.";
|
||||
|
||||
Trie trie = Trie.builder().ignoreOverlaps().onlyWholeWords().ignoreCase()
|
||||
.addKeyword("great question")
|
||||
.addKeyword("forty-two")
|
||||
.addKeyword("deep thought")
|
||||
.build();
|
||||
|
||||
Collection<Token> tokens = trie.tokenize(speech);
|
||||
StringBuilder html = new StringBuilder();
|
||||
html.append("<html><body><p>");
|
||||
|
||||
for (Token token : tokens) {
|
||||
if (token.isMatch()) {
|
||||
html.append("<i>");
|
||||
}
|
||||
html.append(token.getFragment());
|
||||
if (token.isMatch()) {
|
||||
html.append("</i>");
|
||||
}
|
||||
}
|
||||
|
||||
html.append("</p></body></html>");
|
||||
System.out.println(html);
|
||||
```
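
Conceptually, tokenizing splits the text into alternating non-matching and matching fragments. The stand-alone sketch below shows that idea with hard-coded match positions in place of a real matcher; `TokenizeSketch` and `italicize` are hypothetical names, and the spans are assumed to be sorted, non-overlapping, and inclusive.

```java
// Teaching sketch only: names are hypothetical, not the library's API.
class TokenizeSketch {

    // Wraps each [start, end] span (inclusive) in <i> tags and copies
    // the text between spans unchanged.
    static String italicize(String text, int[][] spans) {
        StringBuilder html = new StringBuilder();
        int cursor = 0;
        for (int[] span : spans) {
            html.append(text, cursor, span[0]);      // non-matching fragment
            html.append("<i>")
                .append(text, span[0], span[1] + 1)  // matching fragment
                .append("</i>");
            cursor = span[1] + 1;
        }
        html.append(text, cursor, text.length());    // trailing non-match
        return html.toString();
    }

    public static void main(String[] args) {
        // "Deep Thought" occupies positions 5 through 16 of this text.
        System.out.println(italicize("said Deep Thought, calmly", new int[][] {{5, 16}}));
        // prints said <i>Deep Thought</i>, calmly
    }
}
```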
|
||||
|
||||
You can also emit custom outputs. This might for example be useful to implement a trivial named entity
|
||||
recognizer. In this case use a PayloadTrie instead of a Trie:
|
||||
|
||||
```java
|
||||
class Word {
|
||||
private final String gender;
|
||||
public Word(String gender) {
|
||||
this.gender = gender;
|
||||
}
|
||||
class Word {
|
||||
private final String gender;
|
||||
public Word(String gender) {
|
||||
this.gender = gender;
|
||||
}
|
||||
|
||||
PayloadTrie<Word> trie = PayloadTrie.<Word>builder()
|
||||
.addKeyword("hers", new Word("f"))
|
||||
.addKeyword("his", new Word("m"))
|
||||
.addKeyword("she", new Word("f"))
|
||||
.addKeyword("he", new Word("m"))
|
||||
.build();
|
||||
Collection<PayloadEmit<Word>> emits = trie.parseText("ushers");
|
||||
}
|
||||
|
||||
PayloadTrie<Word> trie = PayloadTrie.<Word>builder()
|
||||
.addKeyword("hers", new Word("f"))
|
||||
.addKeyword("his", new Word("m"))
|
||||
.addKeyword("she", new Word("f"))
|
||||
.addKeyword("he", new Word("m"))
|
||||
.addKeyword("nonbinary", new Word("nb"))
|
||||
.addKeyword("transgender", new Word("tg"))
|
||||
.build();
|
||||
Collection<PayloadEmit<Word>> emits = trie.parseText("ushers");
|
||||
```
|
||||
|
||||
Releases
|
||||
--------
|
||||
Information on the aho-corasick [releases](https://github.com/robert-bor/aho-corasick/releases).
|
||||
|
||||
See [releases](https://github.com/robert-bor/aho-corasick/releases) for details.
|
||||
|
||||
License
|
||||
-------
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License");
|
||||
you may not use this file except in compliance with the License.
|
||||
You may obtain a copy of the License at
|
||||
@@ -219,3 +248,4 @@ License
|
||||
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
See the License for the specific language governing permissions and
|
||||
limitations under the License.
|
||||
|
||||
|
||||