added explanation on the algorithm
parent
d140afc0da
commit
e1da9bd274
63
README.md
@@ -1,4 +1,63 @@
Aho-Corasick
============

Java implementation of the Aho-Corasick algorithm for efficient string matching

Introduction
------------
Nowadays most free-text searching is based on Lucene-like approaches, where the search text is parsed into its
various components. For every keyword a lookup is done to see where it occurs. When looking for a couple of keywords
this approach is great. But what if you are looking not for a couple of keywords, but for 100,000 of them, for
example when checking a text against a dictionary?

This is where the Aho-Corasick algorithm shines. Instead of chopping up the search text, it uses all the keywords
to build up a construct called a [Trie](http://en.wikipedia.org/wiki/Trie). There are three crucial components
to Aho-Corasick:
* goto
* fail
* output

Every character encountered is presented to the current state object within the goto structure. If there is a
matching state for that character, it becomes the new current state.

However, if there is no matching state, the algorithm falls back to states with less depth (i.e., a shorter match)
and proceeds from there until it finds a matching state or reaches the root state.

Whenever a state is reached that matches an entire keyword, that keyword is emitted to an output set, which can be
read after the entire scan has completed.

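The goto, fail and output components described above can be sketched in a few dozen lines. This is a minimal
illustration under stated assumptions, not this library's implementation: the class `AhoCorasickSketch`, its nested
`State` and the `keyword@start` result format are hypothetical names chosen for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Hypothetical sketch; these are not the classes shipped with this library.
class AhoCorasickSketch {

    static final class State {
        final Map<Character, State> next = new HashMap<>(); // goto: character -> deeper state
        State fail;                                          // fail: state for the longest shorter match
        final List<String> output = new ArrayList<>();       // output: keywords ending in this state
    }

    private final State root = new State();

    AhoCorasickSketch(List<String> keywords) {
        // Build the goto structure: one path through the trie per keyword.
        for (String keyword : keywords) {
            State state = root;
            for (char c : keyword.toCharArray()) {
                state = state.next.computeIfAbsent(c, k -> new State());
            }
            state.output.add(keyword);
        }
        // Build the fail links breadth-first, so shallower states are finished first.
        Queue<State> queue = new ArrayDeque<>();
        for (State child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            State state = queue.remove();
            for (Map.Entry<Character, State> entry : state.next.entrySet()) {
                State child = entry.getValue();
                State fallback = state.fail;
                while (fallback != root && !fallback.next.containsKey(entry.getKey())) {
                    fallback = fallback.fail;
                }
                State target = fallback.next.get(entry.getKey());
                child.fail = (target != null) ? target : root;
                // Keywords that end at the fail state also end here.
                child.output.addAll(child.fail.output);
                queue.add(child);
            }
        }
    }

    // Scans the text once and returns each match as "keyword@startIndex".
    List<String> search(String text) {
        List<String> found = new ArrayList<>();
        State state = root;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // No goto transition: fall back to shallower states until one fits or the root is reached.
            while (state != root && !state.next.containsKey(c)) {
                state = state.fail;
            }
            state = state.next.getOrDefault(c, root);
            for (String keyword : state.output) {
                found.add(keyword + "@" + (i - keyword.length() + 1));
            }
        }
        return found;
    }

    public static void main(String[] args) {
        AhoCorasickSketch sketch = new AhoCorasickSketch(Arrays.asList("hers", "his", "she", "he"));
        System.out.println(sketch.search("ushers")); // prints [she@1, he@2, hers@2]
    }
}
```

Copying the outputs of the fail state into each state at build time keeps the scan loop simple: it never has to walk
the fail chain just to collect matches.
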
The beauty of the algorithm is that scanning is O(n) in the length of the search text, plus the number of matches
emitted. Once the trie has been built, the number of keywords hardly affects the scan, and a bigger search text
slows it down only linearly.

Some examples you could use the Aho-Corasick algorithm for:
* looking for certain words in texts in order to URL link or emphasize them
* adding semantics to plain text
* checking against a dictionary to see if syntactic errors were made

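As an illustration of the first use case, the sketch below wraps matched words in `*` markers. It assumes the
matches are already available as `{start, endInclusive}` index pairs from some matcher; the class `EmphasisSketch`
and that span format are hypothetical and not part of this library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper; the span format {start, endInclusive} is an assumption of this example.
class EmphasisSketch {

    // Wraps each matched span in '*'. When spans overlap, the leftmost (and then longest) wins.
    static String emphasize(String text, List<int[]> spans) {
        List<int[]> sorted = new ArrayList<>(spans);
        sorted.sort(Comparator.comparingInt((int[] s) -> s[0]).thenComparingInt(s -> -s[1]));
        StringBuilder out = new StringBuilder();
        int pos = 0; // first index of the text not yet copied to the output
        for (int[] span : sorted) {
            if (span[0] < pos) {
                continue; // overlaps a span that was already emphasized
            }
            out.append(text, pos, span[0]);
            out.append('*').append(text, span[0], span[1] + 1).append('*');
            pos = span[1] + 1;
        }
        out.append(text.substring(pos));
        return out.toString();
    }

    public static void main(String[] args) {
        // Spans for "she", "he" and "hers" inside "ushers".
        List<int[]> spans = Arrays.asList(new int[] {1, 3}, new int[] {2, 3}, new int[] {2, 5});
        System.out.println(emphasize("ushers", spans)); // prints u*she*rs
    }
}
```
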
This library is a Java implementation of the aforementioned Aho-Corasick algorithm for efficient string matching.
The algorithm is explained in great detail in the white paper written by Aho and Corasick:
ftp://163.13.200.222/assistant/bearhero/prog/%A8%E4%A5%A6/ac_bm.pdf.

Usage
-----

```java
// Register every keyword before scanning.
Trie trie = new Trie();
trie.addKeyword("hers");
trie.addKeyword("his");
trie.addKeyword("she");
trie.addKeyword("he");
// Scan the text once and collect the keyword matches (Collection is java.util.Collection).
Collection<Emit> emits = trie.parseText("ushers");
```

License
-------
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.