208 lines
16 KiB
HTML
208 lines
16 KiB
HTML
<!doctype html>
|
|
<html>
|
|
<head>
|
|
<meta charset="utf-8">
|
|
<meta http-equiv="X-UA-Compatible" content="chrome=1">
|
|
<title>Aho-Corasick by robert-bor</title>
|
|
<link rel="stylesheet" href="stylesheets/styles.css">
|
|
<link rel="stylesheet" href="stylesheets/pygment_trac.css">
|
|
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.7.1/jquery.min.js"></script>
|
|
<script src="javascripts/respond.js"></script>
|
|
<!--[if lt IE 9]>
|
|
<script src="//html5shiv.googlecode.com/svn/trunk/html5.js"></script>
|
|
<![endif]-->
|
|
<!--[if lt IE 8]>
|
|
<link rel="stylesheet" href="stylesheets/ie.css">
|
|
<![endif]-->
|
|
<meta name="viewport" content="width=device-width, initial-scale=1, user-scalable=no">
|
|
|
|
</head>
|
|
<body>
|
|
<div id="header">
|
|
<nav>
|
|
<li class="fork"><a href="https://github.com/robert-bor/aho-corasick">View On GitHub</a></li>
|
|
<li class="downloads"><a href="https://github.com/robert-bor/aho-corasick/zipball/master">ZIP</a></li>
|
|
<li class="downloads"><a href="https://github.com/robert-bor/aho-corasick/tarball/master">TAR</a></li>
|
|
<li class="title">DOWNLOADS</li>
|
|
</nav>
|
|
</div><!-- end header -->
|
|
|
|
<div class="wrapper">
|
|
|
|
<section>
|
|
<div id="title">
|
|
<h1>Aho-Corasick</h1>
|
|
<p>Java implementation of the Aho-Corasick algorithm for efficient string matching</p>
|
|
<hr>
|
|
<span class="credits left">Project maintained by <a href="https://github.com/robert-bor">robert-bor</a></span>
|
|
<span class="credits right">Hosted on GitHub Pages — Theme by <a href="https://twitter.com/michigangraham">mattgraham</a></span>
|
|
</div>
|
|
|
|
<h1>
|
|
<a name="aho-corasick" class="anchor" href="#aho-corasick"><span class="octicon octicon-link"></span></a>Aho-Corasick</h1>
|
|
|
|
<h2>
|
|
<a name="dependency" class="anchor" href="#dependency"><span class="octicon octicon-link"></span></a>Dependency</h2>
|
|
|
|
<p>Include this dependency in your POM. Be sure to check for the latest version in Maven Central.</p>
|
|
|
|
<div class="highlight highlight-xml"><pre> <span class="nt"><dependency></span>
|
|
<span class="nt"><groupId></span>org.ahocorasick<span class="nt"></groupId></span>
|
|
<span class="nt"><artifactId></span>ahocorasick<span class="nt"></artifactId></span>
|
|
<span class="nt"><version></span>0.2.4<span class="nt"></version></span>
|
|
<span class="nt"></dependency></span>
|
|
</pre></div>
|
|
|
|
<h2>
|
|
<a name="introduction" class="anchor" href="#introduction"><span class="octicon octicon-link"></span></a>Introduction</h2>
|
|
|
|
<p>Nowadays most free-text searching is based on Lucene-like approaches, where the search text is parsed into its
|
|
various components. For every keyword a lookup is done to see where it occurs. When looking for a couple of keywords
|
|
this approach is great. But what about it if you are not looking for just a couple of keywords, but a 100,000 of
|
|
them? Like, for example, checking against a dictionary?</p>
|
|
|
|
<p>This is where the Aho-Corasick algorithm shines. Instead of chopping up the search text, it uses all the keywords
|
|
to build up a construct called a <a href="http://en.wikipedia.org/wiki/Trie">Trie</a>. There are three crucial components
|
|
to Aho-Corasick:</p>
|
|
|
|
<ul>
|
|
<li>goto</li>
|
|
<li>fail</li>
|
|
<li>output</li>
|
|
</ul><p>Every character encountered is presented to a state object within the <em>goto</em> structure. If there is a matching state,
|
|
that will be elevated to the new current state.</p>
|
|
|
|
<p>However, if there is no matching state, the algorithm will signal a <em>fail</em> and fall back to states with less depth
|
|
(ie, a match less long) and proceed from there, until it found a matching state, or it has reached the root state.</p>
|
|
|
|
<p>Whenever a state is reached that matches an entire keyword, it is emitted to an <em>output</em> set which can be read after
|
|
the entire scan has completed.</p>
|
|
|
|
<p>The beauty of the algorithm is that it is O(n). No matter how many keywords you have, or how big the search text is,
|
|
the performance will decline in a linear way.</p>
|
|
|
|
<p>Some examples you could use the Aho-Corasick algorithm for:</p>
|
|
|
|
<ul>
|
|
<li>looking for certain words in texts in order to URL link or emphasize them</li>
|
|
<li>adding semantics to plain text</li>
|
|
<li>checking against a dictionary to see if syntactic errors were made</li>
|
|
</ul><p>This library is the Java implementation of the afore-mentioned Aho-Corasick algorithm for efficient string matching.
|
|
The algorithm is explained in great detail in the white paper written by
|
|
Aho and Corasick: <a>ftp://163.13.200.222/assistant/bearhero/prog/%A8%E4%A5%A6/ac_bm.pdf</a></p>
|
|
|
|
<h2>
|
|
<a name="usage" class="anchor" href="#usage"><span class="octicon octicon-link"></span></a>Usage</h2>
|
|
|
|
<p>Setting up the Trie is a piece of cake:</p>
|
|
|
|
<div class="highlight highlight-java"><pre> <span class="n">Trie</span> <span class="n">trie</span> <span class="o">=</span> <span class="k">new</span> <span class="n">Trie</span><span class="o">();</span>
|
|
<span class="n">trie</span><span class="o">.</span><span class="na">addKeyword</span><span class="o">(</span><span class="s">"hers"</span><span class="o">);</span>
|
|
<span class="n">trie</span><span class="o">.</span><span class="na">addKeyword</span><span class="o">(</span><span class="s">"his"</span><span class="o">);</span>
|
|
<span class="n">trie</span><span class="o">.</span><span class="na">addKeyword</span><span class="o">(</span><span class="s">"she"</span><span class="o">);</span>
|
|
<span class="n">trie</span><span class="o">.</span><span class="na">addKeyword</span><span class="o">(</span><span class="s">"he"</span><span class="o">);</span>
|
|
<span class="n">Collection</span><span class="o"><</span><span class="n">Emit</span><span class="o">></span> <span class="n">emits</span> <span class="o">=</span> <span class="n">trie</span><span class="o">.</span><span class="na">parseText</span><span class="o">(</span><span class="s">"ushers"</span><span class="o">);</span>
|
|
</pre></div>
|
|
|
|
<p>You can now read the set. In this case it will find the following:</p>
|
|
|
|
<ul>
|
|
<li>"she" starting at position 1, ending at position 3</li>
|
|
<li>"he" starting at position 2, ending at position 3</li>
|
|
<li>"hers" starting at position 2, ending at position 5</li>
|
|
</ul><p>In normal situations you probably want to remove overlapping instances, retaining the longest and left-most
|
|
matches.</p>
|
|
|
|
<div class="highlight highlight-java"><pre> <span class="n">Trie</span> <span class="n">trie</span> <span class="o">=</span> <span class="k">new</span> <span class="n">Trie</span><span class="o">().</span><span class="na">removeOverlaps</span><span class="o">();</span>
|
|
<span class="n">trie</span><span class="o">.</span><span class="na">addKeyword</span><span class="o">(</span><span class="s">"hot"</span><span class="o">);</span>
|
|
<span class="n">trie</span><span class="o">.</span><span class="na">addKeyword</span><span class="o">(</span><span class="s">"hot chocolate"</span><span class="o">);</span>
|
|
<span class="n">Collection</span><span class="o"><</span><span class="n">Emit</span><span class="o">></span> <span class="n">emits</span> <span class="o">=</span> <span class="n">trie</span><span class="o">.</span><span class="na">parseText</span><span class="o">(</span><span class="s">"hot chocolate"</span><span class="o">);</span>
|
|
</pre></div>
|
|
|
|
<p>The removeOverlaps method tells the Trie to remove all overlapping matches. For this it relies on the following
|
|
conflict resolution rules: 1) longer matches prevail over shorter matches, 2) left-most prevails over right-most.
|
|
There is only one result now:</p>
|
|
|
|
<ul>
|
|
<li>"hot chocolate" starting at position 0, ending at position 12</li>
|
|
</ul>
|
|
|
|
<div class="highlight highlight-java"><pre> <span class="n">Trie</span> <span class="n">trie</span> <span class="o">=</span> <span class="k">new</span> <span class="n">Trie</span><span class="o">().</span><span class="na">onlyWholeWords</span><span class="o">();</span>
|
|
<span class="n">trie</span><span class="o">.</span><span class="na">addKeyword</span><span class="o">(</span><span class="s">"sugar"</span><span class="o">);</span>
|
|
<span class="n">Collection</span><span class="o"><</span><span class="n">Emit</span><span class="o">></span> <span class="n">emits</span> <span class="o">=</span> <span class="n">trie</span><span class="o">.</span><span class="na">parseText</span><span class="o">(</span><span class="s">"sugarcane sugarcane sugar canesugar"</span><span class="o">);</span>
|
|
</pre></div>
|
|
|
|
<p>In this case, it will only find one match, whereas it would normally find four. The sugarcane/canesugar words
|
|
are discarded because they are partial matches.</p>
|
|
|
|
<p>Some text is WrItTeN in combinations of lowercase and uppercase and therefore hard to identify. You can instruct
|
|
the Trie to lowercase the entire searchtext to ease the matching process.</p>
|
|
|
|
<div class="highlight highlight-java"><pre> <span class="n">Trie</span> <span class="n">trie</span> <span class="o">=</span> <span class="k">new</span> <span class="n">Trie</span><span class="o">().</span><span class="na">caseInsensitive</span><span class="o">();</span>
|
|
<span class="n">trie</span><span class="o">.</span><span class="na">addKeyword</span><span class="o">(</span><span class="s">"casing"</span><span class="o">);</span>
|
|
<span class="n">Collection</span><span class="o"><</span><span class="n">Emit</span><span class="o">></span> <span class="n">emits</span> <span class="o">=</span> <span class="n">trie</span><span class="o">.</span><span class="na">parseText</span><span class="o">(</span><span class="s">"CaSiNg"</span><span class="o">);</span>
|
|
</pre></div>
|
|
|
|
<p>Normally, this match would not be found. With the caseInsensitive settings the entire search text is lowercased
|
|
before the matching begins. Therefore it will find exactly one match. Since you still have control of the original
|
|
search text and you will know exactly where the match was, you can still utilize the original casing.</p>
|
|
|
|
<p>In many cases you may want to do useful stuff with both the non-matching and the matching text. In this case, you
|
|
might be better served by using the Trie.tokenize(). It allows you to loop over the entire text and deal with
|
|
matches as soon as you encounter them. Let's look at an example where we want to highlight words from HGttG in HTML:</p>
|
|
|
|
<div class="highlight highlight-java"><pre> <span class="n">String</span> <span class="n">speech</span> <span class="o">=</span> <span class="s">"The Answer to the Great Question... Of Life, "</span> <span class="o">+</span>
|
|
<span class="s">"the Universe and Everything... Is... Forty-two,' said "</span> <span class="o">+</span>
|
|
<span class="s">"Deep Thought, with infinite majesty and calm."</span><span class="o">;</span>
|
|
<span class="n">Trie</span> <span class="n">trie</span> <span class="o">=</span> <span class="k">new</span> <span class="n">Trie</span><span class="o">().</span><span class="na">removeOverlaps</span><span class="o">().</span><span class="na">onlyWholeWords</span><span class="o">().</span><span class="na">caseInsensitive</span><span class="o">();</span>
|
|
<span class="n">trie</span><span class="o">.</span><span class="na">addKeyword</span><span class="o">(</span><span class="s">"great question"</span><span class="o">);</span>
|
|
<span class="n">trie</span><span class="o">.</span><span class="na">addKeyword</span><span class="o">(</span><span class="s">"forty-two"</span><span class="o">);</span>
|
|
<span class="n">trie</span><span class="o">.</span><span class="na">addKeyword</span><span class="o">(</span><span class="s">"deep thought"</span><span class="o">);</span>
|
|
<span class="n">Collection</span><span class="o"><</span><span class="n">Token</span><span class="o">></span> <span class="n">tokens</span> <span class="o">=</span> <span class="n">trie</span><span class="o">.</span><span class="na">tokenize</span><span class="o">(</span><span class="n">speech</span><span class="o">);</span>
|
|
<span class="n">StringBuffer</span> <span class="n">html</span> <span class="o">=</span> <span class="k">new</span> <span class="n">StringBuffer</span><span class="o">();</span>
|
|
<span class="n">html</span><span class="o">.</span><span class="na">append</span><span class="o">(</span><span class="s">"<html><body><p>"</span><span class="o">);</span>
|
|
<span class="k">for</span> <span class="o">(</span><span class="n">Token</span> <span class="n">token</span> <span class="o">:</span> <span class="n">tokens</span><span class="o">)</span> <span class="o">{</span>
|
|
<span class="k">if</span> <span class="o">(</span><span class="n">token</span><span class="o">.</span><span class="na">isMatch</span><span class="o">())</span> <span class="o">{</span>
|
|
<span class="n">html</span><span class="o">.</span><span class="na">append</span><span class="o">(</span><span class="s">"<i>"</span><span class="o">);</span>
|
|
<span class="o">}</span>
|
|
<span class="n">html</span><span class="o">.</span><span class="na">append</span><span class="o">(</span><span class="n">token</span><span class="o">.</span><span class="na">getFragment</span><span class="o">());</span>
|
|
<span class="k">if</span> <span class="o">(</span><span class="n">token</span><span class="o">.</span><span class="na">isMatch</span><span class="o">())</span> <span class="o">{</span>
|
|
<span class="n">html</span><span class="o">.</span><span class="na">append</span><span class="o">(</span><span class="s">"</i>"</span><span class="o">);</span>
|
|
<span class="o">}</span>
|
|
<span class="o">}</span>
|
|
<span class="n">html</span><span class="o">.</span><span class="na">append</span><span class="o">(</span><span class="s">"</p></body></html>"</span><span class="o">);</span>
|
|
<span class="n">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="n">html</span><span class="o">);</span>
|
|
</pre></div>
|
|
|
|
<h2>
|
|
<a name="license" class="anchor" href="#license"><span class="octicon octicon-link"></span></a>License</h2>
|
|
|
|
<p>Licensed under the Apache License, Version 2.0 (the "License");
|
|
you may not use this file except in compliance with the License.
|
|
You may obtain a copy of the License at</p>
|
|
|
|
<pre><code>http://www.apache.org/licenses/LICENSE-2.0
|
|
</code></pre>
|
|
|
|
<p>Unless required by applicable law or agreed to in writing, software
|
|
distributed under the License is distributed on an "AS IS" BASIS,
|
|
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
|
See the License for the specific language governing permissions and
|
|
limitations under the License.</p>
|
|
</section>
|
|
|
|
</div>
|
|
<!--[if !IE]><script>fixScale(document);</script><![endif]-->
|
|
<script type="text/javascript">
|
|
var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");
|
|
document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));
|
|
</script>
|
|
<script type="text/javascript">
|
|
try {
|
|
var pageTracker = _gat._getTracker("UA-47586863-1");
|
|
pageTracker._trackPageview();
|
|
} catch(err) {}
|
|
</script>
|
|
|
|
</body>
|
|
</html> |