aho-corasick/index.html

<!doctype html>
<html>
  <head>
    <meta charset="utf-8">
    <meta http-equiv="X-UA-Compatible" content="chrome=1">
    <title>Aho-Corasick by robert-bor</title>
    <link rel="stylesheet" href="stylesheets/styles.css">
    <link rel="stylesheet" href="stylesheets/pygment_trac.css">
    <script src="https://ajax.googleapis.com/ajax/libs/jquery/1.7.1/jquery.min.js"></script>
    <script src="javascripts/respond.js"></script>
    <!--[if lt IE 9]>
      <script src="//html5shiv.googlecode.com/svn/trunk/html5.js"></script>
    <![endif]-->
    <!--[if lt IE 8]>
    <link rel="stylesheet" href="stylesheets/ie.css">
    <![endif]-->
    <meta name="viewport" content="width=device-width, initial-scale=1, user-scalable=no">

  </head>
  <body>
      <div id="header">
        <nav>
          <li class="fork"><a href="https://github.com/robert-bor/aho-corasick">View On GitHub</a></li>
          <li class="downloads"><a href="https://github.com/robert-bor/aho-corasick/zipball/master">ZIP</a></li>
          <li class="downloads"><a href="https://github.com/robert-bor/aho-corasick/tarball/master">TAR</a></li>
          <li class="title">DOWNLOADS</li>
        </nav>
      </div><!-- end header -->

    <div class="wrapper">

      <section>
        <div id="title">
          <h1>Aho-Corasick</h1>
          <p>Java implementation of the Aho-Corasick algorithm for efficient string matching</p>
          <hr>
          <span class="credits left">Project maintained by <a href="https://github.com/robert-bor">robert-bor</a></span>
          <span class="credits right">Hosted on GitHub Pages &mdash; Theme by <a href="https://twitter.com/michigangraham">mattgraham</a></span>
        </div>

        <h1>
<a name="aho-corasick" class="anchor" href="#aho-corasick"><span class="octicon octicon-link"></span></a>Aho-Corasick</h1>

          <h2>
              <a name="dependency" class="anchor" href="#dependency"><span class="octicon octicon-link"></span></a>Dependency</h2>

          <p>Include this dependency in your POM. Be sure to check for the latest version in Maven Central.</p>

          <div class="highlight highlight-xml"><pre>    <span class="nt">&lt;dependency&gt;</span>
        <span class="nt">&lt;groupId&gt;</span>org.ahocorasick<span class="nt">&lt;/groupId&gt;</span>
        <span class="nt">&lt;artifactId&gt;</span>ahocorasick<span class="nt">&lt;/artifactId&gt;</span>
        <span class="nt">&lt;version&gt;</span>0.2.4<span class="nt">&lt;/version&gt;</span>
    <span class="nt">&lt;/dependency&gt;</span>
</pre></div>

          <h2>
<a name="introduction" class="anchor" href="#introduction"><span class="octicon octicon-link"></span></a>Introduction</h2>

          <p>Nowadays most free-text searching is based on Lucene-like approaches, where the search text is parsed into its
              various components. For every keyword a lookup is done to see where it occurs. When looking for a couple of keywords
              this approach is great. But what about it if you are not looking for just a couple of keywords, but a 100,000 of
              them? Like, for example, checking against a dictionary?</p>

          <p>This is where the Aho-Corasick algorithm shines. Instead of chopping up the search text, it uses all the keywords
              to build up a construct called a <a href="http://en.wikipedia.org/wiki/Trie">Trie</a>. There are three crucial components
              to Aho-Corasick:</p>

          <ul>
              <li>goto</li>
              <li>fail</li>
              <li>output</li>
          </ul><p>Every character encountered is presented to a state object within the <em>goto</em> structure. If there is a matching state,
          that will be elevated to the new current state.</p>

          <p>However, if there is no matching state, the algorithm will signal a <em>fail</em> and fall back to states with less depth
              (ie, a match less long) and proceed from there, until it found a matching state, or it has reached the root state.</p>

          <p>Whenever a state is reached that matches an entire keyword, it is emitted to an <em>output</em> set which can be read after
              the entire scan has completed.</p>

          <p>The beauty of the algorithm is that it is O(n). No matter how many keywords you have, or how big the search text is,
              the performance will decline in a linear way.</p>

          <p>Some examples you could use the Aho-Corasick algorithm for:</p>

          <ul>
              <li>looking for certain words in texts in order to URL link or emphasize them</li>
              <li>adding semantics to plain text</li>
              <li>checking against a dictionary to see if syntactic errors were made</li>
          </ul><p>This library is the Java implementation of the afore-mentioned Aho-Corasick algorithm for efficient string matching.
          The algorithm is explained in great detail in the white paper written by
          Aho and Corasick: <a>ftp://163.13.200.222/assistant/bearhero/prog/%A8%E4%A5%A6/ac_bm.pdf</a></p>

          <h2>
              <a name="usage" class="anchor" href="#usage"><span class="octicon octicon-link"></span></a>Usage</h2>

          <p>Setting up the Trie is a piece of cake:</p>

          <div class="highlight highlight-java"><pre>    <span class="n">Trie</span> <span class="n">trie</span> <span class="o">=</span> <span class="k">new</span> <span class="n">Trie</span><span class="o">();</span>
    <span class="n">trie</span><span class="o">.</span><span class="na">addKeyword</span><span class="o">(</span><span class="s">"hers"</span><span class="o">);</span>
    <span class="n">trie</span><span class="o">.</span><span class="na">addKeyword</span><span class="o">(</span><span class="s">"his"</span><span class="o">);</span>
    <span class="n">trie</span><span class="o">.</span><span class="na">addKeyword</span><span class="o">(</span><span class="s">"she"</span><span class="o">);</span>
    <span class="n">trie</span><span class="o">.</span><span class="na">addKeyword</span><span class="o">(</span><span class="s">"he"</span><span class="o">);</span>
    <span class="n">Collection</span><span class="o">&lt;</span><span class="n">Emit</span><span class="o">&gt;</span> <span class="n">emits</span> <span class="o">=</span> <span class="n">trie</span><span class="o">.</span><span class="na">parseText</span><span class="o">(</span><span class="s">"ushers"</span><span class="o">);</span>
</pre></div>

          <p>You can now read the set. In this case it will find the following:</p>

          <ul>
              <li>"she" starting at position 1, ending at position 3</li>
              <li>"he" starting at position 2, ending at position 3</li>
              <li>"hers" starting at position 2, ending at position 5</li>
          </ul><p>In normal situations you probably want to remove overlapping instances, retaining the longest and left-most
          matches.</p>

          <div class="highlight highlight-java"><pre>    <span class="n">Trie</span> <span class="n">trie</span> <span class="o">=</span> <span class="k">new</span> <span class="n">Trie</span><span class="o">().</span><span class="na">removeOverlaps</span><span class="o">();</span>
    <span class="n">trie</span><span class="o">.</span><span class="na">addKeyword</span><span class="o">(</span><span class="s">"hot"</span><span class="o">);</span>
    <span class="n">trie</span><span class="o">.</span><span class="na">addKeyword</span><span class="o">(</span><span class="s">"hot chocolate"</span><span class="o">);</span>
    <span class="n">Collection</span><span class="o">&lt;</span><span class="n">Emit</span><span class="o">&gt;</span> <span class="n">emits</span> <span class="o">=</span> <span class="n">trie</span><span class="o">.</span><span class="na">parseText</span><span class="o">(</span><span class="s">"hot chocolate"</span><span class="o">);</span>
</pre></div>

          <p>The removeOverlaps method tells the Trie to remove all overlapping matches. For this it relies on the following
              conflict resolution rules: 1) longer matches prevail over shorter matches, 2) left-most prevails over right-most.
              There is only one result now:</p>

          <ul>
              <li>"hot chocolate" starting at position 0, ending at position 12</li>
          </ul>

          <div class="highlight highlight-java"><pre>    <span class="n">Trie</span> <span class="n">trie</span> <span class="o">=</span> <span class="k">new</span> <span class="n">Trie</span><span class="o">().</span><span class="na">onlyWholeWords</span><span class="o">();</span>
    <span class="n">trie</span><span class="o">.</span><span class="na">addKeyword</span><span class="o">(</span><span class="s">"sugar"</span><span class="o">);</span>
    <span class="n">Collection</span><span class="o">&lt;</span><span class="n">Emit</span><span class="o">&gt;</span> <span class="n">emits</span> <span class="o">=</span> <span class="n">trie</span><span class="o">.</span><span class="na">parseText</span><span class="o">(</span><span class="s">"sugarcane sugarcane sugar canesugar"</span><span class="o">);</span>
</pre></div>

          <p>In this case, it will only find one match, whereas it would normally find four. The sugarcane/canesugar words
              are discarded because they are partial matches.</p>

          <p>Some text is WrItTeN in combinations of lowercase and uppercase and therefore hard to identify. You can instruct
              the Trie to lowercase the entire searchtext to ease the matching process.</p>

          <div class="highlight highlight-java"><pre>    <span class="n">Trie</span> <span class="n">trie</span> <span class="o">=</span> <span class="k">new</span> <span class="n">Trie</span><span class="o">().</span><span class="na">caseInsensitive</span><span class="o">();</span>
    <span class="n">trie</span><span class="o">.</span><span class="na">addKeyword</span><span class="o">(</span><span class="s">"casing"</span><span class="o">);</span>
    <span class="n">Collection</span><span class="o">&lt;</span><span class="n">Emit</span><span class="o">&gt;</span> <span class="n">emits</span> <span class="o">=</span> <span class="n">trie</span><span class="o">.</span><span class="na">parseText</span><span class="o">(</span><span class="s">"CaSiNg"</span><span class="o">);</span>
</pre></div>

          <p>Normally, this match would not be found. With the caseInsensitive settings the entire search text is lowercased
              before the matching begins. Therefore it will find exactly one match. Since you still have control of the original
              search text and you will know exactly where the match was, you can still utilize the original casing.</p>

          <p>In many cases you may want to do useful stuff with both the non-matching and the matching text. In this case, you
              might be better served by using the Trie.tokenize(). It allows you to loop over the entire text and deal with
              matches as soon as you encounter them. Let's look at an example where we want to highlight words from HGttG in HTML:</p>

          <div class="highlight highlight-java"><pre>    <span class="n">String</span> <span class="n">speech</span> <span class="o">=</span> <span class="s">"The Answer to the Great Question... Of Life, "</span> <span class="o">+</span>
            <span class="s">"the Universe and Everything... Is... Forty-two,' said "</span> <span class="o">+</span>
            <span class="s">"Deep Thought, with infinite majesty and calm."</span><span class="o">;</span>
    <span class="n">Trie</span> <span class="n">trie</span> <span class="o">=</span> <span class="k">new</span> <span class="n">Trie</span><span class="o">().</span><span class="na">removeOverlaps</span><span class="o">().</span><span class="na">onlyWholeWords</span><span class="o">().</span><span class="na">caseInsensitive</span><span class="o">();</span>
    <span class="n">trie</span><span class="o">.</span><span class="na">addKeyword</span><span class="o">(</span><span class="s">"great question"</span><span class="o">);</span>
    <span class="n">trie</span><span class="o">.</span><span class="na">addKeyword</span><span class="o">(</span><span class="s">"forty-two"</span><span class="o">);</span>
    <span class="n">trie</span><span class="o">.</span><span class="na">addKeyword</span><span class="o">(</span><span class="s">"deep thought"</span><span class="o">);</span>
    <span class="n">Collection</span><span class="o">&lt;</span><span class="n">Token</span><span class="o">&gt;</span> <span class="n">tokens</span> <span class="o">=</span> <span class="n">trie</span><span class="o">.</span><span class="na">tokenize</span><span class="o">(</span><span class="n">speech</span><span class="o">);</span>
    <span class="n">StringBuffer</span> <span class="n">html</span> <span class="o">=</span> <span class="k">new</span> <span class="n">StringBuffer</span><span class="o">();</span>
    <span class="n">html</span><span class="o">.</span><span class="na">append</span><span class="o">(</span><span class="s">"&lt;html&gt;&lt;body&gt;&lt;p&gt;"</span><span class="o">);</span>
    <span class="k">for</span> <span class="o">(</span><span class="n">Token</span> <span class="n">token</span> <span class="o">:</span> <span class="n">tokens</span><span class="o">)</span> <span class="o">{</span>
        <span class="k">if</span> <span class="o">(</span><span class="n">token</span><span class="o">.</span><span class="na">isMatch</span><span class="o">())</span> <span class="o">{</span>
            <span class="n">html</span><span class="o">.</span><span class="na">append</span><span class="o">(</span><span class="s">"&lt;i&gt;"</span><span class="o">);</span>
        <span class="o">}</span>
        <span class="n">html</span><span class="o">.</span><span class="na">append</span><span class="o">(</span><span class="n">token</span><span class="o">.</span><span class="na">getFragment</span><span class="o">());</span>
        <span class="k">if</span> <span class="o">(</span><span class="n">token</span><span class="o">.</span><span class="na">isMatch</span><span class="o">())</span> <span class="o">{</span>
            <span class="n">html</span><span class="o">.</span><span class="na">append</span><span class="o">(</span><span class="s">"&lt;/i&gt;"</span><span class="o">);</span>
        <span class="o">}</span>
    <span class="o">}</span>
    <span class="n">html</span><span class="o">.</span><span class="na">append</span><span class="o">(</span><span class="s">"&lt;/p&gt;&lt;/body&gt;&lt;/html&gt;"</span><span class="o">);</span>
    <span class="n">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="n">html</span><span class="o">);</span>
</pre></div>

          <h2>
<a name="license" class="anchor" href="#license"><span class="octicon octicon-link"></span></a>License</h2>

<p>Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at</p>

<pre><code>http://www.apache.org/licenses/LICENSE-2.0
</code></pre>

<p>Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.</p>
      </section>

    </div>
    <!--[if !IE]><script>fixScale(document);</script><![endif]-->
              <script type="text/javascript">
            var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");
            document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));
          </script>
          <script type="text/javascript">
            try {
              var pageTracker = _gat._getTracker("UA-47586863-1");
            pageTracker._trackPageview();
            } catch(err) {}
          </script>

  </body>
</html>