Simplified README text, examples, and corrected typos

2019-09-19 17:21:44 -07:00
commit 864df8140f
parent 9f80565b53
1 changed files with 186 additions and 156 deletions
--- a/README.md
+++ b/README.md
@@ -10,204 +10,233 @@ Aho-Corasick

 Dependency
 ----------
+
 Include this dependency in your POM. Be sure to check for the latest version in Maven Central.
+
 ```xml
-    <dependency>
-        <groupId>org.ahocorasick</groupId>
-        <artifactId>ahocorasick</artifactId>
-        <version>0.4.0</version>
-    </dependency>
+<dependency>
+  <groupId>org.ahocorasick</groupId>
+  <artifactId>ahocorasick</artifactId>
+  <version>0.5.0</version>
+</dependency>
 ```

 Introduction
 ------------
-Nowadays most free-text searching is based on Lucene-like approaches, where the search text is parsed into its
-various components. For every keyword a lookup is done to see where it occurs. When looking for a couple of keywords
-this approach is great. But what about it if you are not looking for just a couple of keywords, but a 100,000 of
-them? Like, for example, checking against a dictionary?

-This is where the Aho-Corasick algorithm shines. Instead of chopping up the search text, it uses all the keywords
-to build up a construct called a [Trie](http://en.wikipedia.org/wiki/Trie). There are three crucial components
-to Aho-Corasick:
+Most free-text searching is based on Lucene-like approaches, where the
+search text is parsed into its various components. For every keyword a
+lookup is done to see where it occurs. When looking for a couple of keywords
+this approach is great, but when searching for 100,000 words, the approach
+is quite slow (for example, checking against a dictionary).
+
+The Aho-Corasick algorithm shines when looking for multiple words.
+Rather than chop up the search text, it uses all the keywords to build
+a [Trie](http://en.wikipedia.org/wiki/Trie) construct. The crucial
+Aho-Corasick components include:
+
 * goto
 * fail
 * output

-Every character encountered is presented to a state object within the *goto* structure. If there is a matching state,
-that will be elevated to the new current state.
+Every character encountered is presented to a state object within the
+*goto* structure. If there is a matching state, that will be elevated to
+the new current state.

-However, if there is no matching state, the algorithm will signal a *fail* and fall back to states with less depth
-(ie, a match less long) and proceed from there, until it found a matching state, or it has reached the root state.
+However, if there is no matching state, the algorithm will signal a
+*fail* and fall back to states with less depth (i.e., a shorter match)
+and proceed from there, until it finds a matching state or reaches
+the root state.

-Whenever a state is reached that matches an entire keyword, it is emitted to an *output* set which can be read after
-the entire scan has completed.
+Whenever a state is reached that matches an entire keyword, it is
+emitted to an *output* set which can be read after the entire scan
+has completed.

-The beauty of the algorithm is that it is O(n). No matter how many keywords you have, or how big the search text is,
-the performance will decline in a linear way.
+Matching time grows linearly with the length of the search text plus the
+number of matches found. The number of keywords affects only the one-time
+cost of building the Trie, not the cost of scanning the text.
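
The goto, fail, and output components described above can be sketched in a few dozen lines of plain Java. This is a minimal illustration under stated assumptions, not this library's implementation: the `AhoCorasickSketch` and `State` names and the `keyword@start-end` result format are hypothetical and chosen only for this example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Hypothetical, self-contained sketch; not the library's Trie class.
public class AhoCorasickSketch {

    /** One trie node holding the three components named in the text. */
    private static final class State {
        final Map<Character, State> next = new HashMap<>(); // goto transitions
        State fail;                                          // fail link
        final List<String> output = new ArrayList<>();       // keywords ending here
    }

    private final State root = new State();

    public AhoCorasickSketch(List<String> keywords) {
        // Phase 1: insert every keyword into the trie (the goto structure).
        for (String keyword : keywords) {
            State state = root;
            for (char c : keyword.toCharArray()) {
                state = state.next.computeIfAbsent(c, k -> new State());
            }
            state.output.add(keyword);
        }
        // Phase 2: breadth-first walk so shallower fail targets are ready first.
        Queue<State> queue = new ArrayDeque<>();
        for (State child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            State state = queue.remove();
            for (Map.Entry<Character, State> entry : state.next.entrySet()) {
                State child = entry.getValue();
                State fallback = state.fail;
                while (fallback != root && !fallback.next.containsKey(entry.getKey())) {
                    fallback = fallback.fail;
                }
                State target = fallback.next.get(entry.getKey());
                child.fail = (target != null) ? target : root;
                // Keywords that end at the fail target also end at this state.
                child.output.addAll(child.fail.output);
                queue.add(child);
            }
        }
    }

    /** Scans the text once and reports matches as "keyword@start-end" (inclusive). */
    public List<String> search(String text) {
        List<String> matches = new ArrayList<>();
        State state = root;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // No goto transition: follow fail links to shorter matches.
            while (state != root && !state.next.containsKey(c)) {
                state = state.fail;
            }
            State target = state.next.get(c);
            state = (target != null) ? target : root;
            for (String keyword : state.output) {
                int start = i - keyword.length() + 1;
                matches.add(keyword + "@" + start + "-" + i);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        AhoCorasickSketch matcher =
                new AhoCorasickSketch(List.of("hers", "his", "she", "he"));
        // Prints [she@1-3, he@2-3, hers@2-5]
        System.out.println(matcher.search("ushers"));
    }
}
```

The text is read once from left to right; a failed transition never rewinds the input, it only moves the current state to a shallower one, which is why the scan stays linear in the text length.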

-Some examples you could use the Aho-Corasick algorithm for:
-* looking for certain words in texts in order to URL link or emphasize them
-* adding semantics to plain text
-* checking against a dictionary to see if syntactic errors were made
+The Aho-Corasick algorithm can help:

-This library is the Java implementation of the afore-mentioned Aho-Corasick algorithm for efficient string matching.
-The algorithm is explained in great detail in the white paper written by
-Aho and Corasick: http://cr.yp.to/bib/1975/aho.pdf
+* find words in texts to link or emphasize them;
+* add semantics to plain text; or
+* check against a dictionary to see if syntactic errors were made.
+
+See the [white paper](http://cr.yp.to/bib/1975/aho.pdf) by Aho and
+Corasick for algorithmic details.

 Usage
 -----
-Setting up the Trie is a piece of cake:
+Set up the Trie using a builder as follows:
+
 ```java
-    Trie trie = Trie.builder()
+Trie trie = Trie.builder()
+    .addKeyword("hers")
+    .addKeyword("his")
+    .addKeyword("she")
+    .addKeyword("he")
+    .build();
+Collection<Emit> emits = trie.parseText("ushers");
+```
+
+The collection will contain `Emit` objects that match:
+
+* "she" starting at position 1, ending at position 3
+* "he" starting at position 2, ending at position 3
+* "hers" starting at position 2, ending at position 5
+
+In situations where overlapping instances are not desired, retain
+the longest and left-most matches by calling `ignoreOverlaps()`:
+
+```java
+Trie trie = Trie.builder()
+    .ignoreOverlaps()
+    .addKeyword("hot")
+    .addKeyword("hot chocolate")
+    .build();
+Collection<Emit> emits = trie.parseText("hot chocolate");
+```
+
+The `ignoreOverlaps()` method tells the Trie to remove all overlapping
+matches. For this it relies on the following conflict resolution rules:
+
+1. longer matches prevail over shorter matches; and
+1. left-most prevails over right-most.
+
+Only one result is returned:
+
+* "hot chocolate" starting at position 0, ending at position 12
+
+To check for whole words exclusively, call `onlyWholeWords()` as follows:
+
+```java
+Trie trie = Trie.builder()
+    .onlyWholeWords()
+    .addKeyword("sugar")
+    .build();
+Collection<Emit> emits = trie.parseText("sugarcane sugar canesugar");
+```
+
+Only one match is found; without calling `onlyWholeWords()`, three
+matches are found. The sugarcane/canesugar words are discarded because
+they are partial matches.
+
+Some text is `WrItTeN` in mixed case, which makes it hard to identify.
+Instruct the Trie to convert the search text to lowercase to ease the
+matching process. The lower-casing applies to keywords as well.
+
+```java
+Trie trie = Trie.builder()
+    .ignoreCase()
+    .addKeyword("casing")
+    .build();
+Collection<Emit> emits = trie.parseText("CaSiNg");
+```
+
+Normally, this match would not be found. By calling `ignoreCase()`,
+the entire search text is made lowercase before matching begins.
+Therefore it will find exactly one match.
+
+It is also possible to just ask whether the text matches any of
+the keywords, or just to return the first match it finds.
+
+```java
+Trie trie = Trie.builder().ignoreOverlaps()
+        .addKeyword("ab")
+        .addKeyword("cba")
+        .addKeyword("ababc")
+        .build();
+Emit firstMatch = trie.firstMatch("ababcbab");
+```
+
+The value for `firstMatch` will be "ababc" from position 0. The
+`containsMatch()` method checks whether `firstMatch` found a match and
+returns `true` if that is the case.
+
+For a barebones Aho-Corasick algorithm with a custom emit handler use:
+ 
+```java
+Trie trie = Trie.builder()
        .addKeyword("hers")
        .addKeyword("his")
        .addKeyword("she")
        .addKeyword("he")
        .build();
-    Collection<Emit> emits = trie.parseText("ushers");
-```

-You can now read the set. In this case it will find the following:
-* "she" starting at position 1, ending at position 3
-* "he" starting at position 2, ending at position 3
-* "hers" starting at position 2, ending at position 5
+final List<Emit> emits = new ArrayList<>();
+EmitHandler emitHandler = new EmitHandler() {

-In normal situations you probably want to remove overlapping instances, retaining the longest and left-most
-matches.
-
-```java
-    Trie trie = Trie.builder()
-        .ignoreOverlaps()
-        .addKeyword("hot")
-        .addKeyword("hot chocolate")
-        .build();
-    Collection<Emit> emits = trie.parseText("hot chocolate");
-```
-
-The ignoreOverlaps method tells the Trie to remove all overlapping matches. For this it relies on the following
-conflict resolution rules: 1) longer matches prevail over shorter matches, 2) left-most prevails over right-most.
-There is only one result now:
-* "hot chocolate" starting at position 0, ending at position 12
-
-If you want the algorithm to only check for whole words, you can tell the Trie to do so:
-
-```java
-    Trie trie = Trie.builder()
-        .onlyWholeWords()
-        .addKeyword("sugar")
-        .build();
-    Collection<Emit> emits = trie.parseText("sugarcane sugarcane sugar canesugar");
-```
-
-In this case, it will only find one match, whereas it would normally find four. The sugarcane/canesugar words
-are discarded because they are partial matches.
-
-Some text is WrItTeN in a combination of lowercase and uppercase and therefore hard to identify. You can instruct
-the Trie to lowercase the entire searchtext to ease the matching process. The lower-casing extends to keywords as well.
-
-```java
-    Trie trie = Trie.builder()
-        .ignoreCase()
-        .addKeyword("casing")
-        .build();
-    Collection<Emit> emits = trie.parseText("CaSiNg");
-```
-
-Normally, this match would not be found. With the ignoreCase settings the entire search text is lowercased
-before the matching begins. Therefore it will find exactly one match. Since you still have control of the original
-search text and you will know exactly where the match was, you can still utilize the original casing.
-
-It is also possible to just ask whether the text matches any of the keywords, or just to return the first match it 
-finds.
-
-```java
-    Trie trie = Trie.builder().ignoreOverlaps()
-            .addKeyword("ab")
-            .addKeyword("cba")
-            .addKeyword("ababc")
-            .build();
-    Emit firstMatch = trie.firstMatch("ababcbab");
-```
-
-The firstMatch will now be "ababc" found at position 0. containsMatch just checks if there is a firstMatch and
-returns true if that is the case.
-
-If you just want the barebones Aho-Corasick algorithm (ie, no dealing with case insensitivity, overlaps and whole
- words) and you prefer to add your own handler to the mix, that is also possible.
- 
-```java
-    Trie trie = Trie.builder()
-            .addKeyword("hers")
-            .addKeyword("his")
-            .addKeyword("she")
-            .addKeyword("he")
-            .build();
-
-    final List<Emit> emits = new ArrayList<>();
-    EmitHandler emitHandler = new EmitHandler() {
-
-        @Override
-        public void emit(Emit emit) {
-            emits.add(emit);
-        }
-    };
-```
-
-In many cases you may want to do useful stuff with both the non-matching and the matching text. In this case, you
-might be better served by using the Trie.tokenize(). It allows you to loop over the entire text and deal with
-matches as soon as you encounter them. Let's look at an example where we want to highlight words from HGttG in HTML:
-
-```java
-    String speech = "The Answer to the Great Question... Of Life, " +
-            "the Universe and Everything... Is... Forty-two,' said " +
-            "Deep Thought, with infinite majesty and calm.";
-    Trie trie = Trie.builder().ignoreOverlaps().onlyWholeWords().ignoreCase()
-        .addKeyword("great question")
-        .addKeyword("forty-two")
-        .addKeyword("deep thought")
-        .build();
-    Collection<Token> tokens = trie.tokenize(speech);
-    StringBuffer html = new StringBuffer();
-    html.append("<html><body><p>");
-    for (Token token : tokens) {
-        if (token.isMatch()) {
-            html.append("<i>");
-        }
-        html.append(token.getFragment());
-        if (token.isMatch()) {
-            html.append("</i>");
-        }
+    @Override
+    public void emit(Emit emit) {
+        emits.add(emit);
    }
-    html.append("</p></body></html>");
-    System.out.println(html);
+};
+```
+
+In many cases you may want to perform tasks with both the non-matching
+and the matching text. Such implementations may be better served by using
+`Trie.tokenize()`. The `tokenize()` method allows looping over the
+corpus to deal with matches as soon as they are encountered. Here's an
+example that outputs keywords as italicized HTML elements:
+
+```java
+String speech = "The Answer to the Great Question... Of Life, " +
+        "the Universe and Everything... Is... Forty-two,' said " +
+        "Deep Thought, with infinite majesty and calm.";
+
+Trie trie = Trie.builder().ignoreOverlaps().onlyWholeWords().ignoreCase()
+    .addKeyword("great question")
+    .addKeyword("forty-two")
+    .addKeyword("deep thought")
+    .build();
+
+Collection<Token> tokens = trie.tokenize(speech);
+StringBuilder html = new StringBuilder();
+html.append("<html><body><p>");
+
+for (Token token : tokens) {
+    if (token.isMatch()) {
+        html.append("<i>");
+    }
+    html.append(token.getFragment());
+    if (token.isMatch()) {
+        html.append("</i>");
+    }
+}
+
+html.append("</p></body></html>");
+System.out.println(html);
 ```

 You can also emit custom outputs. This might for example be useful to implement a trivial named entity 
 recognizer. In this case use a PayloadTrie instead of a Trie:

 ```java
-    class Word {
-        private final String gender;
-        public Word(String gender) {
-            this.gender = gender;
-        }
+class Word {
+    private final String gender;
+    public Word(String gender) {
+        this.gender = gender;
    }
-    
-    PayloadTrie<Word> trie = PayloadTrie.<Word>builder()
-        .addKeyword("hers", new Word("f")
-        .addKeyword("his", new Word("m"))
-        .addKeyword("she", new Word("f"))
-        .addKeyword("he", new Word("m"))
-        .build();
-    Collection<PayloadEmit<Word>> emits = trie.parseText("ushers");
+}
+
+PayloadTrie<Word> trie = PayloadTrie.<Word>builder()
+    .addKeyword("hers", new Word("f"))
+    .addKeyword("his", new Word("m"))
+    .addKeyword("she", new Word("f"))
+    .addKeyword("he", new Word("m"))
+    .addKeyword("nonbinary", new Word("nb"))
+    .addKeyword("transgender", new Word("tg"))
+    .build();
+Collection<PayloadEmit<Word>> emits = trie.parseText("ushers");
 ```

 Releases
 --------
-Information on the aho-corasick [releases](https://github.com/robert-bor/aho-corasick/releases).
+
+See [releases](https://github.com/robert-bor/aho-corasick/releases) for details.

 License
 -------
+
   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at
@@ -219,3 +248,4 @@ License
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
+