From d8af03eb043af5f287c34a87c19320e9a4af5978 Mon Sep 17 00:00:00 2001
From: Dave Jarvis
Date: Mon, 9 Nov 2020 19:03:59 -0800
Subject: [PATCH] Add Apache license

---
 LICENSE.md | 243 +++--------------------------------------------------
 1 file changed, 10 insertions(+), 233 deletions(-)

diff --git a/LICENSE.md b/LICENSE.md
index a60abec..2fc71d5 100644
--- a/LICENSE.md
+++ b/LICENSE.md
@@ -1,237 +1,14 @@
-Aho-Corasick
-============
+Copyright [2020] [Robert Bor]
 
-[![Build Status](https://travis-ci.org/robert-bor/aho-corasick.svg?branch=master)](https://travis-ci.org/robert-bor/aho-corasick)
-[![Codacy Badge](https://api.codacy.com/project/badge/Grade/0f65bfb641f745a4b301b85d028a4a8d)](https://www.codacy.com/app/bor-robert/aho-corasick)
-[![Codecov](https://codecov.io/gh/robert-bor/aho-corasick/branch/master/graph/badge.svg)](https://codecov.io/gh/robert-bor/aho-corasick)
-[![Maven Central](https://maven-badges.herokuapp.com/maven-central/org.ahocorasick/ahocorasick/badge.svg)](https://maven-badges.herokuapp.com/maven-central/org.ahocorasick/ahocorasick)
-[![Javadoc](https://javadoc-emblem.rhcloud.com/doc/org.ahocorasick/ahocorasick/badge.svg)](http://www.javadoc.io/doc/org.ahocorasick/ahocorasick)
-[![Apache 2](http://img.shields.io/badge/license-Apache%202-blue.svg)](http://www.apache.org/licenses/LICENSE-2.0)
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
 
-Dependency
-----------
+    http://www.apache.org/licenses/LICENSE-2.0
 
-Include this dependency in your POM. Be sure to check for the latest version in Maven Central.
-
-```xml
-<dependency>
-    <groupId>org.ahocorasick</groupId>
-    <artifactId>ahocorasick</artifactId>
-    <version>0.5.0</version>
-</dependency>
-```
-
-Introduction
-------------
-
-Most free-text searching is based on Lucene-like approaches, where the
-search text is parsed into its various components. For every keyword a
-lookup is done to see where it occurs. When looking for a couple of keywords
-this approach is great, but when searching for 100,000 words, the approach
-is quite slow (for example, checking against a dictionary).
-
-The Aho-Corasick algorithm shines when looking for multiple words.
-Rather than chop up the search text, it uses all the keywords to build
-a [Trie](http://en.wikipedia.org/wiki/Trie) construct. The crucial
-Aho-Corasick components include:
-
-* goto
-* fail
-* output
-
-Every character encountered is presented to a state object within the
-*goto* structure. If there is a matching state, that will be elevated to
-the new current state.
-
-However, if there is no matching state, the algorithm will signal a
-*fail* and fall back to states with less depth (i.e., a match less long)
-and proceed from there, until it found a matching state, or it has reached
-the root state.
-
-Whenever a state is reached that matches an entire keyword, it is
-emitted to an *output* set which can be read after the entire scan
-has completed.
-
-The algorithm is O(n). No matter how many keywords are given, or how large
-the search text is, the performance will decline linearly.
-
-The Aho-Corasick algorithm can help:
-
-* find words in texts to link or emphasize them;
-* add semantics to plain text; or
-* check against a dictionary to see if syntactic errors were made.
-
-See the [white paper](http://cr.yp.to/bib/1975/aho.pdf) by Aho and
-Corasick for algorithmic details.
-
-Usage
------
-Set up the Trie using a builder as follows:
-
-```java
-Trie trie = Trie.builder()
-    .addKeyword("hers")
-    .addKeyword("his")
-    .addKeyword("she")
-    .addKeyword("he")
-    .build();
-Collection<Emit> emits = trie.parseText("ushers");
-```
-
-The collection will contain `Emit` objects that match:
-
-* "she" starting at position 1, ending at position 3
-* "he" starting at position 2, ending at position 3
-* "hers" starting at position 2, ending at position 5
-
-In situations where overlapping instances are not desired, retain
-the longest and left-most matches by calling `ignoreOverlaps()`:
-
-```java
-Trie trie = Trie.builder()
-    .ignoreOverlaps()
-    .addKeyword("hot")
-    .addKeyword("hot chocolate")
-    .build();
-Collection<Emit> emits = trie.parseText("hot chocolate");
-```
-
-The `ignoreOverlaps()` method tells the Trie to remove all overlapping
-matches. For this it relies on the following conflict resolution rules:
-
-1. longer matches prevail over shorter matches; and
-1. left-most prevails over right-most.
-
-Only one result is returned:
-
-* "hot chocolate" starting at position 0, ending at position 12
-
-To check for whole words exclusively, call `onlyWholeWords()` as follows:
-
-```java
-Trie trie = Trie.builder()
-    .onlyWholeWords()
-    .addKeyword("sugar")
-    .build();
-Collection<Emit> emits = trie.parseText("sugarcane sugar canesugar");
-```
-
-Only one match is found; whereas, without calling `onlyWholeWords()` four
-matches are found. The sugarcane/canesugar words are discarded because
-they are partial matches.
-
-Some text is `WrItTeN` in mixed case, which makes it hard to identify.
-Instruct the Trie to convert the searchtext to lowercase to ease the
-matching process. The lower-casing applies to keywords as well.
-
-```java
-Trie trie = Trie.builder()
-    .ignoreCase()
-    .addKeyword("casing")
-    .build();
-Collection<Emit> emits = trie.parseText("CaSiNg");
-```
-
-Normally, this match would not be found. By calling `ignoreCase()`,
-the entire search text is made lowercase before matching begins.
-Therefore it will find exactly one match.
-
-It is also possible to just ask whether the text matches any of
-the keywords, or just to return the first match it finds.
-
-```java
-Trie trie = Trie.builder().ignoreOverlaps()
-    .addKeyword("ab")
-    .addKeyword("cba")
-    .addKeyword("ababc")
-    .build();
-Emit firstMatch = trie.firstMatch("ababcbab");
-```
-
-The value for `firstMatch` will be "ababc" from position 0. The
-`containsMatch()` method checks whether `firstMatch` found a match and
-returns `true` if that is the case.
-
-For a barebones Aho-Corasick algorithm with a custom emit handler use:
-
-```java
-Trie trie = Trie.builder()
-    .addKeyword("hers")
-    .addKeyword("his")
-    .addKeyword("she")
-    .addKeyword("he")
-    .build();
-
-final List<Emit> emits = new ArrayList<>();
-EmitHandler emitHandler = new EmitHandler() {
-
-    @Override
-    public void emit(Emit emit) {
-        emits.add(emit);
-    }
-};
-```
-
-In many cases you may want to do perform tasks with both the non-matching
-and the matching text. Such implementations may be better served by using
-`Trie.tokenize()`. The `tokenize()` method allows looping over the
-corpus to deal with matches as soon as they are encountered. Here's an
-example that outputs key words as italicized HTML elements:
-
-```java
-String speech = "The Answer to the Great Question... Of Life, " +
-    "the Universe and Everything... Is... Forty-two,' said " +
-    "Deep Thought, with infinite majesty and calm.";
-
-Trie trie = Trie.builder().ignoreOverlaps().onlyWholeWords().ignoreCase()
-    .addKeyword("great question")
-    .addKeyword("forty-two")
-    .addKeyword("deep thought")
-    .build();
-
-Collection<Token> tokens = trie.tokenize(speech);
-StringBuilder html = new StringBuilder();
-html.append("<html><body><p>");
-
-for (Token token : tokens) {
-    if (token.isMatch()) {
-        html.append("<i>");
-    }
-    html.append(token.getFragment());
-    if (token.isMatch()) {
-        html.append("</i>");
-    }
-}
-
-html.append("</p></body></html>");
-System.out.println(html);
-```
-
-You can also emit custom outputs. This might for example be useful to
-implement a trivial named entity recognizer. In this case use a
-`PayloadTrie` instead of a `Trie` as follows:
-
-```java
-class Word {
-    private final String gender;
-    public Word(String gender) {
-        this.gender = gender;
-    }
-}
-
-PayloadTrie<Word> trie = PayloadTrie.<Word>builder()
-    .addKeyword("hers", new Word("f")
-    .addKeyword("his", new Word("m"))
-    .addKeyword("she", new Word("f"))
-    .addKeyword("he", new Word("m"))
-    .addKeyword("nonbinary", new Word("nb"))
-    .addKeyword("transgender", new Word("tg"))
-    .build();
-Collection<PayloadEmit<Word>> emits = trie.parseText("ushers");
-```
-
-Releases
---------
-
-See [releases](https://github.com/robert-bor/aho-corasick/releases) for details.
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.