aho-corasick/params.json
2014-01-29 13:22:22 -08:00

{"name":"Aho-Corasick","tagline":"Java implementation of the Aho-Corasick algorithm for efficient string matching","body":"Aho-Corasick\r\n============\r\n\r\nIntroduction\r\n------------\r\nNowadays most free-text searching is based on Lucene-like approaches, where the search text is parsed into its\r\nvarious components. For every keyword a lookup is done to see where it occurs. When looking for a couple of keywords\r\nthis approach is great. But what if you are looking not for a couple of keywords but for 100,000 of them,\r\nfor example when checking against a dictionary?\r\n\r\nThis is where the Aho-Corasick algorithm shines. Instead of chopping up the search text, it uses all the keywords\r\nto build up a construct called a [Trie](http://en.wikipedia.org/wiki/Trie). There are three crucial components\r\nto Aho-Corasick:\r\n* goto\r\n* fail\r\n* output\r\n\r\nEvery character encountered is presented to a state object within the goto structure. If there is a matching state,\r\nthat will be elevated to the new current state.\r\n\r\nHowever, if there is no matching state, the algorithm will fall back to states with less depth (i.e., a shorter match)\r\nand proceed from there, until it finds a matching state or reaches the root state.\r\n\r\nWhenever a state is reached that matches an entire keyword, that keyword is emitted to an output set which can be read\r\nafter the entire scan has completed.\r\n\r\nThe beauty of the algorithm is that the scan is O(n) in the length of the search text, plus the number of matches\r\nemitted. Building the trie is linear in the total length of the keywords, and once it is built, the scan time grows\r\nonly linearly with the size of the search text, no matter how many keywords you have.\r\n\r\nSome things you could use the Aho-Corasick algorithm for:\r\n* looking for certain words in texts in order to URL link or emphasize them\r\n* adding semantics to plain text\r\n* checking against a dictionary to see if syntactic errors were made\r\n\r\nThis library is the Java implementation of the aforementioned Aho-Corasick algorithm for efficient string matching.\r\nThe algorithm is explained in great detail in the paper written by\r\n[Aho and Corasick](ftp://163.13.200.222/assistant/bearhero/prog/%A8%E4%A5%A6/ac_bm.pdf).\r\n\r\nUsage\r\n-----\r\n\r\nSetting up the Trie is a piece of cake:\r\n```java\r\n Trie trie = new Trie();\r\n trie.addKeyword(\"hers\");\r\n trie.addKeyword(\"his\");\r\n trie.addKeyword(\"she\");\r\n trie.addKeyword(\"he\");\r\n Collection<Emit> emits = trie.parseText(\"ushers\");\r\n```\r\n\r\nYou can now read the collection of emits. In this case it will find the following (each position is the zero-based\r\nindex of the last character of the match):\r\n* \"she\" at position 3\r\n* \"he\" at position 3\r\n* \"hers\" at position 5\r\n\r\nLicense\r\n-------\r\n Licensed under the Apache License, Version 2.0 (the \"License\");\r\n you may not use this file except in compliance with the License.\r\n You may obtain a copy of the License at\r\n\r\n\thttp://www.apache.org/licenses/LICENSE-2.0\r\n\r\n Unless required by applicable law or agreed to in writing, software\r\n distributed under the License is distributed on an \"AS IS\" BASIS,\r\n WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\r\n See the License for the specific language governing permissions and\r\n limitations under the License.\r\n","google":"UA-47586863-1","note":"Don't delete this file! It's used internally to help with page regeneration."}