Named Entity Extraction

Recognize people, companies, organizations, cities and other predefined entities within HTML or text documents.

Text-Analysis implements two types of Named Entity Recognition (NER):

  • Dictionary-based NER: the terms in the document are looked up in a very large dictionary. The lookup is still fast because it uses the Aho-Corasick algorithm, which finds all dictionary matches in a single pass over the text. Matching time is linear in the length of the text plus the number of matches, and does not depend on the size of the dictionary.
  • Conditional Random Field (CRF) NER: in this case the engine must first be trained, but the CRF model has the advantage that terms are identified heuristically, so new occurrences are recognized without the need to build and update a dictionary.
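
To make the dictionary-based approach more concrete, here is a minimal Python sketch of Aho-Corasick matching. It is an illustration only and not the Text-Analysis implementation: the function names `build_automaton` and `find_entities`, the `{term: label}` dictionary layout, and the sample entries are all assumptions made for this example.

```python
from collections import deque


def build_automaton(dictionary):
    """Build an Aho-Corasick automaton from a {term: label} mapping (hypothetical layout)."""
    goto = [{}]     # goto[node][char] -> child node
    fail = [0]      # fail[node] -> node for the longest proper suffix that is also in the trie
    output = [[]]   # output[node] -> (term, label) pairs that end at this node

    # Phase 1: insert every dictionary term into a trie.
    for term, label in dictionary.items():
        node = 0
        for ch in term:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                output.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        output[node].append((term, label))

    # Phase 2: breadth-first pass that sets failure links and inherits outputs.
    queue = deque(goto[0].values())  # depth-1 nodes keep fail = 0 (the root)
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            queue.append(child)
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # Terms ending at the failure target also end here.
            output[child] = output[child] + output[fail[child]]

    return goto, fail, output


def find_entities(text, automaton):
    """Return (start_offset, term, label) for every dictionary match in text."""
    goto, fail, output = automaton
    node = 0
    matches = []
    for i, ch in enumerate(text):
        # Follow failure links until some state accepts this character.
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for term, label in output[node]:
            matches.append((i - len(term) + 1, term, label))
    return matches


# Sample entries are invented for illustration.
entities = {"New York": "CITY", "York": "CITY", "Acme Corp": "COMPANY"}
automaton = build_automaton(entities)
print(find_entities("Acme Corp opened an office in New York.", automaton))
```

The automaton is built once and can be reused for every document. Note that this sketch is case-sensitive and reports overlapping matches (both "New York" and the "York" inside it); a real NER pipeline would typically normalize the text and keep only the longest match at each position.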