Feature Generators

Feature generators define the features used by entity model training and entity extraction. The features are defined in an XML file and the file is referenced when creating custom entity models.

Example Feature Generator XML File

The feature generator XML file defines the features used to train the model. Changing the features can have a significant impact (both negative and positive) on the performance of your generated model. Features should be chosen carefully to optimize model performance. An example feature generator XML file is:

<?xml version="1.0" encoding="UTF-8"?>
<generators>
  <cache>
    <generators>
      <window prevLength="2" nextLength="2">
        <tokenclass/>
      </window>
      <prevmap/>
      <bigram/>
      <sentence begin="true" end="false"/>
    </generators>
  </cache>
</generators>

View the feature generator XSD.

Feature Generators

There are two types of feature generators: simple and aggregate. An aggregate feature generator is applied to one or more simple feature generators. To use a feature generator, copy its example snippet into the feature generator XML file inside the generators element.

Simple Feature Generators

Bi-gram Feature Generator

This feature generator generates features based on a bi-gram composed of the token and the previous token (when there is a previous token) and based on a bi-gram composed of the token and the next token (when there is a next token).

<bigram />
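
The bi-gram description above can be illustrated with a short Python sketch. This is not the product's implementation: the function name and the feature label strings (prev_bigram, next_bigram) are hypothetical and chosen only for illustration.

```python
def bigram_features(tokens, index):
    """Return bi-gram features for tokens[index].

    The feature labels are hypothetical; real feature names may differ.
    """
    features = []
    token = tokens[index]
    if index > 0:
        # Bi-gram of the previous token and the current token.
        features.append(f"prev_bigram={tokens[index - 1]},{token}")
    if index + 1 < len(tokens):
        # Bi-gram of the current token and the next token.
        features.append(f"next_bigram={token},{tokens[index + 1]}")
    return features


tokens = ["George", "Washington", "was", "president"]
print(bigram_features(tokens, 0))  # first token: only the next-token bi-gram
print(bigram_features(tokens, 1))  # middle token: both bi-grams
```

A token at the start or end of the input yields only one bi-gram, matching the "when there is a previous token" and "when there is a next token" conditions.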

Document Beginning Feature Generator

This feature generator generates features based on the first sentence of a document.

<docbegin />

Prefix Feature Generator

This feature generator generates features based on the token’s prefix. A prefix is defined as up to the first four characters of the token.

<prefix />

Previous Map Feature Generator

This feature generator generates features based on the outcome assigned to a previous occurrence of the token.

<prevmap />

Sentence Boundary Feature Generator

This feature generator generates features based on the annotated token’s position in the sentence. Set the begin property to true to generate a feature if the token starts a sentence. Set the end property to true to generate a feature if the token ends a sentence.

<sentence begin="true" end="false"/>
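
As a rough sketch of how the begin and end properties interact, the Python below treats the token list as a single sentence. The function name and the feature labels are hypothetical, not taken from the product.

```python
def sentence_features(tokens, index, begin=True, end=False):
    """Return sentence-boundary features for tokens[index].

    Assumes `tokens` holds exactly one sentence; labels are hypothetical.
    """
    features = []
    if begin and index == 0:
        # The token starts the sentence.
        features.append("sentence_begin")
    if end and index == len(tokens) - 1:
        # The token ends the sentence.
        features.append("sentence_end")
    return features


sentence = ["Paris", "is", "in", "France"]
print(sentence_features(sentence, 0))                       # begin enabled
print(sentence_features(sentence, 3))                       # end disabled
print(sentence_features(sentence, 3, begin=True, end=True)) # end enabled
```

With begin="true" and end="false", as in the snippet above, only sentence-initial tokens receive a feature.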

Special Character Feature Generator

This feature generator generates features based on tokens that contain special characters. (A special character is any character not in the set A-Z, a-z, and 0-9.)

<specchar />
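
The definition of a special character above maps directly onto a regular expression. The following Python sketch shows the check; the function name is hypothetical, and how the product turns a match into a feature is not shown here.

```python
import re

# Any character outside A-Z, a-z, and 0-9 counts as a special character.
SPECIAL_CHARACTER = re.compile(r"[^A-Za-z0-9]")


def has_special_character(token):
    """Return True if the token contains at least one special character."""
    return SPECIAL_CHARACTER.search(token) is not None


for token in ["O'Neil", "e-mail", "abc123"]:
    print(token, has_special_character(token))
```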

Token Part-of-Speech Feature Generator

This feature generator generates features based on the token’s part-of-speech. It requires a trained part-of-speech model. The modelPath property is the full path to the directory containing the models. The modelManifest property is the file name of the part-of-speech model manifest. (The model manifest file should be in the modelPath directory.)

<tokenpos modelPath="full/path/to/model/directory/" modelManifest="pos-model.manifest" />

Tri-gram Feature Generator

This feature generator generates features based on a tri-gram composed of the token and the two previous tokens (when there are two previous tokens) and based on a tri-gram composed of the token and the next two tokens (when there are two next tokens).

<trigram />

Suffix Feature Generator

This feature generator generates features based on the token’s suffix. A suffix is defined as up to the last four characters of the token.

<suffix />
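
The prefix and suffix definitions ("up to" four characters) can be read in more than one way. The Python sketch below assumes one feature per length from one to four characters, capped at the token length. That reading, along with the function names, is an assumption and not confirmed product behavior.

```python
def prefixes(token, max_length=4):
    """Return the leading substrings of length 1..max_length (assumed reading)."""
    return [token[:n] for n in range(1, min(max_length, len(token)) + 1)]


def suffixes(token, max_length=4):
    """Return the trailing substrings of length 1..max_length (assumed reading)."""
    return [token[-n:] for n in range(1, min(max_length, len(token)) + 1)]


print(prefixes("Boston"))  # leading substrings, shortest first
print(suffixes("Boston"))  # trailing substrings, shortest first
print(prefixes("NY"))      # short tokens yield fewer entries
```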

Token Feature Generator

This feature generator generates a feature containing the token itself.

<token />

Token Class Feature Generator

This feature generator generates a feature based on the content (lowercase alphabetic, digits, alphanumeric, etc.) of the token.

<tokenclass />

Word Normalization Feature Generator

This feature generator generates features by normalizing all tokens. All uppercase characters are replaced with A, all lowercase characters are replaced with a, and all digits are replaced with 0. For example, the token HelloWorld59 would create a feature AaaaaAaaaa00.

<wordnormalization />
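
The normalization rule can be sketched in a few lines of Python. The function name is hypothetical, and leaving other characters (such as punctuation) unchanged is an assumption, since the description above does not cover them.

```python
def normalize(token):
    """Map uppercase to 'A', lowercase to 'a', and digits to '0'."""
    normalized = []
    for character in token:
        if character.isupper():
            normalized.append("A")
        elif character.islower():
            normalized.append("a")
        elif character.isdigit():
            normalized.append("0")
        else:
            # Assumption: characters outside the three classes are kept as-is.
            normalized.append(character)
    return "".join(normalized)


print(normalize("HelloWorld59"))  # AaaaaAaaaa00
```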

Aggregate Feature Generators

These feature generators work in conjunction with one or more simple feature generators.

Window Feature Generator

This feature generator creates features from a window of fixed size around an annotated token, using the enclosed feature generator. The prevLength property specifies the length of the window prior to the token, and the nextLength property specifies the length of the window after the token.

<window prevLength="2" nextLength="2">
  <tokenclass/>
</window>
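
To make the windowing idea concrete, the Python sketch below applies a simple generator to every position within the window and tags each feature with its offset from the current token. The function names, the offset label format, and the inclusion of the current token itself (offset 0) are assumptions for illustration only.

```python
def token_feature(tokens, index):
    """A toy simple generator: one feature containing the token itself."""
    return [f"token={tokens[index]}"]


def window_features(tokens, index, generator, prev_length=2, next_length=2):
    """Apply `generator` to each position in the window around tokens[index].

    Positions that fall outside the token list are skipped. The offset
    prefix (for example "-1:" or "+2:") is a hypothetical label format.
    """
    features = []
    for offset in range(-prev_length, next_length + 1):
        position = index + offset
        if 0 <= position < len(tokens):
            for feature in generator(tokens, position):
                features.append(f"{offset:+d}:{feature}")
    return features


tokens = ["George", "Washington", "was", "president"]
for feature in window_features(tokens, 1, token_feature):
    print(feature)
```

Passing a different simple generator in place of token_feature mirrors nesting a different element, such as tokenclass, inside the window element.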