Class RegexTokenizer

java.lang.Object
  extended by org.apache.lucene.util.AttributeSource
      extended by com.twitter.common.text.token.TokenStream
          extended by com.twitter.common.text.tokenizer.RegexTokenizer
Direct Known Subclasses:

public class RegexTokenizer
extends TokenStream

Tokenizes text based on regular expressions of word delimiters and punctuation characters.
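The following is a minimal, stand-alone sketch of delimiter-based tokenization using only java.util.regex. It is not this class's implementation. The class name DelimiterSplitSketch, the example pattern, and the tokenize helper are all hypothetical. The parameters loosely mirror the names of the protected setters (setDelimiterPattern, setPunctuationGroupInDelimiterPattern, setKeepPunctuation); because this page does not document their exact semantics, that mapping is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of com.twitter.common.text.
final class DelimiterSplitSketch {
  // Assumed delimiter pattern: group 1 captures a single punctuation
  // character, the alternative matches a run of whitespace.
  static final Pattern DELIMITER = Pattern.compile("(\\p{Punct})|\\s+");
  static final int PUNCTUATION_GROUP = 1;

  static List<String> tokenize(CharSequence input, Pattern delimiter,
                               int punctuationGroup, boolean keepPunctuation) {
    List<String> tokens = new ArrayList<>();
    Matcher m = delimiter.matcher(input);
    int start = 0;
    while (m.find()) {
      // Text between the previous delimiter and this one is a word token.
      if (m.start() > start) {
        tokens.add(input.subSequence(start, m.start()).toString());
      }
      // Optionally emit the punctuation captured by the designated group.
      if (keepPunctuation && m.group(punctuationGroup) != null) {
        tokens.add(m.group(punctuationGroup));
      }
      start = m.end();
    }
    // Trailing text after the last delimiter.
    if (start < input.length()) {
      tokens.add(input.subSequence(start, input.length()).toString());
    }
    return tokens;
  }

  public static void main(String[] args) {
    // prints [Hello, ,, world, !]
    System.out.println(tokenize("Hello, world!", DELIMITER, PUNCTUATION_GROUP, true));
    // prints [Hello, world]
    System.out.println(tokenize("Hello, world!", DELIMITER, PUNCTUATION_GROUP, false));
  }
}
```

Capturing punctuation in a dedicated group lets one pattern serve both modes: every match acts as a delimiter, and the group decides whether the match is also emitted as a token.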

Nested Class Summary
static class RegexTokenizer.AbstractBuilder<N extends RegexTokenizer,T extends RegexTokenizer.AbstractBuilder<N,T>>
static class RegexTokenizer.Builder
          Builder for RegexTokenizer.
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
org.apache.lucene.util.AttributeSource.AttributeFactory, org.apache.lucene.util.AttributeSource.State
Constructor Summary
protected RegexTokenizer()
Method Summary
 boolean incrementToken()
          Consumers call this method to advance the stream to the next token.
 void reset(CharSequence input)
          Resets this TokenStream (and also downstream tokens if they exist) to parse a new input.
protected  void setDelimiterPattern(Pattern delimiterPattern)
protected  void setKeepPunctuation(boolean keepPunctuation)
protected  void setPunctuationGroupInDelimiterPattern(int group)
Methods inherited from class com.twitter.common.text.token.TokenStream
getInstanceOf, toStringList
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, restoreState, toString
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait

Constructor Detail


protected RegexTokenizer()
Method Detail


protected void setDelimiterPattern(Pattern delimiterPattern)


protected void setPunctuationGroupInDelimiterPattern(int group)


protected void setKeepPunctuation(boolean keepPunctuation)


public boolean incrementToken()
Description copied from class: TokenStream
Consumers call this method to advance the stream to the next token.

Specified by:
incrementToken in class TokenStream
Returns:
false at end of stream; true otherwise


public void reset(CharSequence input)
Description copied from class: TokenStream
Resets this TokenStream (and also downstream tokens if they exist) to parse a new input.

Specified by:
reset in class TokenStream
Parameters:
input - new text to parse.
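
The consumption protocol described by reset and incrementToken can be sketched with a stand-in class. WhitespaceStreamSketch is hypothetical and splits on whitespace only. Its term() accessor is a simplification for illustration; this page lists only the attribute methods inherited from AttributeSource (addAttribute, getAttribute) and does not show how the current token is read from a real TokenStream.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in that mimics the reset/incrementToken contract.
final class WhitespaceStreamSketch {
  private static final Pattern NON_SPACE = Pattern.compile("\\S+");
  private Matcher matcher;
  private String term;
  private boolean exhausted = true;

  // Point the stream at new input and discard any previous state.
  void reset(CharSequence input) {
    matcher = NON_SPACE.matcher(input);
    term = null;
    exhausted = false;
  }

  // Advance to the next token; false signals end of stream.
  boolean incrementToken() {
    if (exhausted || !matcher.find()) {
      // Remember the end of stream so later calls keep returning false.
      exhausted = true;
      term = null;
      return false;
    }
    term = matcher.group();
    return true;
  }

  // Hypothetical accessor for the current token text.
  String term() {
    return term;
  }

  // Typical consumer loop: reset once, then advance until false.
  static List<String> drain(WhitespaceStreamSketch stream, CharSequence input) {
    stream.reset(input);
    List<String> out = new ArrayList<>();
    while (stream.incrementToken()) {
      out.add(stream.term());
    }
    return out;
  }

  public static void main(String[] args) {
    WhitespaceStreamSketch stream = new WhitespaceStreamSketch();
    System.out.println(drain(stream, "first input"));     // prints [first, input]
    System.out.println(drain(stream, "second one here")); // prints [second, one, here]
  }
}
```

Reusing one stream instance across inputs, as main does, is the pattern that reset(CharSequence) makes possible: the consumer allocates the stream once and calls reset for each new text.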