com.twitter.common.text.tokenizer
Class RegexTokenizer

java.lang.Object
  extended by org.apache.lucene.util.AttributeSource
      extended by com.twitter.common.text.token.TokenStream
          extended by com.twitter.common.text.tokenizer.RegexTokenizer
Direct Known Subclasses:
LatinTokenizer

public class RegexTokenizer
extends TokenStream

Tokenizes text using regular expressions that match word delimiters and punctuation characters.
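
The following is a minimal, standalone sketch of the general technique this description names: splitting on a delimiter regex and optionally emitting punctuation captured by one group of that regex. It uses only java.util.regex and is not the library's implementation. The class name DelimiterSplitSketch, the method tokenize, and the example pattern are hypothetical; the three parameters are only assumed to play the roles suggested by the names setDelimiterPattern, setPunctuationGroupInDelimiterPattern and setKeepPunctuation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of com.twitter.common.text.
final class DelimiterSplitSketch {

    // Splits input at every match of delimiterPattern. When keepPunctuation is
    // true, the text captured by punctuationGroup (if that group took part in
    // the match) is emitted as a token of its own.
    static List<String> tokenize(CharSequence input, Pattern delimiterPattern,
                                 int punctuationGroup, boolean keepPunctuation) {
        List<String> tokens = new ArrayList<>();
        Matcher m = delimiterPattern.matcher(input);
        int start = 0;
        while (m.find()) {
            // Text between the previous delimiter and this one is a token.
            if (m.start() > start) {
                tokens.add(input.subSequence(start, m.start()).toString());
            }
            if (keepPunctuation) {
                // group() returns null when the group did not participate.
                String punctuation = m.group(punctuationGroup);
                if (punctuation != null && !punctuation.isEmpty()) {
                    tokens.add(punctuation);
                }
            }
            start = m.end();
        }
        // Trailing text after the last delimiter.
        if (start < input.length()) {
            tokens.add(input.subSequence(start, input.length()).toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Assumed example pattern: whitespace runs, or one punctuation
        // character captured in group 1.
        Pattern delimiters = Pattern.compile("\\s+|([,.!?])");
        System.out.println(tokenize("Hello, world!", delimiters, 1, true));
        System.out.println(tokenize("Hello, world!", delimiters, 1, false));
    }
}
```

In this sketch the punctuation group must exist in the pattern, otherwise Matcher.group(int) throws IndexOutOfBoundsException.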


Nested Class Summary
static class RegexTokenizer.AbstractBuilder<N extends RegexTokenizer,T extends RegexTokenizer.AbstractBuilder<N,T>>
           
static class RegexTokenizer.Builder
          Builder for RegexTokenizer.
 
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
org.apache.lucene.util.AttributeSource.AttributeFactory, org.apache.lucene.util.AttributeSource.State
 
Constructor Summary
protected RegexTokenizer()
           
 
Method Summary
 boolean incrementToken()
          Consumers call this method to advance the stream to the next token.
 void reset(CharSequence input)
          Resets this TokenStream (and also downstream tokens if they exist) to parse a new input.
protected  void setDelimiterPattern(Pattern delimiterPattern)
           
protected  void setKeepPunctuation(boolean keepPunctuation)
           
protected  void setPunctuationGroupInDelimiterPattern(int group)
           
 
Methods inherited from class com.twitter.common.text.token.TokenStream
getInstanceOf, toStringList
 
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, restoreState, toString
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 

Constructor Detail

RegexTokenizer

protected RegexTokenizer()

Method Detail

setDelimiterPattern

protected void setDelimiterPattern(Pattern delimiterPattern)

setPunctuationGroupInDelimiterPattern

protected void setPunctuationGroupInDelimiterPattern(int group)

setKeepPunctuation

protected void setKeepPunctuation(boolean keepPunctuation)

incrementToken

public boolean incrementToken()
Description copied from class: TokenStream
Consumers call this method to advance the stream to the next token.

Specified by:
incrementToken in class TokenStream
Returns:
false for end of stream; true otherwise

reset

public void reset(CharSequence input)
Description copied from class: TokenStream
Resets this TokenStream (and also downstream tokens if they exist) to parse a new input.

Specified by:
reset in class TokenStream
Parameters:
input - new text to parse.
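
The calling pattern implied by reset and incrementToken (reset with an input, then advance until false is returned, then reset again to reuse the same instance) can be sketched with a small stand-in class. StreamContractSketch, its term() accessor and the drain helper are hypothetical and exist only to make the example self-contained; the real class exposes token data through the attribute mechanism inherited from AttributeSource, not through a term() method.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in that mimics only the reset/incrementToken contract.
final class StreamContractSketch {
    // Assumption for the sketch: a token is any run of non-whitespace.
    private static final Pattern WORD = Pattern.compile("\\S+");

    private Matcher matcher;
    private String current;

    // Points the stream at a new input and discards any previous position.
    void reset(CharSequence input) {
        matcher = WORD.matcher(input);
        current = null;
    }

    // Advances to the next token; returns false at end of stream.
    boolean incrementToken() {
        if (matcher == null || !matcher.find()) {
            current = null;
            return false;
        }
        current = matcher.group();
        return true;
    }

    // Hypothetical accessor for the current token text.
    String term() {
        return current;
    }

    // Typical consumer loop: reset once, then advance until exhausted.
    static List<String> drain(StreamContractSketch stream, CharSequence input) {
        stream.reset(input);
        List<String> tokens = new ArrayList<>();
        while (stream.incrementToken()) {
            tokens.add(stream.term());
        }
        return tokens;
    }

    public static void main(String[] args) {
        StreamContractSketch stream = new StreamContractSketch();
        System.out.println(drain(stream, "first input"));
        // The same instance is reused for a second input after reset.
        System.out.println(drain(stream, "a second input"));
    }
}
```

Reusing one instance across inputs, as drain does here, avoids constructing a new stream per input, which is the usual reason for a reset(CharSequence) method on a token stream.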