com.twitter.common.text.token
Class TokenStream

java.lang.Object
  extended by org.apache.lucene.util.AttributeSource
      extended by com.twitter.common.text.token.TokenStream
Direct Known Subclasses:
LuceneTokenizer2TokenStreamWrapper, RegexExtractor, RegexTokenizer, TokenGroupStream, TokenizedCharSequenceStream, TokenProcessor, TokenStreamAggregator

public abstract class TokenStream
extends org.apache.lucene.util.AttributeSource

Abstraction to enumerate a sequence of tokens. This class represents the central abstraction in Twitter's text processing library, and is similar to Lucene's TokenStream, with the following exceptions:

For an annotated example of how this class is used in practice, refer to TokenizerUsageExample.


Nested Class Summary
 
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
org.apache.lucene.util.AttributeSource.AttributeFactory, org.apache.lucene.util.AttributeSource.State
 
Constructor Summary
  TokenStream()
          Constructs a TokenStream using the default attribute factory.
protected TokenStream(org.apache.lucene.util.AttributeSource.AttributeFactory factory)
          Constructs a TokenStream using the supplied AttributeFactory for creating new Attribute instances.
protected TokenStream(org.apache.lucene.util.AttributeSource input)
          Constructs a TokenStream that uses the same attributes as the supplied one.
 
Method Summary
<T extends TokenStream> T getInstanceOf(Class<T> cls)
          Searches for and returns an instance of the specified class in this TokenStream chain.
abstract  boolean incrementToken()
          Consumers call this method to advance the stream to the next token.
abstract  void reset(CharSequence input)
          Resets this TokenStream (and also downstream tokens if they exist) to parse a new input.
 List<String> toStringList()
          Converts this token stream into a list of Strings.
 
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, restoreState, toString
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 

Constructor Detail

TokenStream

public TokenStream()
Constructs a TokenStream using the default attribute factory.


TokenStream

protected TokenStream(org.apache.lucene.util.AttributeSource.AttributeFactory factory)
Constructs a TokenStream using the supplied AttributeFactory for creating new Attribute instances.

Parameters:
factory - attribute factory

TokenStream

protected TokenStream(org.apache.lucene.util.AttributeSource input)
Constructs a TokenStream that uses the same attributes as the supplied one.

Parameters:
input - attribute source

Method Detail

incrementToken

public abstract boolean incrementToken()
Consumers call this method to advance the stream to the next token.

Returns:
false for end of stream; true otherwise
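
The return-value contract above implies a simple consumer loop: keep calling incrementToken() until it returns false. The following is a minimal sketch of that loop, not the library's implementation. SketchStream, WhitespaceStream, term() and drain() are hypothetical stand-ins introduced only for illustration; the real class presumably exposes token data through the attribute methods inherited from AttributeSource rather than through a term() accessor.

```java
import java.util.ArrayList;
import java.util.List;

class IncrementTokenDemo {
  // Hypothetical stand-in for the incrementToken() contract; NOT the library class.
  abstract static class SketchStream {
    // Advances to the next token; returns false at end of stream, true otherwise.
    abstract boolean incrementToken();

    // Text of the current token; only meaningful after incrementToken() returned true.
    abstract CharSequence term();
  }

  // Hypothetical tokenizer that splits its input on whitespace.
  static class WhitespaceStream extends SketchStream {
    private final String input;
    private int pos = 0;    // next character to examine
    private int start = 0;  // start offset of the current token
    private int end = 0;    // end offset (exclusive) of the current token

    WhitespaceStream(String input) {
      this.input = input;
    }

    @Override
    boolean incrementToken() {
      // Skip any whitespace before the next token.
      while (pos < input.length() && Character.isWhitespace(input.charAt(pos))) {
        pos++;
      }
      if (pos >= input.length()) {
        return false; // end of stream
      }
      start = pos;
      while (pos < input.length() && !Character.isWhitespace(input.charAt(pos))) {
        pos++;
      }
      end = pos;
      return true;
    }

    @Override
    CharSequence term() {
      return input.subSequence(start, end);
    }
  }

  // Consumer loop: advance until incrementToken() reports end of stream.
  static List<String> drain(SketchStream stream) {
    List<String> tokens = new ArrayList<>();
    while (stream.incrementToken()) {
      tokens.add(stream.term().toString());
    }
    return tokens;
  }

  public static void main(String[] args) {
    // prints [hello, token, stream]
    System.out.println(drain(new WhitespaceStream("hello token stream")));
  }
}
```

The consumer never inspects the stream's length up front; the boolean result is the only end-of-stream signal.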

reset

public abstract void reset(CharSequence input)
Resets this TokenStream (and also downstream tokens if they exist) to parse a new input.

Parameters:
input - new text to parse.
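
Because reset takes the new input as an argument, one stream instance can be reused across inputs. The sketch below illustrates that reuse pattern with a two-stage chain. All names in it (SketchStream, Splitter, LowerCaser, tokenize) are hypothetical, and forwarding reset to the wrapped stream is an assumption about how chained streams behave, not something this page confirms.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class ResetDemo {
  // Hypothetical stand-in for the reset/incrementToken contract; NOT the library class.
  abstract static class SketchStream {
    abstract void reset(CharSequence input);
    abstract boolean incrementToken();
    abstract String term();
  }

  // Hypothetical source stage: splits the input on whitespace.
  static class Splitter extends SketchStream {
    private String[] parts = new String[0];
    private int index = -1;

    @Override
    void reset(CharSequence input) {
      String trimmed = input.toString().trim();
      parts = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
      index = -1; // rewind so the next incrementToken() yields the first token
    }

    @Override
    boolean incrementToken() {
      return ++index < parts.length;
    }

    @Override
    String term() {
      return parts[index];
    }
  }

  // Hypothetical wrapping stage: lower-cases tokens and forwards reset
  // to the stage it wraps (assumed behavior for a chain).
  static class LowerCaser extends SketchStream {
    private final SketchStream inner;

    LowerCaser(SketchStream inner) {
      this.inner = inner;
    }

    @Override
    void reset(CharSequence input) {
      inner.reset(input);
    }

    @Override
    boolean incrementToken() {
      return inner.incrementToken();
    }

    @Override
    String term() {
      return inner.term().toLowerCase(Locale.ROOT);
    }
  }

  // Resets the chain to a new input, then consumes it to the end.
  static List<String> tokenize(SketchStream stream, CharSequence input) {
    stream.reset(input);
    List<String> out = new ArrayList<>();
    while (stream.incrementToken()) {
      out.add(stream.term());
    }
    return out;
  }

  public static void main(String[] args) {
    SketchStream chain = new LowerCaser(new Splitter());
    System.out.println(tokenize(chain, "First INPUT"));       // prints [first, input]
    System.out.println(tokenize(chain, "Second Input Here")); // prints [second, input, here]
  }
}
```

The same chain object handles both inputs; only its internal state is replaced on each reset.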

toStringList

public List<String> toStringList()
Converts this token stream into a list of Strings.

Returns:
the contents of the token stream as a list of Strings.

getInstanceOf

public <T extends TokenStream> T getInstanceOf(Class<T> cls)
Searches for and returns an instance of the specified class in this TokenStream chain.

Parameters:
cls - class to search for
Returns:
instance of the class cls if found or null if not found
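
One plausible way to implement such a lookup is to walk the chain from the outermost stream inward and return the first element of the requested type, or null if none matches. The sketch below shows that idea under stated assumptions: Stage, Tokenizer, Stemmer and Unused are hypothetical classes, the chain is modeled as a simple linked wrapper, and matching uses Class.isInstance (so subclasses also match). The actual library may represent the chain or match classes differently.

```java
class GetInstanceOfDemo {
  // Hypothetical chain element; NOT the library class.
  static class Stage {
    private final Stage inner; // wrapped stage, or null at the end of the chain

    Stage(Stage inner) {
      this.inner = inner;
    }

    // Walks the chain from this stage inward; returns the first stage that is
    // an instance of cls, or null if no stage matches.
    <T extends Stage> T getInstanceOf(Class<T> cls) {
      for (Stage s = this; s != null; s = s.inner) {
        if (cls.isInstance(s)) {
          return cls.cast(s);
        }
      }
      return null;
    }
  }

  static class Tokenizer extends Stage {
    Tokenizer() {
      super(null);
    }
  }

  static class Stemmer extends Stage {
    Stemmer(Stage inner) {
      super(inner);
    }
  }

  static class Unused extends Stage {
    Unused() {
      super(null);
    }
  }

  public static void main(String[] args) {
    Tokenizer tokenizer = new Tokenizer();
    Stage chain = new Stemmer(tokenizer);

    System.out.println(chain.getInstanceOf(Tokenizer.class) == tokenizer); // prints true
    System.out.println(chain.getInstanceOf(Unused.class) == null);         // prints true
  }
}
```

Since the method returns null when nothing matches, callers need a null check before using the result.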