com.twitter.common.text.extractor
Class RegexExtractor

java.lang.Object
  extended by org.apache.lucene.util.AttributeSource
      extended by com.twitter.common.text.token.TokenStream
          extended by com.twitter.common.text.extractor.RegexExtractor
Direct Known Subclasses:
EmoticonExtractor, HashtagExtractor, URLExtractor, UserNameExtractor

public class RegexExtractor
extends TokenStream

Extracts entities from text according to a given regular expression.


Nested Class Summary
static class RegexExtractor.AbstractBuilder<N extends RegexExtractor,T extends RegexExtractor.AbstractBuilder<N,T>>
           
static class RegexExtractor.Builder
           
 
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
org.apache.lucene.util.AttributeSource.AttributeFactory, org.apache.lucene.util.AttributeSource.State
 
Constructor Summary
protected RegexExtractor()
          Protected constructor for subclass builders. Clients should use a builder to create an instance.
 
Method Summary
 boolean incrementToken()
          Consumers call this method to advance the stream to the next token.
 void reset(CharSequence input)
          Reset the extractor to use a new CharSequence as input.
protected  void setRegexPattern(Pattern pattern)
          Sets the regular expression used in this RegexExtractor.
protected  void setRegexPattern(Pattern pattern, int startGroup, int endGroup)
          Sets the regular expression and start/end group ID used in this RegexExtractor.
protected  void setTriggeringChar(char triggeringChar)
          Sets a character that must appear in the input text.
 
Methods inherited from class com.twitter.common.text.token.TokenStream
getInstanceOf, toStringList
 
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, restoreState, toString
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 

Constructor Detail

RegexExtractor

protected RegexExtractor()
Protected constructor for subclass builders. Clients should use a builder to create an instance.

Method Detail

setRegexPattern

protected void setRegexPattern(Pattern pattern)
Sets the regular expression used in this RegexExtractor.

Parameters:
pattern - regular expression defining the entities to be extracted

setRegexPattern

protected void setRegexPattern(Pattern pattern,
                               int startGroup,
                               int endGroup)
Sets the regular expression and start/end group ID used in this RegexExtractor.

Parameters:
pattern - regular expression matching the substring to be extracted.
startGroup - ID of the group in the pattern that matches the beginning of the substring being extracted. Set to 0 to match the entire pattern.
endGroup - ID of the group in the pattern that matches the end of the substring being extracted. Set to 0 to match the entire pattern.
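
The start/end group parameters can be illustrated with plain java.util.regex. The sketch below is not the RegexExtractor implementation; it assumes the extracted span runs from the start offset of startGroup to the end offset of endGroup, and the class name GroupSpanSketch and its extract helper are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in, not part of com.twitter.common.text.
final class GroupSpanSketch {

  // Collects, for every match, the text from the start of startGroup
  // to the end of endGroup. Group 0 denotes the entire match.
  static List<String> extract(Pattern pattern, int startGroup, int endGroup, CharSequence input) {
    List<String> entities = new ArrayList<>();
    Matcher matcher = pattern.matcher(input);
    while (matcher.find()) {
      int start = matcher.start(startGroup); // -1 if the group did not participate
      int end = matcher.end(endGroup);       // -1 if the group did not participate
      if (start >= 0 && end >= start) {
        entities.add(input.subSequence(start, end).toString());
      }
    }
    return entities;
  }

  public static void main(String[] args) {
    // Group 1 is the leading delimiter; group 2 is the hashtag itself.
    Pattern hashtag = Pattern.compile("(^|\\s)(#\\w+)");
    String text = "shipping #lucene and #regex today";
    System.out.println(extract(hashtag, 2, 2, text)); // hashtag only
    System.out.println(extract(hashtag, 0, 0, text)); // entire match, delimiter included
  }
}
```

With startGroup and endGroup both set to 2, the leading whitespace captured by group 1 is excluded from each extracted entity; with both set to 0 it is included.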

setTriggeringChar

protected void setTriggeringChar(char triggeringChar)
Sets a character that must appear in the input text. If the specified character does not appear in the input text, this RegexExtractor does not extract entities from the text. Specifying a triggeringChar may improve performance by skipping unnecessary pattern matching.

Parameters:
triggeringChar - a character that must appear in the text
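
The short-circuit described above can be sketched with the standard library alone. This is an illustration under assumptions, not the library's code: the class TriggeringCharSketch and its methods are hypothetical, and the real extractor may perform the check differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in, not part of com.twitter.common.text.
final class TriggeringCharSketch {

  // Linear scan for a single character; cheaper than running the regex engine.
  static boolean containsChar(CharSequence input, char c) {
    for (int i = 0; i < input.length(); i++) {
      if (input.charAt(i) == c) {
        return true;
      }
    }
    return false;
  }

  // Runs the pattern only when the triggering character is present.
  static List<String> extract(Pattern pattern, char triggeringChar, CharSequence input) {
    List<String> entities = new ArrayList<>();
    if (!containsChar(input, triggeringChar)) {
      return entities; // no trigger, so pattern matching is skipped entirely
    }
    Matcher matcher = pattern.matcher(input);
    while (matcher.find()) {
      entities.add(matcher.group());
    }
    return entities;
  }

  public static void main(String[] args) {
    Pattern mention = Pattern.compile("@\\w+");
    System.out.println(extract(mention, '@', "thanks @alice and @bob"));
    System.out.println(extract(mention, '@', "no mentions here"));
  }
}
```

The triggering character should be one that every possible match must contain, such as '@' for a user-name pattern; otherwise valid entities would be skipped.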

reset

public void reset(CharSequence input)
Reset the extractor to use a new CharSequence as input.

Specified by:
reset in class TokenStream
Parameters:
input - CharSequence from which to extract the entities.

incrementToken

public boolean incrementToken()
Description copied from class: TokenStream
Consumers call this method to advance the stream to the next token.

Specified by:
incrementToken in class TokenStream
Returns:
false for end of stream; true otherwise
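
The reset/incrementToken contract can be sketched as a consumer loop. The block below uses a hypothetical MiniExtractor built on java.util.regex rather than the real class, so that it runs without the library; the term() accessor is an assumption standing in for the attribute-based access that a TokenStream inherits from AttributeSource.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in that mimics the reset/incrementToken contract only.
final class MiniExtractor {
  private final Pattern pattern;
  private Matcher matcher;
  private boolean exhausted = true;
  private String term;

  MiniExtractor(Pattern pattern) {
    this.pattern = pattern;
  }

  // Points the extractor at new input and rewinds the stream.
  void reset(CharSequence input) {
    matcher = pattern.matcher(input);
    exhausted = false;
    term = null;
  }

  // Advances to the next match; returns false at end of stream.
  boolean incrementToken() {
    // The exhausted flag matters: Matcher.find() restarts from the
    // beginning when called again after an unsuccessful search.
    if (exhausted || !matcher.find()) {
      exhausted = true;
      term = null;
      return false;
    }
    term = matcher.group();
    return true;
  }

  // Current token text (assumed accessor, see the lead-in).
  String term() {
    return term;
  }

  // Typical consumer loop: reset once, then advance until false.
  static List<String> drain(MiniExtractor extractor, CharSequence input) {
    List<String> tokens = new ArrayList<>();
    extractor.reset(input);
    while (extractor.incrementToken()) {
      tokens.add(extractor.term());
    }
    return tokens;
  }

  public static void main(String[] args) {
    MiniExtractor extractor = new MiniExtractor(Pattern.compile("#\\w+"));
    System.out.println(drain(extractor, "reading #javadoc for #lucene"));
    System.out.println(drain(extractor, "same instance, new input, no tags"));
  }
}
```

One instance is reused across inputs here, which is the usage pattern that a reset(CharSequence) method suggests.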