Package

com.twitter.penguin.korean

tokenizer

package tokenizer

Type Members

  1. case class KoreanChunk(text: String, offset: Int, length: Int) extends Product with Serializable

  2. case class Sentence(text: String, start: Int, end: Int) extends Product with Serializable


Value Members

  1. object KoreanChunker

    Splits input text into Korean chunks (어절).

  2. object KoreanSentenceSplitter

  3. object KoreanTokenizer

    Provides Korean tokenization.

    Chunk (어절): a unit delimited by whitespace, e.g. 사랑하는사람을.

    Word (단어): a single constituent of a sentence, e.g. 사랑하는, 사람을.

    Token (토큰): a unit similar to a morpheme, though not always grammatically exact, e.g. 사랑, 하는, 사람, 을.

    Whenever the behavior of KoreanParser changes, the initial cache has to be regenerated by running tools.CreateInitialCache.
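The chunk definition above (a unit delimited by whitespace, carried as text plus offset and length) can be illustrated with a minimal, self-contained sketch. This is not the library's KoreanChunker, whose actual rules are not described on this page. `SketchChunk` and `WhitespaceChunker` are hypothetical names introduced here, and splitting on whitespace alone is a simplifying assumption.

```scala
// Hypothetical stand-in that mirrors the field layout of
// KoreanChunk(text, offset, length) listed under Type Members.
case class SketchChunk(text: String, offset: Int, length: Int)

// Assumption: a chunk is any maximal run of non-whitespace characters.
object WhitespaceChunker {
  private val NonSpace = """\S+""".r

  def chunk(input: String): List[SketchChunk] =
    NonSpace.findAllMatchIn(input).map { m =>
      // m.start and m.end are UTF-16 indices into the input string.
      SketchChunk(m.matched, m.start, m.end - m.start)
    }.toList
}
```

Calling `WhitespaceChunker.chunk("사랑하는사람을 만났다")` under these assumptions yields two chunks, the first starting at offset 0 and the second at offset 8. Storing offset and length alongside the text lets a caller map each chunk back to its position in the original input.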
