Count Min Sketch
Count-min sketch is a probablistic data structure that estimates the frequencies of elements in a data stream. Count-min sketches are somewhat similar to Bloom filters; the main distinction is that Bloom filters represent sets, while count-min sketches represent multisets. For more info, see Wikipedia.
In Algebird, count-min sketches are represented as the abstract class CMS
, along with the CMSMonoid
class. Here’s an example usage:
import com.twitter.algebird._
// import com.twitter.algebird._
import CMSHasherImplicits._
// import CMSHasherImplicits._
val DELTA = 1E-10
// DELTA: Double = 1.0E-10
val EPS = 0.001
// EPS: Double = 0.001
val SEED = 1
// SEED: Int = 1
val HEAVY_HITTERS_PCT = 0.01
// HEAVY_HITTERS_PCT: Double = 0.01
val CMS_MONOID = TopPctCMS.monoid[Long](EPS, DELTA, SEED, HEAVY_HITTERS_PCT)
// CMS_MONOID: com.twitter.algebird.TopPctCMSMonoid[Long] = com.twitter.algebird.TopPctCMSMonoid@1ba0d98d
val data = List(1L, 1L, 3L, 4L, 5L)
// data: List[Long] = List(1, 1, 3, 4, 5)
val cms = CMS_MONOID.create(data)
// cms: com.twitter.algebird.TopCMS[Long] = TopCMSInstance(SparseCMS(Map(1 -> 2, 3 -> 1, 4 -> 1, 5 -> 1),5,CMSParams(Vector(CMSHash(-1155869325,0,2719), CMSHash(431529176,0,2719), CMSHash(1761283695,0,2719), CMSHash(1749940626,0,2719), CMSHash(892128508,0,2719), CMSHash(155629808,0,2719), CMSHash(1429008869,0,2719), CMSHash(-1465154083,0,2719), CMSHash(-138487339,0,2719), CMSHash(-1242363800,0,2719), CMSHash(26273138,0,2719), CMSHash(655996946,0,2719), CMSHash(-155886662,0,2719), CMSHash(685382526,0,2719), CMSHash(-258276172,0,2719), CMSHash(-1915244828,0,2719), CMSHash(-226796111,0,2719), CMSHash(-382464772,0,2719), CMSHash(-270230103,0,2719), CMSHash(2092024379,0,2719), CMSHash(1705850753,0,2719), CMSHash(-369526632,0,2719), CMSHash(1492578621,0,2719), CMSHash(684358198,0,2719)),0.001,1....
cms.totalCount
// res0: Long = 5
cms.frequency(1L).estimate
// res1: Long = 2
cms.frequency(2L).estimate
// res2: Long = 0
cms.frequency(3L).estimate
// res3: Long = 1
CMS Hasher
The Count-Min sketch uses d
(aka depth
) pair-wise independent hash functions drawn from a universal hashing family of the form:
h(x) = [a * x + b (mod p)] (mod m)
As a requirement for using CMS
you must provide an implicit CMSHasher[K]
for the type K
of the items you want to count. Algebird ships with several such implicits for commonly used types K
such as Long
and scala.BigInt
.
If your type K
is not supported out of the box, you have two options:
- You provide a “translation” function to convert items of your (unsupported) type
K
to a supported type such asDouble
, and then use thecontramap
function ofCMSHasher
to create the requiredCMSHasher[K]
for your type (see the documentation ofcontramap
for an example) - You implement a
CMSHasher[K]
from scratch, using the existingCMSHasher
implementations as a starting point.
Sketch Map
A Sketch Map is a generalized version of the Count-Min Sketch that is an approximation of Map[K, V] that stores reference to top heavy hitters. The Sketch Map can approximate the sums of any summable value that has a monoid.
val DELTA = 1E-8
// DELTA: Double = 1.0E-8
val EPS = 0.001
// EPS: Double = 0.001
val SEED = 1
// SEED: Int = 1
val HEAVY_HITTERS_COUNT = 10
// HEAVY_HITTERS_COUNT: Int = 10
implicit def string2Bytes(i : String) = i.toCharArray.map(_.toByte)
// string2Bytes: (i: String)Array[Byte]
val PARAMS = SketchMapParams[String](SEED, EPS, DELTA, HEAVY_HITTERS_COUNT)
// PARAMS: com.twitter.algebird.SketchMapParams[String] = SketchMapParams(1,2719,19,10)
val MONOID = SketchMap.monoid[String, Long](PARAMS)
// MONOID: com.twitter.algebird.SketchMapMonoid[String,Long] = com.twitter.algebird.SketchMapMonoid@61ec1133
val data = List( ("1", 1L), ("3", 2L), ("4", 1L), ("5", 1L) )
// data: List[(String, Long)] = List((1,1), (3,2), (4,1), (5,1))
val sm = MONOID.create(data)
// sm: com.twitter.algebird.SketchMap[String,Long] =
// SketchMap(Row: 19, Cols: 2719. Dense elements:
// List((282,1), (763,2), (2172,1), (2523,1))
// List((112,1), (364,2), (923,1), (2433,1))
// List((1615,1), (1652,1), (1794,2), (2357,1))
// List((497,2), (1687,1), (2288,1), (2395,1))
// List((1115,2), (1252,1), (1441,1), (1798,1))
// List((309,1), (556,2), (944,1), (2580,1))
// List((474,1), (570,2), (1073,1), (2271,1))
// List((194,1), (1101,2), (1420,1), (2296,1))
// List((505,1), (1943,1), (1949,1), (2047,2))
// List((1062,1), (1646,1), (1840,1), (2139,2))
// List((549,1), (820,1), (1468,1), (1547,2))
// List((837,2), (1064,1), (1941,1), (2305,1))
// List((338,1), (482,1), (1331,2), (2285,1))
// List((158,1), (1010,2), (1244,1), (1725,1))
// List((90,1), (606,1), (1464,1), (1474,2))
// List((143,1), (768,1), (935,2), (1340,1))
// List(...
sm.totalValue
// res4: Long = 5
MONOID.frequency(sm, "1")
// res5: Long = 1
MONOID.frequency(sm, "2")
// res6: Long = 0
MONOID.frequency(sm, "3")
// res7: Long = 2
Documentation Help
We’d love your help fleshing out this documentation! You can edit this page in your browser by clicking this link. These links might be helpful: