One-sided bound on the error of each point query, i.e. frequency estimate.
A bound on the probability that a query estimate does not lie within some small interval (an interval that depends on eps) around the truth.
A seed to initialize the random number generator used to create the pairwise independent hash functions.
An optional parameter specifying how many exact counts a sparse CMS should keep.
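To make the roles of eps and delta more concrete, the following sketch computes table dimensions using the standard Count-Min sizing rule (width = ceil(e / eps), depth = ceil(ln(1 / delta))). This is an assumption drawn from the usual Count-Min sketch analysis, not something stated in this documentation, and the object name CmsSizing is hypothetical. Under that analysis an estimate never falls below the true count, and it exceeds the true count by more than eps times the total count with probability at most delta.

```scala
// Hypothetical helper illustrating the textbook Count-Min sizing rule.
// Assumption: the sketch uses `width` counters per row and `depth` rows.
object CmsSizing {
  // Smaller eps => wider rows => smaller over-count per query.
  def width(eps: Double): Int = math.ceil(math.E / eps).toInt

  // Smaller delta => more rows => lower chance of exceeding the bound.
  def depth(delta: Double): Int = math.ceil(math.log(1.0 / delta)).toInt
}
```

For example, eps = 0.001 and delta = 1e-10 would give 2719 counters per row and 24 rows under this rule.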
Creates a sketch out of multiple items.
Creates a sketch out of a single item.
Combines the two sketches.
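A minimal sketch of creating and combining sketches, assuming Algebird is on the classpath and that the Long hasher is available through CMSHasherImplicits as referenced in this documentation. The parameter values and the object name CreateAndPlusExample are illustrative only.

```scala
import com.twitter.algebird.{CMS, CMSMonoid}
import com.twitter.algebird.CMSHasherImplicits._

object CreateAndPlusExample {
  // Illustrative parameters: eps, delta, seed.
  val monoid: CMSMonoid[Long] = CMS.monoid[Long](0.001, 1e-10, 1)

  // A sketch built from multiple items, and one built from a single item.
  val fromMany: CMS[Long] = monoid.create(Seq(1L, 2L, 2L))
  val fromOne: CMS[Long] = monoid.create(3L)

  // Combine the two sketches; the result describes all four items.
  val merged: CMS[Long] = monoid.plus(fromMany, fromOne)

  // Point query: the estimate is an upper bound on the true count of 2.
  val estimateOfTwo: Long = merged.frequency(2L).estimate
}
```

Because the error is one-sided, estimateOfTwo is at least 2; with a table this large relative to the input it will usually be exactly 2, but that is not guaranteed.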
Returns an instance of T calculated by summing all instances in iter in one pass.
Returns an instance of T calculated by summing all instances in iter in one pass. Returns None if iter is empty, else Some[T].
None if iter is empty, else an option value containing the summed T
Returns the identity element of T for plus.
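The summing methods and the identity element described above could be exercised as follows. This is a sketch under the same assumptions as any Algebird usage (library on the classpath, Long hasher imported from CMSHasherImplicits); the object name SumExample and the parameter values are hypothetical.

```scala
import com.twitter.algebird.{CMS, CMSMonoid}
import com.twitter.algebird.CMSHasherImplicits._

object SumExample {
  val monoid: CMSMonoid[Long] = CMS.monoid[Long](0.001, 1e-10, 1)

  val sketches: Seq[CMS[Long]] = Seq(
    monoid.create(Seq(1L, 1L)),
    monoid.create(Seq(2L, 3L)),
    monoid.create(4L)
  )

  // Sum all sketches in one pass.
  val summed: CMS[Long] = monoid.sum(sketches)

  // sumOption distinguishes the empty input from a real result.
  val nonEmptySum: Option[CMS[Long]] = monoid.sumOption(sketches)
  val emptySum: Option[CMS[Long]] = monoid.sumOption(Seq.empty[CMS[Long]])

  // The identity element: adding it to a sketch leaves the counts unchanged.
  val withZero: CMS[Long] = monoid.plus(summed, monoid.zero)
}
```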
Monoid for adding CMS sketches.
Usage
eps and delta are parameters that bound the error of each query estimate. For example, errors in answering point queries (e.g., how often has element x appeared in the stream described by the sketch?) are often of the form: "with probability p >= 1 - delta, the estimate is close to the truth by some factor depending on eps."
The type K is the type of items you want to count. You must provide an implicit CMSHasher[K] for K, and Algebird ships with several such implicits for commonly used types such as Long and scala.BigInt.
If your type K is not supported out of the box, you have two options: 1) You provide a "translation" function to convert items of your (unsupported) type K to a supported type such as Double, and then use the contramap function of CMSHasher to create the required CMSHasher[K] for your type (see the documentation of CMSHasher for an example); 2) You implement a CMSHasher[K] from scratch, using the existing CMSHasher implementations as a starting point.
Note: Because Arrays in Scala/Java do not have sane equals and hashCode implementations, you cannot safely use types such as Array[Byte]. Extra work is required for Arrays. For example, you may opt to convert Array[T] to a Seq[T] via toSeq, or you can provide appropriate wrapper classes. Algebird provides one such wrapper class, Bytes, to safely wrap an Array[Byte] for use with CMS.
The type used to identify the elements to be counted. For example, if you want to count the occurrence of user names, you could map each username to a unique numeric ID expressed as a
Long, and then count the occurrences of those Longs with a CMS of type K=Long. Note that this mapping between the elements of your problem domain and their identifiers used for counting via CMS should be bijective. We require a CMSHasher context bound for K, see CMSHasherImplicits for available implicits that can be imported. Which type K should you pick in practice? For domains that have fewer than 2^64 unique elements, you'd typically use Long. For larger domains you can try scala.BigInt, for example. Other possibilities include Spire's SafeLong and Numerical data types (https://github.com/non/spire), though Algebird does not include the required implicits for CMS-hashing (cf. CMSHasherImplicits).
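The note on Arrays above can be checked with plain Scala, without any Algebird dependency. The snippet below is only an illustration of why Array[Byte] is unsafe as a key type and why converting via toSeq helps; the object name ArrayEqualityDemo is hypothetical.

```scala
object ArrayEqualityDemo {
  val a: Array[Byte] = Array[Byte](1, 2, 3)
  val b: Array[Byte] = Array[Byte](1, 2, 3)

  // Arrays compare by reference, so two arrays with identical contents
  // are treated as different keys.
  val rawArraysEqual: Boolean = a == b

  // Converting to Seq gives structural equality and a content-based hashCode.
  val seqsEqual: Boolean = a.toSeq == b.toSeq
  val seqHashesEqual: Boolean = a.toSeq.hashCode == b.toSeq.hashCode
}
```

Here rawArraysEqual is false while seqsEqual and seqHashesEqual are both true, which is the property a key type needs before it can be counted reliably.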