HyperLogLog
HyperLogLog is an approximation algorithm that estimates the number of distinct elements in a multiset using a constant amount of memory. (See https://en.wikipedia.org/wiki/HyperLogLog)
In Algebird, the HLL class can be used like a set. You can create an HLL from an element, and you can combine HLLs using the +
operation, which is like set union. Like most structures in Algebird, HLLs form a monoid, where set union (+
) is the addition operation.
Unlike a normal set, HLL cannot lookup whether elements are in the set. If you want to do this, consider using a BloomFilter, which can check elements for set membership, but is less memory-efficient than a HLL.
Note: Internally, HLL represents its elements using hashes, which are internally represented as
Array[Byte]
. For this reason, some HLL methods take anArray[Byte]
, assuming that the user will convert the object to anArray[Byte]
. However, to make using the HLLs easier, there are also generic methods that take any typeK
as long as there’s evidence of aHash128[K]
, which represents a 128-bit hash function for objects of typeK
.
HLLs cannot be instantiated directly. Instead, you must create them using one of the following strategies:
HyperLogLogMonoid
The HyperLogLogMonoid
class is the simplest way to create HLLs. HyperLogLogMonoid
is quite barebones, so, in most situations, it makes sense to use the HyperLogLogAggregator
methods mentioned below.
Step 1: Create a HyperLogLogMonoid
The HyperLogLogMonoid constructor takes an Int bits
, which represents the number of bits of the hash function that the HLL uses. The more bits you use, the more space the HLLs will take up, and the more precise your estimates will be. For a better understanding of the space-to-accuracy trade-off, see this table or use one of the other strategies mentioned below, which allow you to specify the desired error.
import com.twitter.algebird._
// import com.twitter.algebird._
val hllMonoid = new HyperLogLogMonoid(bits = 4)
// hllMonoid: com.twitter.algebird.HyperLogLogMonoid = com.twitter.algebird.HyperLogLogMonoid@70816c74
Step 2: Create HLLs from your data
HyperLogLogMonoid has a create
method which takes a hashed element (as a Array[Byte]
) and returns an HLL
which represents a set containing the given element.
We can create an HLL containing a list of elements by creating HLLs for each element using the create
method, and combining the elements using the HyperLogLogMonoid’s sum
method.
import com.twitter.algebird.HyperLogLog.int2Bytes
// import com.twitter.algebird.HyperLogLog.int2Bytes
val data = List(1, 1, 2, 2, 3, 3, 4, 4, 5, 5)
// data: List[Int] = List(1, 1, 2, 2, 3, 3, 4, 4, 5, 5)
val hlls = data.map { hllMonoid.create(_) }
// hlls: List[com.twitter.algebird.HLL] = List(SparseHLL(4,Map(2 -> Max(1))), SparseHLL(4,Map(2 -> Max(1))), SparseHLL(4,Map(2 -> Max(2))), SparseHLL(4,Map(2 -> Max(2))), SparseHLL(4,Map(13 -> Max(1))), SparseHLL(4,Map(13 -> Max(1))), SparseHLL(4,Map(4 -> Max(1))), SparseHLL(4,Map(4 -> Max(1))), SparseHLL(4,Map(12 -> Max(1))), SparseHLL(4,Map(12 -> Max(1))))
val combinedHLL = hllMonoid.sum(hlls)
// combinedHLL: com.twitter.algebird.HLL = DenseHLL(4,Bytes(0,0,2,0,1,0,0,0,0,0,0,0,1,1,0,0))
Note that we were able to call hllMonoid.create
on an Int
because we imported the implicit conversion int2Bytes
that converts an Int
into a Array[Byte]
.
Step 3: Approximating set cardinality using an HLL
We can use the sizeOf
method to estimate the approximate number of distinct elements in the multiset.
val approxSizeOf = hllMonoid.sizeOf(combinedHLL)
// approxSizeOf: com.twitter.algebird.Approximate[Long] = Approximate(1,4,9,0.9972)
HyperLogLogAggregator
The HyperLogLogAggregator
object contains some methods that make it easy to instantiate Algebird Aggregators that internally use HLLs.
We will only mention the generic Aggregators here. The generic aggregators require that we have an implicit Hash128[K]
instance. Algebird has instances of Hash128
for many common types such as Int
, Long
and String
.
To learn more about the Array[Byte]
aggregators, see the source code of HyperLogLog.
HyperLogLogAggregator.withErrorGeneric[K: Hash128](error: Double)
This is an aggregator of type Aggregator[K, HLL, HLL]
, which means that it builds a HLL
from a TraversableOnce
of K
s.
It takes an error
, which must be a Double in the range (0,1).
val agg = HyperLogLogAggregator.withErrorGeneric[Int](0.01)
// agg: com.twitter.algebird.GenHLLAggregator[Int] = GenHLLAggregator(com.twitter.algebird.HyperLogLogMonoid@5f19c35f,com.twitter.algebird.Hash128$$anon$5@a769e28)
val data = List(1, 1, 2, 2, 3, 3, 4, 4, 5, 5)
// data: List[Int] = List(1, 1, 2, 2, 3, 3, 4, 4, 5, 5)
val combinedHll: HLL = agg(data)
// combinedHll: com.twitter.algebird.HLL = SparseHLL(14,Map(10388 -> Max(1), 9650 -> Max(1), 15708 -> Max(2), 10082 -> Max(3), 6205 -> Max(1)))
HyperLogLogAggregator.withBits[K: Hash128](bits: Int)
Similar to withErrorGeneric
, but takes the number of bits as an Int.
val agg = HyperLogLogAggregator.withBits[Int](9)
// agg: com.twitter.algebird.GenHLLAggregator[Int] = GenHLLAggregator(com.twitter.algebird.HyperLogLogMonoid@7f9a7f3a,com.twitter.algebird.Hash128$$anon$5@a769e28)
val data = List(1, 1, 2, 2, 3, 3, 4, 4, 5, 5)
// data: List[Int] = List(1, 1, 2, 2, 3, 3, 4, 4, 5, 5)
val combinedHll: HLL = agg(data)
// combinedHll: com.twitter.algebird.HLL = SparseHLL(9,Map(348 -> Max(2), 61 -> Max(3), 148 -> Max(3), 434 -> Max(2), 354 -> Max(1)))
HyperLogLogAggregator.sizeWithErrorGeneric[K: Hash128](err: Double)
This is an aggregator of type Aggregator[K, HLL, Long]
, which means that it presents a Long value. The Long that it returns is the estimated size of the combined HLL.
val agg = HyperLogLogAggregator.sizeWithErrorGeneric[Int](0.01)
// agg: com.twitter.algebird.MonoidAggregator[Int,com.twitter.algebird.HLL,Long] = com.twitter.algebird.MonoidAggregator$$anon$6@2c9c4584
val data = List(1, 1, 2, 2, 3, 3, 4, 4, 5, 5)
// data: List[Int] = List(1, 1, 2, 2, 3, 3, 4, 4, 5, 5)
val approximateSize: Long = agg(data)
// approximateSize: Long = 5
REPL Tour
import HyperLogLog._
// import HyperLogLog._
val hll = new HyperLogLogMonoid(4)
// hll: com.twitter.algebird.HyperLogLogMonoid = com.twitter.algebird.HyperLogLogMonoid@7e5c2173
val data = List(1, 1, 2, 2, 3, 3, 4, 4, 5, 5)
// data: List[Int] = List(1, 1, 2, 2, 3, 3, 4, 4, 5, 5)
val seqHll = data.map { hll(_) }
// <console>:21: warning: method apply in class HyperLogLogMonoid is deprecated: Use toHLL
// val seqHll = data.map { hll(_) }
// ^
// seqHll: List[com.twitter.algebird.HLL] = List(SparseHLL(4,Map(2 -> Max(1))), SparseHLL(4,Map(2 -> Max(1))), SparseHLL(4,Map(2 -> Max(2))), SparseHLL(4,Map(2 -> Max(2))), SparseHLL(4,Map(13 -> Max(1))), SparseHLL(4,Map(13 -> Max(1))), SparseHLL(4,Map(4 -> Max(1))), SparseHLL(4,Map(4 -> Max(1))), SparseHLL(4,Map(12 -> Max(1))), SparseHLL(4,Map(12 -> Max(1))))
val sumHll = hll.sum(seqHll)
// sumHll: com.twitter.algebird.HLL = DenseHLL(4,Bytes(0,0,2,0,1,0,0,0,0,0,0,0,1,1,0,0))
val approxSizeOf = hll.sizeOf(sumHll)
// approxSizeOf: com.twitter.algebird.Approximate[Long] = Approximate(1,4,9,0.9972)
val actualSize = data.toSet.size
// actualSize: Int = 5
val estimate = approxSizeOf.estimate
// estimate: Long = 4
Documentation Help
We’d love your help fleshing out this documentation! You can edit this page in your browser by clicking this link. These links might be helpful: