Exponential Histogram
The ExpHist
data structure implements the Exponential Histogram algorithm from Maintaining Stream Statistics over Sliding Windows, by Datar, Gionis, Indyk and Motwani.
An Exponential Histogram is a sliding window counter that can guarantee a bounded relative error. You configure the data structure with
epsilon
, the relative error you’re willing to toleratewindowSize
, the number of time ticks that you want to track
You interact with the data structure by adding (number, timestamp) pairs into the exponential histogram. querying it for an approximate counts with guess
.
The approximate count is guaranteed to be within conf.epsilon
relative error of the true count seen across the supplied windowSize
.
Example Usage
Let’s set up a bunch of buckets to add into our exponential histogram. Each bucket tracks a delta and a timestamp. This example uses the same number for both, for simplicity.
import com.twitter.algebird.ExpHist
// import com.twitter.algebird.ExpHist
import ExpHist.{ Bucket, Config, Timestamp }
// import ExpHist.{Bucket, Config, Timestamp}
val maxTimestamp = 200
// maxTimestamp: Int = 200
val inputs = (1 to maxTimestamp).map {
i => ExpHist.Bucket(i, Timestamp(i))
}.toVector
// inputs: Vector[com.twitter.algebird.ExpHist.Bucket] = Vector(Bucket(1,Timestamp(1)), Bucket(2,Timestamp(2)), Bucket(3,Timestamp(3)), Bucket(4,Timestamp(4)), Bucket(5,Timestamp(5)), Bucket(6,Timestamp(6)), Bucket(7,Timestamp(7)), Bucket(8,Timestamp(8)), Bucket(9,Timestamp(9)), Bucket(10,Timestamp(10)), Bucket(11,Timestamp(11)), Bucket(12,Timestamp(12)), Bucket(13,Timestamp(13)), Bucket(14,Timestamp(14)), Bucket(15,Timestamp(15)), Bucket(16,Timestamp(16)), Bucket(17,Timestamp(17)), Bucket(18,Timestamp(18)), Bucket(19,Timestamp(19)), Bucket(20,Timestamp(20)), Bucket(21,Timestamp(21)), Bucket(22,Timestamp(22)), Bucket(23,Timestamp(23)), Bucket(24,Timestamp(24)), Bucket(25,Timestamp(25)), Bucket(26,Timestamp(26)), Bucket(27,Timestamp(27)), Bucket(28,Timestamp(28)), Bucket(29,Timestamp(29)), ...
val actualSum = inputs.map(_.size).sum
// actualSum: Long = 20100
Now we’ll configure an instance of ExpHist
to track the count and add each of our buckets in.
val epsilon = 0.01
// epsilon: Double = 0.01
val windowSize = maxTimestamp
// windowSize: Int = 200
val eh = ExpHist.empty(Config(epsilon, windowSize))
// eh: com.twitter.algebird.ExpHist = ExpHist(Config(0.01,200),Vector(),0,Timestamp(0))
val full = inputs.foldLeft(eh) {
case (histogram, Bucket(delta, timestamp)) => histogram.add(delta, timestamp)
}
// full: com.twitter.algebird.ExpHist = ExpHist(Config(0.01,200),Vector(Bucket(1,Timestamp(200)), Bucket(1,Timestamp(200)), Bucket(1,Timestamp(200)), Bucket(1,Timestamp(200)), Bucket(1,Timestamp(200)), Bucket(1,Timestamp(200)), Bucket(1,Timestamp(200)), Bucket(1,Timestamp(200)), Bucket(1,Timestamp(200)), Bucket(1,Timestamp(200)), Bucket(1,Timestamp(200)), Bucket(1,Timestamp(200)), Bucket(1,Timestamp(200)), Bucket(1,Timestamp(200)), Bucket(1,Timestamp(200)), Bucket(1,Timestamp(200)), Bucket(1,Timestamp(200)), Bucket(1,Timestamp(200)), Bucket(1,Timestamp(200)), Bucket(1,Timestamp(200)), Bucket(1,Timestamp(200)), Bucket(1,Timestamp(200)), Bucket(1,Timestamp(200)), Bucket(1,Timestamp(200)), Bucket(1,Timestamp(200)), Bucket(1,Timestamp(200)), Bucket(1,Timestamp(200)), Bucket(1,Timestamp(200)), ...
Now we can query the full exponential histogram and compare the guess to the actual sum:
val approximateSum = full.guess
// approximateSum: Double = 19972.5
full.relativeError
// res0: Double = 0.006424792139077854
val maxError = actualSum * full.relativeError
// maxError: Double = 129.13832199546485
assert(full.guess <= actualSum + maxError)
assert(full.guess >= actualSum - maxError)
l-Canonical Representation
The exponential histogram algorithm tracks buckets of size 2^i
. Every new increment to the histogram adds a bucket of size 1.
Because only l
or l+1
buckets of size 2^i
are allowed for each i
, this increment might trigger an incremental merge of smaller buckets into larger buckets.
Let’s look at 10 steps of the algorithm with l == 2
:
1: 1 (1 added)
2: 1 1 (1 added)
3: 1 1 1 (1 added)
4: 1 1 2 (1 added, triggering a 1 + 1 = 2 merge)
5: 1 1 1 2 (1 added)
6: 1 1 2 2 (1 added, triggering a 1 + 1 = 2 merge)
7: 1 1 1 2 2 (1 added)
8: 1 1 2 2 2 (1 added, triggering a 1 + 1 = 2 merge AND a 2 + 2 = 4 merge)
9: 1 1 1 2 2 2 (1 added)
10: 1 1 2 2 4 (1 added, triggering a 1 + 1 = 2 merge AND a 2 + 2 = 4 merge)
Notice that the bucket sizes always sum to the algorithm step, ie 10 == 1 + 1 + 1 + 2 + 2 + 4
.
Now let’s write out a list of the number of buckets of each size, ie [bucketsOfSize(1), bucketsOfSize(2), bucketsOfSize(4), ....]
. Here’s the above sequence in the new representation, plus a few more steps:
1: 1 <-- (l + 1)2^0 - l = 3 * 2^0 - 2 = 1
2: 2
3: 3
4: 2 1 <-- (l + 1)2^1 - l = 3 * 2^0 - 2 = 4
5: 3 1
6: 2 2
7: 3 2
8: 2 3
9: 3 3
10: 2 2 1 <-- (l + 1)2^2 - l = 3 * 2^0 - 2 = 10
11: 3 2 1
12: 2 3 1
13: 3 3 1
14: 2 2 2
15: 3 2 2
16: 2 3 2
16: 3 3 2
17: 2 2 3
This sequence is called the “l-canonical representation” of s
.
A pattern emerges! Every bucket size except the largest looks like a binary counter (if you added l + 1
to the bit, and made the counter little-endian). Let’s call this the “binary” prefix, or bin(_)
.
Here’s the above sequence with the prefix decoded from “binary”:
1: 1 <-- (l + 1)2^0 - l = 3 * 2^0 - 2 = 1
2: 2
3: 3
4: bin(0) 1 <-- (l + 1)2^1 - l = 3 * 2^0 - 2 = 4
5: bin(1) 1
6: bin(0) 2
7: bin(1) 2
8: bin(0) 3
9: bin(1) 3
10: bin(0) 1 <-- (l + 1)2^2 - l = 3 * 2^0 - 2 = 10
11: bin(1) 1
12: bin(2) 1
13: bin(3) 1
14: bin(0) 2
15: bin(1) 2
16: bin(2) 2
16: bin(3) 2
17: bin(0) 3
Some observations about the pattern:
The l-canonical representation groups the natural numbers into groups of size (l + 1)2^i
for i >= 0
.
Each group starts at (l + 1)2^i - l
(see 1, 4, 10… above)
Within each group, the “binary” prefix of the l-canonical rep cycles from 0
to (2^i - 1)
, l + 1
total times. (This makes sense; each cycle increments the final entry by one until it hits l + 1
; after that an increment triggers a merge and a new “group” begins.)
The final l-canonical entry == floor((position within the group) / 2^i)
, or the “quotient” of that position and 2^i
.
That’s all we need to know to write a procedure to generate the l-canonical representation! Here it is again:
L-Canonical Representation Procedure
- Find the largest
j
s.t.2^j <= (s + l) / (1 + l)
- let
s' := 2^j(1 + l) - l
(s'
is the position if the start of a group, ie 1, 4, 10…)
diff := (s - s')
is the position of s within that group.- let
b :=
the little-endian binary rep ofdiff % (2^j - 1)
- let
ret :=
return vector of lengthj
:
(0 until j).map { i => ret(i) = b(i) + l }
ret(j) = math.floor(diff / 2^j)