Averaged Value
The AveragedValue
data structure keeps track of the count
and mean
of a stream of numbers with a single pass over the data. The mean calculation uses a numerically stable online algorithm suitable for large numbers of records, similar to Chan et. al.’s parallel variance algorithm on Wikipedia. As long as your count doesn’t overflow a Long
, the mean calculation won’t overflow.
You can build instances of AveragedValue
from any numeric type:
import com.twitter.algebird._
// import com.twitter.algebird._
val longVal = AveragedValue(3L)
// longVal: com.twitter.algebird.AveragedValue = AveragedValue(1,3.0)
val doubleVal = AveragedValue(12.0)
// doubleVal: com.twitter.algebird.AveragedValue = AveragedValue(1,12.0)
val intVal = AveragedValue(15)
// intVal: com.twitter.algebird.AveragedValue = AveragedValue(1,15.0)
Algebraic Properties
Combining instances with +
generates a new instance by adding the count
s and averaging the value
s:
longVal + doubleVal
// res0: com.twitter.algebird.AveragedValue = AveragedValue(2,7.5)
longVal + doubleVal + intVal
// res1: com.twitter.algebird.AveragedValue = AveragedValue(3,10.0)
You can also add numbers directly to an AveragedValue
instance:
longVal + 12
// res2: com.twitter.algebird.AveragedValue = AveragedValue(2,7.5)
AveragedValue
is a commutative group. This means you can add instances in any order:
longVal + doubleVal == doubleVal + doubleVal
// res3: Boolean = false
An AveragedValue
with a count and value of 0
act as Monoid.zero
:
Monoid.zero[AveragedValue]
// res4: com.twitter.algebird.AveragedValue = AveragedValue(0,0.0)
longVal + Monoid.zero[AveragedValue] == longVal
// res5: Boolean = true
Subtracting AveragedValue
s is the opposite of addition:
intVal - longVal
// res6: com.twitter.algebird.AveragedValue = AveragedValue(0,0.0)
intVal + doubleVal - doubleVal
// res7: com.twitter.algebird.AveragedValue = AveragedValue(1,15.0)
Stable Average Algorithm
Each AveragedValue
instance represents a stream of data. AveragedValue
uses Chan et. al.’s parallel algorithm to combine the mean values of each stream. Here’s the calculation:
// big and small are two AveragedValue instances
val deltaOfMeans = big.value - small.value
val newCount = big.count + small.count
val newMean = small.value + deltaOfMeans * (big.count / newCount)
As long as newCount
stays within the bounds of Long
, this calculation won’t overflow for large value
.
The Wikipedia presentation of this algorithm points out that if both streams are close in size, the algorithm is “prone to loss of precision due to catastrophic cancellation”.
AveragedValue
guards against this instability by taking a weighted average of the two instances if the smaller of the two has a count greater than 10%
of the combined count.
val newCount = big.count + small.count
(big.count * big.value + small.count * small.value) / newCount
Aggregator
AveragedValue.aggregator
returns an Aggregator
that uses AveragedValue
to calculate the mean of all Double
values in a stream. For example:
val items = List[Double](1.0, 2.2, 3.3, 4.4, 5.5)
// items: List[Double] = List(1.0, 2.2, 3.3, 4.4, 5.5)
AveragedValue.aggregator(items)
// res8: Double = 3.28
AveragedValue.numericAggregator
works the same way for any numeric type:
val items = List[Int](1, 3, 5, 7)
// items: List[Int] = List(1, 3, 5, 7)
AveragedValue.numericAggregator[Int].apply(items)
// res9: Double = 4.0
Related Data Structures
- DecayedValue calculates a decayed moving average of a stream; this allows you to approximate the mean of a bounded sliding window of the datastream.
- In addition to count and mean, Moments allows you to keep track of the standard deviation, skewness and kurtosis of a stream with one pass over the data.
Links
- Source: AveragedValue.scala
- Tests: AveragedValueLaws.scala
- Scaladoc: AveragedValue
Documentation Help
We’d love your help fleshing out this documentation! You can edit this page in your browser by clicking this link.