Use Algebird Aggregator to do the reduction
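For example, a minimal sketch (the pipe `scores` and its contents are invented for illustration; `Aggregator.max` is one of Algebird's prebuilt aggregators):

```scala
import com.twitter.algebird.Aggregator
import com.twitter.scalding._

val scores: TypedPipe[(String, Double)] =
  TypedPipe.from(Seq(("alice", 3.0), ("alice", 5.0), ("bob", 4.0)))

// For each key, keep the maximum value, expressed as an Algebird Aggregator.
val maxPerUser: TypedPipe[(String, Double)] =
  scores.group
    .aggregate(Aggregator.max[Double])
    .toTypedPipe
```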
This does the partial heap sort followed by take in memory on the mappers before sending to the reducers. This is a big help if there are relatively few keys and n is relatively small.
"Smaller" is about the average number of values per key, not total size (total size does not matter directly, but it is clearly related).
Note that from the type signature we see that the right side is (or may be) iterated over and over, but the left side is not. That means you want the side with fewer values per key on the right. If both sides are similar, there is no need to worry. If one side is a one-to-one mapping, that should be the "smaller" side.
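A sketch of that guideline with hypothetical pipes, where the right side is the one-to-one mapping:

```scala
import com.twitter.scalding._

// Many events per user on the left, at most one profile per user on the right.
val events: TypedPipe[(Long, String)] =
  TypedPipe.from(Seq((1L, "click"), (1L, "view"), (2L, "click")))
val profiles: TypedPipe[(Long, String)] =
  TypedPipe.from(Seq((1L, "alice"), (2L, "bob")))

// Put the side with fewer values per key on the right of the join.
val joined: TypedPipe[(Long, (String, String))] =
  events.group.join(profiles.group).toTypedPipe
```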
For each key, count the number of values that satisfy a predicate
For each key, give the number of unique values. WARNING: May OOM. This assumes the values for each key can fit in memory.
For each key, remove duplicate values. WARNING: May OOM. This assumes the values for each key can fit in memory.
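A sketch of these per-key operations (the pipe and data are invented; the method names assumed here are count, distinctSize, and distinctValues):

```scala
import com.twitter.scalding._

val purchases: TypedPipe[(String, String)] =
  TypedPipe.from(Seq(("alice", "book"), ("alice", "book"), ("alice", "pen"), ("bob", "pen")))
val grouped = purchases.group

// Number of values per key that satisfy a predicate.
val booksPerUser: TypedPipe[(String, Long)] = grouped.count(_ == "book").toTypedPipe

// Number of unique values per key (the values for a key must fit in memory).
val distinctItems: TypedPipe[(String, Long)] = grouped.distinctSize.toTypedPipe

// Duplicate values removed per key (same memory caveat).
val dedupedItems: TypedPipe[(String, String)] = grouped.distinctValues.toTypedPipe
```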
For each key, selects all elements except the first n.
For each key, drops the longest prefix of elements that satisfy the given predicate.
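A small sketch (invented data; the values are sorted first so that "first n" and "prefix" are well defined):

```scala
import com.twitter.scalding._

val readings: TypedPipe[(String, Int)] =
  TypedPipe.from(Seq(("sensorA", 1), ("sensorA", 7), ("sensorA", 9), ("sensorB", 2)))

// Sort the values within each key, then drop the first reading...
val withoutFirst = readings.group.sorted.drop(1).toTypedPipe

// ...or drop the leading values that are below a threshold.
val aboveThreshold = readings.group.sorted.dropWhile(_ < 5).toTypedPipe
```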
.filter(fn).toTypedPipe == .toTypedPipe.filter(fn). It is generally better to avoid going back to a TypedPipe as long as possible: this minimizes the times we go in and out of cascading/hadoop types.
Filter keys on a predicate. More efficient than filter if you are only looking at keys.
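For instance, with a hypothetical word-count pipe:

```scala
import com.twitter.scalding._

val wordCounts: TypedPipe[(String, Long)] =
  TypedPipe.from(Seq(("the", 100L), ("scalding", 3L), ("hadoop", 7L)))

// filterKeys only inspects the key, so it is cheaper than filter on the whole pair.
val noStopWords = wordCounts.group.filterKeys(_ != "the").toTypedPipe

// filter sees the (key, value) pair; the result is the same whether it is
// applied before or after toTypedPipe.
val frequent = wordCounts.group.filter { case (_, count) => count > 5L }.toTypedPipe
```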
Similar to mapValues, but works like flatMap, returning a collection of outputs for each value input.
Flatten the values. Useful after sortedTake, for instance.
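A sketch of both (invented pipes; the second part flattens the Seq produced per key by sortedTake):

```scala
import com.twitter.scalding._

val sentences: TypedPipe[(String, String)] =
  TypedPipe.from(Seq(("doc1", "hello world"), ("doc2", "hello scalding")))

// flatMapValues: one input value may produce many output values.
val words = sentences.group.flatMapValues(_.split("\\s+").toList).toTypedPipe

// flattenValues: flatten values that are already collections,
// e.g. the Seq[Int] that sortedTake produces for each key.
val scores: TypedPipe[(String, Int)] =
  TypedPipe.from(Seq(("doc1", 3), ("doc1", 9), ("doc2", 5)))
val topTwoFlat: TypedPipe[(String, Int)] =
  scores.group.sortedTake(2).flattenValues.toTypedPipe
```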
Folds are composable aggregations that make one pass over the data. If you need to do several custom folds over the same data, use Fold.join and this method
For each key, fold the values. See scala.collection.Iterable.foldLeft.
If the fold depends on the key, use this method to construct the fold for each key
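A minimal foldLeft sketch (invented data); composing several Algebird Folds with Fold.join and passing the result to fold follows the same shape:

```scala
import com.twitter.scalding._

val transactions: TypedPipe[(String, Double)] =
  TypedPipe.from(Seq(("alice", 10.0), ("alice", -2.5), ("bob", 4.0)))

// For each key, fold the values with an initial value, like Scala's foldLeft.
val balances: TypedPipe[(String, Double)] =
  transactions.group
    .foldLeft(0.0) { (running, amount) => running + amount }
    .toTypedPipe
```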
For each key, check to see if a predicate is true for all values.
This is just shorthand for mapValueStream(identity); it makes sure the planner sees that you want to force a shuffle. For expert tuning.
This fully replicates this entire Grouped to the argument: mapside. This means that we never see the case where the key is absent in the pipe, so implementing a right join (from the pipe) is impossible. Note that there is no reduce phase in this operation. The other caveat is that, unlike a cogroup, for a fixed key each joiner will NOT see all the tuples with that key, because the keys on the left are distributed across many machines. See HashJoin: http://docs.cascading.org/cascading/2.0/javadoc/cascading/pipe/HashJoin.html
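A sketch of the map-side join, assuming the usual pipe.hashJoin(small.group) idiom (the pipes are invented; the grouped argument is the side that gets replicated):

```scala
import com.twitter.scalding._

// A large pipe of events and a small lookup table that fits in memory.
val events: TypedPipe[(Int, String)] =
  TypedPipe.from(Seq((1, "click"), (2, "view"), (1, "view")))
val countries: TypedPipe[(Int, String)] =
  TypedPipe.from(Seq((1, "US"), (2, "CA")))

// The right side is replicated to the mappers; there is no reduce phase.
val joined: TypedPipe[(Int, (String, String))] =
  events.hashJoin(countries.group)
```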
Use this to get the first value encountered. Prefer this to take(1).
A HashJoinable has a single input into the cogroup.
This is just an identity that casts the result to V1
Convert to a TypedPipe and only keep the keys
Operate on an Iterator[T] of all the values for each key at one time. Prefer this to toList when you can avoid accumulating the whole list in memory. Prefer sum, which is partially executed map-side by default. Use mapValueStream when you don't care about the key for the group.
The Iterator is always non-empty. Note that any key that has all of its values removed will not appear in a subsequent .mapGroup/.mapValueStream.
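For example, pairing each value with its position in the stream for that key (invented pipe):

```scala
import com.twitter.scalding._

val scores: TypedPipe[(String, Int)] =
  TypedPipe.from(Seq(("alice", 3), ("alice", 9), ("bob", 5)))

// Stream over all the values for a key without materializing them;
// the key is available here even though this example ignores it.
val numbered: TypedPipe[(String, (Int, Int))] =
  scores.group
    .mapGroup { (key, values) => values.zipWithIndex.map { case (v, i) => (i, v) } }
    .toTypedPipe
```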
Use this when you don't care about the key for the group, otherwise use mapGroup
This is a special case of mapValueStream, but it can be optimized because it doesn't need all the values for a given key at once. An unoptimized implementation is mapValueStream { _.map { fn } }, but for Grouped we can avoid resorting to mapValueStream.
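The difference in a small sketch (invented pipe):

```scala
import com.twitter.scalding._

val words: TypedPipe[(String, String)] =
  TypedPipe.from(Seq(("doc1", "Hello"), ("doc1", "World"), ("doc2", "Scalding")))

// mapValues works one value at a time, so it can run map-side before the shuffle.
val lowered = words.group.mapValues(_.toLowerCase).toTypedPipe

// mapValueStream sees the whole Iterator of values for a key (reduce-side only).
val firstTwo = words.group.mapValueStream(_.take(2)).toTypedPipe
```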
Note, this satisfies KeyedPipe.mapped: TypedPipe[(K, Any)]
For each key, give the maximum value
For each key, give the maximum value by some function
For each key, give the minimum value
For each key, give the minimum value by some function
For each key, return the product of all the values
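A sketch of the per-key extrema and product (the case class and pipes are invented; product needs an Algebird Ring on the value type):

```scala
import com.twitter.scalding._

case class Purchase(item: String, price: Double)

val purchases: TypedPipe[(String, Purchase)] =
  TypedPipe.from(Seq(("alice", Purchase("book", 12.0)), ("alice", Purchase("pen", 2.0))))

// For each user: the most and least expensive purchase.
val priciest = purchases.group.maxBy(_.price).toTypedPipe
val cheapest = purchases.group.minBy(_.price).toTypedPipe

// product multiplies all the values for a key using the implicit Ring.
val factors: TypedPipe[(String, Int)] = TypedPipe.from(Seq(("a", 2), ("a", 3), ("b", 5)))
val products: TypedPipe[(String, Int)] = factors.group.product.toTypedPipe
```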
Reduce with fn, which must be associative and commutative. Like the above, this can be optimized in some Grouped cases. If you don't have a commutative operator, use reduceLeft.
Similar to reduce, but always on the reduce side (never optimized to map-side), and named for the Scala function. fn need not be associative and/or commutative. Makes sense when you want to reduce in a particular sorted order; the old value comes in on the left.
For each key, scanLeft the values. See scala.collection.Iterable.scanLeft.
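For instance (invented pipe; reduce uses a commutative function, while scanLeft runs over the sorted value stream):

```scala
import com.twitter.scalding._

val temps: TypedPipe[(String, Double)] =
  TypedPipe.from(Seq(("nyc", 20.0), ("nyc", 25.0), ("sf", 15.0)))

// reduce: the function is associative and commutative, so part of it can run map-side.
val hottest = temps.group.reduce { (a, b) => math.max(a, b) }.toTypedPipe

// scanLeft: a running aggregation over the sorted values of each key.
val runningMax =
  temps.group.sorted.scanLeft(Double.MinValue) { (acc, t) => math.max(acc, t) }.toTypedPipe
```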
For each key, give the number of values
Like the above, but with a less-than operation for the ordering.
Take the largest k things according to the implicit ordering. Useful for top-k without having to call ord.reverse
This implements bottom-k (smallest k items) on each mapper for each key, then sends those to reducers to get the result. This is faster than using .take if k * (number of Keys) is small enough to fit in memory.
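A sketch with invented click counts:

```scala
import com.twitter.scalding._

val clicks: TypedPipe[(String, Int)] =
  TypedPipe.from(Seq(("page1", 7), ("page1", 3), ("page1", 9), ("page2", 5)))

// Bottom-k per key: the partial heap sort happens on the mappers, so this works
// well when k * (number of keys) fits in memory.
val smallest2: TypedPipe[(String, Seq[Int])] = clicks.group.sortedTake(2).toTypedPipe

// Top-k per key, without reversing the Ordering yourself.
val largest2: TypedPipe[(String, Seq[Int])] = clicks.group.sortedReverseTake(2).toTypedPipe
```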
Add all items according to the implicit Semigroup. If there is no sorting, we default to assuming the Semigroup is commutative. If you don't want that, define an ordering on the values, sort, or use .forceToReducers.
Semigroups MAY have a faster implementation of sum for iterators, so prefer using sum/sumLeft to reduce/reduceLeft
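The classic example is word count (a sketch; the implicit Semigroup[Long] comes from Algebird):

```scala
import com.twitter.scalding._

val wordCounts: TypedPipe[(String, Long)] =
  TypedPipe.from(Seq(("hello", 1L), ("world", 1L), ("hello", 1L)))

// Addition is commutative, so sum can be partially applied map-side before the shuffle.
val totals: TypedPipe[(String, Long)] = wordCounts.group.sum.toTypedPipe
```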
For each key, selects the first n elements. Don't use this if n == 1; head is faster in that case.
For each key, takes the longest prefix of elements that satisfy the given predicate.
AVOID THIS IF POSSIBLE. For each key, accumulate all the values into a List. WARNING: May OOM. Only use this method if you are sure all the values will fit in memory. You really should ask why you need all the values; if you want to do some custom reduction, do it in mapGroup or mapValueStream.
AVOID THIS IF POSSIBLE. Same risks apply here as to toList: you may OOM. See toList. Note that toSet needs to be parameterized even though toList does not. This is because List is covariant in its type parameter in the Scala API, but Set is invariant. See: http://stackoverflow.com/questions/676615/why-is-scalas-immutable-set-not-covariant-in-its-type
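If you really do need the materialized collections, a sketch (invented pipe):

```scala
import com.twitter.scalding._

val tags: TypedPipe[(Int, String)] =
  TypedPipe.from(Seq((1, "scala"), (1, "hadoop"), (2, "scala")))

// Only safe when the values for each key comfortably fit in memory.
val tagList: TypedPipe[(Int, List[String])] = tags.group.toList.toTypedPipe

// toSet must be given its type parameter explicitly (Set is invariant).
val tagSet: TypedPipe[(Int, Set[String])] = tags.group.toSet[String].toTypedPipe
```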
End of the operations on values. From this point on the keyed structure is lost, and another shuffle is generally required to reconstruct it.
Convert to a TypedPipe and only keep the values
never mutates this, instead returns a new item.
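For example, with the withReducers tuning method (a sketch; the pipe is invented):

```scala
import com.twitter.scalding._

val grouped = TypedPipe.from(Seq(("a", 1L), ("b", 2L))).group

// withReducers does not mutate `grouped`; it returns a new value with the setting applied.
val tuned = grouped.withReducers(100)
val counts = tuned.sum.toTypedPipe
```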