Class

com.twitter.scalding

CoGroupBuilder

Related Doc: package scalding

Permalink

class CoGroupBuilder extends GroupBuilder

Builder classes used internally to implement coGroups (joins). Can also be used for more generalized joins, e.g., star joins.

Source
CoGroupBuilder.scala
Linear Supertypes
Type Hierarchy
Ordering
  1. Alphabetic
  2. By Inheritance
Inherited
  1. CoGroupBuilder
  2. GroupBuilder
  3. StreamOperations
  4. FoldOperations
  5. Sortable
  6. ReduceOperations
  7. Serializable
  8. AnyRef
  9. Any
  1. Hide All
  2. Show All
Visibility
  1. Public
  2. All

Instance Constructors

  1. new CoGroupBuilder(groupFields: Fields, joinMode: JoinMode)

    Permalink

Value Members

  1. final def !=(arg0: Any): Boolean

    Permalink
    Definition Classes
    AnyRef → Any
  2. final def ##(): Int

    Permalink
    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean

    Permalink
    Definition Classes
    AnyRef → Any
  4. def aggregate[A, B, C](fieldDef: (Fields, Fields))(ag: Aggregator[A, B, C])(implicit startConv: TupleConverter[A], middleSetter: TupleSetter[B], middleConv: TupleConverter[B], endSetter: TupleSetter[C]): GroupBuilder

    Permalink

    Pretty much a synonym for mapReduceMap with the methods collected into a trait.

    Pretty much a synonym for mapReduceMap with the methods collected into a trait.

    Definition Classes
    ReduceOperations
  5. def approximateUniqueCount[T](f: (Fields, Fields), errPercent: Double = 1.0)(implicit arg0: (T) ⇒ Array[Byte], arg1: TupleConverter[T]): GroupBuilder

    Permalink

    Approximate number of unique values We use about m = (104/errPercent)^2 bytes of memory per key Uses .toString.getBytes to serialize the data so you MUST ensure that .toString is an equivalance on your counted fields (i.e. x.toString == y.toString if and only if x == y)

    Approximate number of unique values We use about m = (104/errPercent)^2 bytes of memory per key Uses .toString.getBytes to serialize the data so you MUST ensure that .toString is an equivalance on your counted fields (i.e. x.toString == y.toString if and only if x == y)

    For each key:

    10% error ~ 256 bytes
    5% error ~ 1kB
    2% error ~ 4kB
    1% error ~ 16kB
    0.5% error ~ 64kB
    0.25% error ~ 256kB
    Definition Classes
    ReduceOperations
  6. final def asInstanceOf[T0]: T0

    Permalink
    Definition Classes
    Any
  7. def average(f: Symbol): GroupBuilder

    Permalink
    Definition Classes
    ReduceOperations
  8. def average(f: (Fields, Fields)): GroupBuilder

    Permalink

    uses a more stable online algorithm which should be suitable for large numbers of records

    uses a more stable online algorithm which should be suitable for large numbers of records

    Similar To

    http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm

    Definition Classes
    ReduceOperations
  9. def buffer(args: Fields)(b: Buffer[_]): GroupBuilder

    Permalink

    This may significantly reduce performance of your job.

    Warning

    This may significantly reduce performance of your job. It kills the ability to do map-side aggregation.

    Definition Classes
    GroupBuilder
  10. def clone(): AnyRef

    Permalink
    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  11. def coGroup(f: Fields, p: Pipe, j: JoinMode = InnerJoinMode): CoGroupBuilder

    Permalink
  12. var coGroups: List[(Fields, Pipe, JoinMode)]

    Permalink
    Attributes
    protected
  13. def count[T](fieldDef: (Fields, Fields))(fn: (T) ⇒ Boolean)(implicit arg0: TupleConverter[T]): GroupBuilder

    Permalink

    This is count with a predicate: only counts the tuples for which fn(tuple) is true

    This is count with a predicate: only counts the tuples for which fn(tuple) is true

    Definition Classes
    ReduceOperations
  14. def dot[T](left: Fields, right: Fields, result: Fields)(implicit ttconv: TupleConverter[(T, T)], ring: Ring[T], tconv: TupleConverter[T], tset: TupleSetter[T]): GroupBuilder

    Permalink

    First do "times" on each pair, then "plus" them all together.

    First do "times" on each pair, then "plus" them all together.

    Example

    groupBy('x) { _.dot('y,'z, 'ydotz) }
    Definition Classes
    ReduceOperations
  15. def drop(cnt: Int): GroupBuilder

    Permalink

    Remove the first cnt elements

    Remove the first cnt elements

    Definition Classes
    StreamOperations
  16. def dropWhile[T](f: Fields)(fn: (T) ⇒ Boolean)(implicit conv: TupleConverter[T]): GroupBuilder

    Permalink

    Drop while the predicate is true, starting at the first false, output all

    Drop while the predicate is true, starting at the first false, output all

    Definition Classes
    StreamOperations
  17. final def eq(arg0: AnyRef): Boolean

    Permalink
    Definition Classes
    AnyRef
  18. def equals(arg0: Any): Boolean

    Permalink
    Definition Classes
    AnyRef → Any
  19. def every(ev: (Pipe) ⇒ Every): GroupBuilder

    Permalink

    Prefer aggregateBy operations!

    Prefer aggregateBy operations!

    Definition Classes
    GroupBuilder
  20. var evs: List[(Pipe) ⇒ Every]

    Permalink

    This is the description of this Grouping in terms of a sequence of Every operations

    This is the description of this Grouping in terms of a sequence of Every operations

    Attributes
    protected
    Definition Classes
    GroupBuilder
  21. def finalize(): Unit

    Permalink
    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  22. def foldLeft[X, T](fieldDef: (Fields, Fields))(init: X)(fn: (X, T) ⇒ X)(implicit setter: TupleSetter[X], conv: TupleConverter[T]): GroupBuilder

    Permalink

    Prefer reduce or mapReduceMap.

    Prefer reduce or mapReduceMap. foldLeft will force all work to be done on the reducers. If your function is not associative and commutative, foldLeft may be required.

    Best Practice

    Make sure init is an immutable object.

    Note

    Init needs to be serializable with Kryo (because we copy it for each grouping to avoid possible errors using a mutable init object).

    Definition Classes
    GroupBuilderFoldOperations
  23. def forall[T](fieldDef: (Fields, Fields))(fn: (T) ⇒ Boolean)(implicit arg0: TupleConverter[T]): GroupBuilder

    Permalink
    Definition Classes
    ReduceOperations
  24. def forceToReducers: GroupBuilder

    Permalink

    This cancels map side aggregation and forces everything to the reducers

    This cancels map side aggregation and forces everything to the reducers

    Definition Classes
    GroupBuilder
  25. final def getClass(): Class[_]

    Permalink
    Definition Classes
    AnyRef → Any
  26. val groupFields: Fields

    Permalink
    Definition Classes
    GroupBuilder
  27. def groupMode: GroupMode

    Permalink
    Definition Classes
    GroupBuilder
  28. def groupedPipeOf(name: String, in: Pipe): GroupBy

    Permalink
    Attributes
    protected
    Definition Classes
    GroupBuilder
  29. def hashCode(): Int

    Permalink
    Definition Classes
    AnyRef → Any
  30. def head(f: Symbol*): GroupBuilder

    Permalink
    Definition Classes
    ReduceOperations
  31. def head(fd: (Fields, Fields)): GroupBuilder

    Permalink

    Return the first, useful probably only for sorted case.

    Return the first, useful probably only for sorted case.

    Definition Classes
    ReduceOperations
  32. def histogram(f: (Fields, Fields), binWidth: Double = 1.0): GroupBuilder

    Permalink
    Definition Classes
    ReduceOperations
  33. def hyperLogLog[T](f: (Fields, Fields), errPercent: Double = 1.0)(implicit arg0: (T) ⇒ Array[Byte], arg1: TupleConverter[T]): GroupBuilder

    Permalink
    Definition Classes
    ReduceOperations
  34. final def isInstanceOf[T0]: Boolean

    Permalink
    Definition Classes
    Any
  35. var isReversed: Boolean

    Permalink
    Attributes
    protected
    Definition Classes
    GroupBuilder
  36. def last(f: Symbol*): GroupBuilder

    Permalink
    Definition Classes
    ReduceOperations
  37. def last(fd: (Fields, Fields)): GroupBuilder

    Permalink
    Definition Classes
    ReduceOperations
  38. def mapList[T, R](fieldDef: (Fields, Fields))(fn: (List[T]) ⇒ R)(implicit conv: TupleConverter[T], setter: TupleSetter[R]): GroupBuilder

    Permalink

    Collect all the values into a List[T] and then operate on that list.

    Collect all the values into a List[T] and then operate on that list. This fundamentally uses as much memory as it takes to store the list. This gives you the list in the reverse order it was encounted (it is built as a stack for efficiency reasons). If you care about order, call .reverse in your fn

    STRONGLY PREFER TO AVOID THIS. Try reduce or plus and an O(1) memory algorithm.

    Definition Classes
    FoldOperationsReduceOperations
  39. def mapPlusMap[T, X, U](fieldDef: (Fields, Fields))(mapfn: (T) ⇒ X)(mapfn2: (X) ⇒ U)(implicit startConv: TupleConverter[T], middleSetter: TupleSetter[X], middleConv: TupleConverter[X], endSetter: TupleSetter[U], sgX: Semigroup[X]): GroupBuilder

    Permalink
    Definition Classes
    ReduceOperations
  40. def mapReduceMap[T, X, U](fieldDef: (Fields, Fields))(mapfn: (T) ⇒ X)(redfn: (X, X) ⇒ X)(mapfn2: (X) ⇒ U)(implicit startConv: TupleConverter[T], middleSetter: TupleSetter[X], middleConv: TupleConverter[X], endSetter: TupleSetter[U]): GroupBuilder

    Permalink

    Type T is the type of the input field (input to map, T => X)

    Type T is the type of the input field (input to map, T => X)

    Type X is the intermediate type, which your reduce function operates on (reduce is (X,X) => X)

    Type U is the final result type, (final map is: X => U)

    The previous output goes into the reduce function on the left, like foldLeft, so if your operation is faster for the accumulator to be on one side, be aware.

    Definition Classes
    GroupBuilderReduceOperations
  41. def mapStream[T, X](fieldDef: (Fields, Fields))(mapfn: (Iterator[T]) ⇒ TraversableOnce[X])(implicit conv: TupleConverter[T], setter: TupleSetter[X]): GroupBuilder

    Permalink

    Corresponds to a Cascading Buffer which allows you to stream through the data, keeping some, dropping, scanning, etc...

    Corresponds to a Cascading Buffer which allows you to stream through the data, keeping some, dropping, scanning, etc... The iterator you are passed is lazy, and mapping will not trigger the entire evaluation. If you convert to a list (i.e. to reverse), you need to be aware that memory constraints may become an issue.

    Warning

    Any fields not referenced by the input fields will be aligned to the first output, and the final hadoop stream will have a length of the maximum of the output of this, and the input stream. So, if you change the length of your inputs, the other fields won't be aligned. YOU NEED TO INCLUDE ALL THE FIELDS YOU WANT TO KEEP ALIGNED IN THIS MAPPING! POB: This appears to be a Cascading design decision.

    Warning

    mapfn needs to be stateless. Multiple calls needs to be safe (no mutable state captured)

    Definition Classes
    GroupBuilderStreamOperations
  42. def max(f: Symbol*): GroupBuilder

    Permalink
    Definition Classes
    ReduceOperations
  43. def max(fieldDef: (Fields, Fields)): GroupBuilder

    Permalink
    Definition Classes
    ReduceOperations
  44. def min(f: Symbol*): GroupBuilder

    Permalink
    Definition Classes
    ReduceOperations
  45. def min(fieldDef: (Fields, Fields)): GroupBuilder

    Permalink
    Definition Classes
    ReduceOperations
  46. def mkString(fieldDef: Symbol): GroupBuilder

    Permalink
    Definition Classes
    ReduceOperations
  47. def mkString(fieldDef: Symbol, sep: String): GroupBuilder

    Permalink
    Definition Classes
    ReduceOperations
  48. def mkString(fieldDef: Symbol, start: String, sep: String, end: String): GroupBuilder

    Permalink

    these will only be called if a tuple is not passed, meaning just one column

    these will only be called if a tuple is not passed, meaning just one column

    Definition Classes
    ReduceOperations
  49. def mkString(fieldDef: (Fields, Fields)): GroupBuilder

    Permalink
    Definition Classes
    ReduceOperations
  50. def mkString(fieldDef: (Fields, Fields), sep: String): GroupBuilder

    Permalink
    Definition Classes
    ReduceOperations
  51. def mkString(fieldDef: (Fields, Fields), start: String, sep: String, end: String): GroupBuilder

    Permalink

    Similar to the scala.collection.Iterable.mkString takes the source and destination fieldname, which should be a single field.

    Similar to the scala.collection.Iterable.mkString takes the source and destination fieldname, which should be a single field. The result will be start, each item.toString separated by sep, followed by end for convenience there several common variants below

    Definition Classes
    ReduceOperations
  52. final def ne(arg0: AnyRef): Boolean

    Permalink
    Definition Classes
    AnyRef
  53. final def notify(): Unit

    Permalink
    Definition Classes
    AnyRef
  54. final def notifyAll(): Unit

    Permalink
    Definition Classes
    AnyRef
  55. def overrideDescription(p: Pipe): Pipe

    Permalink
    Attributes
    protected
    Definition Classes
    GroupBuilder
  56. def overrideReducers(p: Pipe): Pipe

    Permalink
    Attributes
    protected
    Definition Classes
    GroupBuilder
  57. def pass: GroupBuilder

    Permalink

    An identity function that keeps all the tuples.

    An identity function that keeps all the tuples. A hack to implement groupAll and groupRandomly.

    Definition Classes
    GroupBuilder
  58. def pivot(fieldDef: (Fields, Fields), defaultVal: Any = null): GroupBuilder

    Permalink

    Opposite of RichPipe.unpivot.

    Opposite of RichPipe.unpivot. See SQL/Excel for more on this function converts a row-wise representation into a column-wise one.

    Example

    pivot(('feature, 'value) -> ('clicks, 'impressions, 'requests))

    it will find the feature named "clicks", and put the value in the column with the field named clicks.

    Absent fields result in null unless a default value is provided. Unnamed output fields are ignored.

    Note

    Duplicated fields will result in an error.

    Hint

    if you want more precision, first do a

    map('value -> value) { x : AnyRef => Option(x) }

    and you will have non-nulls for all present values, and Nones for values that were present but previously null. All nulls in the final output will be those truly missing. Similarly, if you want to check if there are any items present that shouldn't be:

    map('feature -> 'feature) { fname : String =>
      if (!goodFeatures(fname)) { throw new Exception("ohnoes") }
      else fname
    }
    Definition Classes
    ReduceOperations
  59. def reduce[T](fieldDef: Symbol*)(fn: (T, T) ⇒ T)(implicit setter: TupleSetter[T], conv: TupleConverter[T]): GroupBuilder

    Permalink
    Definition Classes
    ReduceOperations
  60. def reduce[T](fieldDef: (Fields, Fields))(fn: (T, T) ⇒ T)(implicit setter: TupleSetter[T], conv: TupleConverter[T]): GroupBuilder

    Permalink

    Apply an associative/commutative operation on the left field.

    Apply an associative/commutative operation on the left field.

    Example

    reduce(('mass,'allids)->('totalMass, 'idset)) { (left:(Double,Set[Long]),right:(Double,Set[Long])) =>
      (left._1 + right._1, left._2 ++ right._2)
    }

    Equivalent to a mapReduceMap with trivial (identity) map functions.

    Assumed to be a commutative operation. If you don't want that, use .forceToReducers

    The previous output goes into the reduce function on the left, like foldLeft, so if your operation is faster for the accumulator to be on one side, be aware.

    Definition Classes
    ReduceOperations
  61. def reducers(r: Int): GroupBuilder

    Permalink

    Override the number of reducers used in the groupBy.

    Override the number of reducers used in the groupBy.

    Definition Classes
    GroupBuilder
  62. def reverse: GroupBuilder

    Permalink
    Definition Classes
    GroupBuilder
  63. def scanLeft[X, T](fieldDef: (Fields, Fields))(init: X)(fn: (X, T) ⇒ X)(implicit setter: TupleSetter[X], conv: TupleConverter[T]): GroupBuilder

    Permalink

    Analog of standard scanLeft (@see scala.collection.Iterable.scanLeft ) This invalidates map-side aggregation, forces all data to be transferred to reducers.

    Analog of standard scanLeft (@see scala.collection.Iterable.scanLeft ) This invalidates map-side aggregation, forces all data to be transferred to reducers. Use only if you REALLY have to.

    Best Practice

    Make sure init is an immutable object.

    Note

    init needs to be serializable with Kryo (because we copy it for each grouping to avoid possible errors using a mutable init object). We override the default implementation here to use Kryo to serialize the initial value, for immutable serializable inits, this is not needed

    Definition Classes
    GroupBuilderStreamOperations
  64. def schedule(name: String, pipe: Pipe): Pipe

    Permalink
    Definition Classes
    CoGroupBuilderGroupBuilder
  65. def setDescriptions(newDescriptions: Seq[String]): GroupBuilder

    Permalink

    Override the description to be used in .dot and MR step names.

    Override the description to be used in .dot and MR step names.

    Definition Classes
    GroupBuilder
  66. def size(thisF: Fields): GroupBuilder

    Permalink
    Definition Classes
    ReduceOperations
  67. def size: GroupBuilder

    Permalink

    How many values are there for this key

    How many values are there for this key

    Definition Classes
    ReduceOperations
  68. def sizeAveStdev(fieldDef: (Fields, Fields)): GroupBuilder

    Permalink

    Compute the count, ave and standard deviation in one pass example: g.sizeAveStdev('x -> ('cntx, 'avex, 'stdevx))

    Compute the count, ave and standard deviation in one pass example: g.sizeAveStdev('x -> ('cntx, 'avex, 'stdevx))

    Definition Classes
    ReduceOperations
  69. def sortBy(f: Fields): GroupBuilder

    Permalink

    This invalidates aggregateBy!

    This invalidates aggregateBy!

    Definition Classes
    GroupBuilderSortable
  70. var sortF: Option[Fields]

    Permalink
    Attributes
    protected
    Definition Classes
    GroupBuilder
  71. def sortWithTake[T](f: (Fields, Fields), k: Int)(lt: (T, T) ⇒ Boolean)(implicit arg0: TupleConverter[T]): GroupBuilder

    Permalink

    Equivalent to sorting by a comparison function then take-ing k items.

    Equivalent to sorting by a comparison function then take-ing k items. This is MUCH more efficient than doing a total sort followed by a take, since these bounded sorts are done on the mapper, so only a sort of size k is needed.

    Example

    sortWithTake( ('clicks, 'tweet) -> 'topClicks, 5) {
      fn : (t0 :(Long,Long), t1:(Long,Long) => t0._1 < t1._1 }

    topClicks will be a List[(Long,Long)]

    Definition Classes
    ReduceOperations
  72. def sortedReverseTake[T](f: (Fields, Fields), k: Int)(implicit conv: TupleConverter[T], ord: Ordering[T]): GroupBuilder

    Permalink

    Reverse of above when the implicit ordering makes sense.

    Reverse of above when the implicit ordering makes sense.

    Definition Classes
    ReduceOperations
  73. def sortedTake[T](f: (Fields, Fields), k: Int)(implicit conv: TupleConverter[T], ord: Ordering[T]): GroupBuilder

    Permalink

    Same as above but useful when the implicit ordering makes sense.

    Same as above but useful when the implicit ordering makes sense.

    Definition Classes
    ReduceOperations
  74. def sorting: Option[Fields]

    Permalink
    Definition Classes
    GroupBuilderSortable
  75. def spillThreshold(t: Int): GroupBuilder

    Permalink

    Override the spill threshold on AggregateBy

    Override the spill threshold on AggregateBy

    Definition Classes
    GroupBuilder
  76. def sum[T](fs: Symbol*)(implicit sg: Semigroup[T], tconv: TupleConverter[T], tset: TupleSetter[T]): GroupBuilder

    Permalink

    The same as sum(fs -> fs) Assumed to be a commutative operation.

    The same as sum(fs -> fs) Assumed to be a commutative operation. If you don't want that, use .forceToReducers

    Definition Classes
    ReduceOperations
  77. def sum[T](fd: (Fields, Fields))(implicit sg: Semigroup[T], tconv: TupleConverter[T], tset: TupleSetter[T]): GroupBuilder

    Permalink

    Use Semigroup.plus to compute a sum.

    Use Semigroup.plus to compute a sum. Not called sum to avoid conflicting with standard sum Your Semigroup[T] should be associated and commutative, else this doesn't make sense

    Assumed to be a commutative operation. If you don't want that, use .forceToReducers

    Definition Classes
    ReduceOperations
  78. final def synchronized[T0](arg0: ⇒ T0): T0

    Permalink
    Definition Classes
    AnyRef
  79. def take(cnt: Int): GroupBuilder

    Permalink

    Only keep the first cnt elements

    Only keep the first cnt elements

    Definition Classes
    StreamOperations
  80. def takeWhile[T](f: Fields)(fn: (T) ⇒ Boolean)(implicit conv: TupleConverter[T]): GroupBuilder

    Permalink

    Take while the predicate is true, stopping at the first false.

    Take while the predicate is true, stopping at the first false. Output all taken elements.

    Definition Classes
    StreamOperations
  81. def thenDo(fn: (GroupBuilder) ⇒ GroupBuilder): GroupBuilder

    Permalink

    This is convenience method to allow plugging in blocks of group operations similar to RichPipe.thenDo

    This is convenience method to allow plugging in blocks of group operations similar to RichPipe.thenDo

    Definition Classes
    GroupBuilder
  82. def times[T](fs: Symbol*)(implicit ring: Ring[T], tconv: TupleConverter[T], tset: TupleSetter[T]): GroupBuilder

    Permalink

    The same as times(fs -> fs)

    The same as times(fs -> fs)

    Definition Classes
    ReduceOperations
  83. def times[T](fd: (Fields, Fields))(implicit ring: Ring[T], tconv: TupleConverter[T], tset: TupleSetter[T]): GroupBuilder

    Permalink

    Returns the product of all the items in this grouping

    Returns the product of all the items in this grouping

    Definition Classes
    ReduceOperations
  84. def toList[T](fieldDef: (Fields, Fields))(implicit conv: TupleConverter[T]): GroupBuilder

    Permalink

    Convert a subset of fields into a list of Tuples.

    Convert a subset of fields into a list of Tuples. Need to provide the types of the tuple fields.

    Definition Classes
    ReduceOperations
  85. def toString(): String

    Permalink
    Definition Classes
    AnyRef → Any
  86. def using[C <: AnyRef { def release(): Unit }](bf: ⇒ C): AnyRef { def mapStream[T, X](fieldDef: (cascading.tuple.Fields, cascading.tuple.Fields))(mapfn: (C, Iterator[T]) => TraversableOnce[X])(implicit conv: com.twitter.scalding.TupleConverter[T],implicit setter: com.twitter.scalding.TupleSetter[X]): com.twitter.scalding.GroupBuilder }

    Permalink

    beginning of block with access to expensive nonserializable state.

    beginning of block with access to expensive nonserializable state. The state object should contain a function release() for resource management purpose.

    Definition Classes
    GroupBuilder
  87. final def wait(): Unit

    Permalink
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  88. final def wait(arg0: Long, arg1: Int): Unit

    Permalink
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  89. final def wait(arg0: Long): Unit

    Permalink
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )

Inherited from GroupBuilder

Inherited from StreamOperations[GroupBuilder]

Inherited from FoldOperations[GroupBuilder]

Inherited from Sortable[GroupBuilder]

Inherited from ReduceOperations[GroupBuilder]

Inherited from Serializable

Inherited from AnyRef

Inherited from Any

Ungrouped