Class/Object

com.twitter.scalding

RichPipe

Related Docs: object RichPipe | package scalding

Permalink

class RichPipe extends Serializable with JoinAlgorithms

This is an enrichment-pattern class for cascading.pipe.Pipe. By convention, it is never used directly in input or return types; it exists only to add methods to Pipe.
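The methods below are typically reached through the implicit conversion from Pipe to RichPipe provided by the scalding Dsl (mixed into Job). A minimal sketch of a Job using a few of them (the class name, field names, and paths are illustrative only):

    import com.twitter.scalding._

    class ScoreCountJob(args: Args) extends Job(args) {
      // Tsv(...).read yields a cascading Pipe; the Dsl implicits (in scope inside a Job)
      // enrich it to RichPipe, so filter/groupBy/write can be called on it directly.
      Tsv(args("input"), ('user, 'score)).read
        .filter('score) { s: Int => s > 0 }
        .groupBy('user) { _.size('count) }
        .write(Tsv(args("output")))
    }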

Source
RichPipe.scala
Linear Supertypes
JoinAlgorithms, Serializable, AnyRef, Any

Instance Constructors

  1. new RichPipe(pipe: Pipe)

    Permalink

Value Members

  1. final def !=(arg0: Any): Boolean

    Permalink
    Definition Classes
    AnyRef → Any
  2. final def ##(): Int

    Permalink
    Definition Classes
    AnyRef → Any
  3. def ++(that: Pipe): Pipe

    Permalink

    Merge or concatenate several pipes together with this one.
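    For instance (a minimal sketch; assumes thisYearEvents and lastYearEvents are pipes with the same field layout):

    val allEvents = thisYearEvents ++ lastYearEvents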

  4. final def ==(arg0: Any): Boolean

    Permalink
    Definition Classes
    AnyRef → Any
  5. def addTrap(trapsource: Source)(implicit flowDef: FlowDef, mode: Mode): Pipe

    Permalink

    Adds a trap to the current pipe, which will capture all exceptions that occur in this pipe and save them to the given trap source.

    Traps do not include the original fields in a tuple, only the fields seen in an operation. Traps also do not include any exception information.

    There can be at most one trap for each pipe.
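    A minimal sketch inside a Job (the trap path, field names, and parse helper are hypothetical):

    // rows whose downstream operations throw are written to the trap source instead of failing the job
    val guarded = pipe.addTrap(Tsv("errors"))
      .map('json -> 'userId) { json: String => parseUserId(json) } // parseUserId may throw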

  6. final def asInstanceOf[T0]: T0

    Permalink
    Definition Classes
    Any
  7. def blockJoinWithSmaller(fs: (Fields, Fields), otherPipe: Pipe, rightReplication: Int = 1, leftReplication: Int = 1, joiner: Joiner = new InnerJoin, reducers: Int = 1): Pipe

    Permalink

    Performs a block join, otherwise known as a replicate-fragment join (RF join). The leftReplication and rightReplication parameters control the replication of the left and right pipes respectively.

    This is useful in cases where the data has extreme skew. A symptom of this is a job stuck for a very long time on a small number of reducers.

    A block join is a way to get around this: we add a random integer field and a replica field to every tuple in the left and right pipes. We then join on the original keys and on these new dummy fields. These dummy fields make it less likely that the skewed keys will be hashed to the same reducer.

    The final data size is right * rightReplication + left * leftReplication, but because of the fragmentation, we are guaranteed the same number of hits as the original join.

    If the right pipe is really small, then you are probably better off with joinWithTiny. If, however, the right pipe is medium sized, then you are better off with blockJoinWithSmaller, and a good rule of thumb is to set rightReplication = left.size / right.size and leftReplication = 1.

    Finally, if both pipes are of similar size, e.g. in the case of a self join with high data skew, then it makes sense to set leftReplication and rightReplication to be approximately equal.

    Note

    You can only use an InnerJoin or a LeftJoin with a leftReplication of 1 (or a RightJoin with a rightReplication of 1) when doing a blockJoin.
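    A minimal sketch (field names and the replication factor are illustrative):

    // 'events' is large and skewed on 'userId; replicate the smaller 'users' pipe 10x
    val joined = events.blockJoinWithSmaller('userId -> 'id, users, rightReplication = 10)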

    Definition Classes
    JoinAlgorithms
  8. def clone(): AnyRef

    Permalink
    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  9. def coGroupBy(f: Fields, j: JoinMode = InnerJoinMode)(builder: (CoGroupBuilder) ⇒ GroupBuilder): Pipe

    Permalink

    This method is used internally to implement all joins. You can use it directly if you want to implement something like a star join, e.g., when joining a single pipe to multiple other pipes. Make sure you call this method on the larger pipe to make the grouping as efficient as possible.

    If you are only joining two pipes, then you are better off using joinWithSmaller/joinWithLarger/joinWithTiny/leftJoinWithTiny.

    Definition Classes
    JoinAlgorithms
  10. def collect[A, T](fs: (Fields, Fields))(fn: PartialFunction[A, T])(implicit conv: TupleConverter[A], setter: TupleSetter[T]): Pipe

    Permalink

    Keeps only the tuples for which the partial function is defined, and applies that function to them.
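    For instance (a minimal sketch with hypothetical 'age and 'ageGroup fields):

    // keep only rows where the partial function matches, emitting the mapped value
    val adults = pipe.collect[Int, String]('age -> 'ageGroup) { case age if age >= 18 => "adult" }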

  11. def collectTo[A, T](fs: (Fields, Fields))(fn: PartialFunction[A, T])(implicit conv: TupleConverter[A], setter: TupleSetter[T]): Pipe

    Permalink
  12. def crossWithSmaller(p: Pipe, replication: Int = 20): Pipe

    Permalink

    Does a cross-product by doing a blockJoin. Useful when doing a large cross, if your cluster can take it. Prefer crossWithTiny.

    Definition Classes
    JoinAlgorithms
  13. def crossWithTiny(tiny: Pipe): Pipe

    Permalink

    WARNING

    Doing a cross product with even a moderately sized pipe can create ENORMOUS output. The use case here is attaching a constant (e.g. a number, a dictionary, or a set) to each row in another pipe. A common use case comes from a groupAll and a reduction to one row; you then want to send the result back out to every element of a pipe.

    This uses joinWithTiny, so the tiny pipe is replicated to all mappers. If it is large, this will blow up. Get it: be foolish here and LOSE IT ALL!

    Use at your own risk.
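    A minimal sketch of the groupAll-then-broadcast pattern described above (field names are illustrative):

    // reduce everything to a single-row pipe, then attach that row to every tuple
    val total = pipe.groupAll { _.size('total) }
    val withTotal = pipe.crossWithTiny(total)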

    Definition Classes
    JoinAlgorithms
  14. def debug(dbg: PipeDebug): Pipe

    Permalink

    Print the tuples that pass, with the options configured in the given debugger. For instance:

    debug(PipeDebug().toStdOut.printTuplesEvery(100))
  15. def debug: Pipe

    Permalink

    Print all the tuples that pass to stderr

  16. def discard(f: Fields): Pipe

    Permalink

    Discard the given fields, and keep the rest. Roughly the opposite of the project method.

  17. def distinct(f: Fields): Pipe

    Permalink

    Returns the set of distinct tuples containing the specified fields

  18. def each(fs: (Fields, Fields))(fn: (Fields) ⇒ Function[_]): Each

    Permalink

    Convenience method for integrating with existing cascading Functions

  19. def eachTo(fs: (Fields, Fields))(fn: (Fields) ⇒ Function[_]): Each

    Permalink

    Same as above, but only keep the results field.

  20. final def eq(arg0: AnyRef): Boolean

    Permalink
    Definition Classes
    AnyRef
  21. def equals(arg0: Any): Boolean

    Permalink
    Definition Classes
    AnyRef → Any
  22. def filter[A](f: Fields)(fn: (A) ⇒ Boolean)(implicit conv: TupleConverter[A]): Pipe

    Permalink

    Keep only items that satisfy this predicate.
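    For instance (hypothetical 'age field):

    val adults = pipe.filter('age) { age: Int => age >= 18 }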

  23. def filterNot[A](f: Fields)(fn: (A) ⇒ Boolean)(implicit conv: TupleConverter[A]): Pipe

    Permalink

    Keep only items that don't satisfy this predicate. filterNot is equivalent to negating a filter operation:

    filterNot('name) { name: String => name contains "a" }

    is the same as:

    filter('name) { name: String => !(name contains "a") }
  24. def finalize(): Unit

    Permalink
    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  25. def flatMap[A, T](fs: (Fields, Fields))(fn: (A) ⇒ TraversableOnce[T])(implicit conv: TupleConverter[A], setter: TupleSetter[T]): Pipe

    Permalink
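    A minimal sketch (assumes a 'line field of text, e.g. from a TextLine source; emits one 'word row per token):

    val words = pipe.flatMap('line -> 'word) { line: String => line.split("""\s+""").toList }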
  26. def flatMapTo[A, T](fs: (Fields, Fields))(fn: (A) ⇒ TraversableOnce[T])(implicit conv: TupleConverter[A], setter: TupleSetter[T]): Pipe

    Permalink
  27. def flatten[T](fs: (Fields, Fields))(implicit conv: TupleConverter[TraversableOnce[T]], setter: TupleSetter[T]): Pipe

    Permalink

    the same as

    flatMap(fs) { it : TraversableOnce[T] => it }

    Common enough to be useful.

  28. def flattenTo[T](fs: (Fields, Fields))(implicit conv: TupleConverter[TraversableOnce[T]], setter: TupleSetter[T]): Pipe

    Permalink

    the same as

    flatMapTo(fs) { it : TraversableOnce[T] => it }

    Common enough to be useful.

  29. lazy val forceToDisk: Pipe

    Permalink

    Force a materialization to disk in the flow. This is useful before crossWithTiny if you filter just before. Ideally scalding/cascading would detect this (and may in future versions), but for now it is here to aid in hand-tuning jobs.

  30. final def getClass(): Class[_]

    Permalink
    Definition Classes
    AnyRef → Any
  31. def groupAll(gs: (GroupBuilder) ⇒ GroupBuilder): Pipe

    Permalink

    Warning

    This kills parallelism. All the work is sent to one reducer.

    Only use this in the case that you truly need all the data on one reducer.

    Just about the only reasonable use of this method is to reduce all values of a column or to count all the rows.
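    For instance (a minimal sketch that counts all rows):

    // all tuples go to a single reducer, which emits one row with the total count
    val rowCount = pipe.groupAll { _.size('numRows) }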

  32. def groupAll: Pipe

    Permalink

    Group all tuples down to one reducer (due to a cascading limitation). This is probably only useful just before setting a tail such as a database tail, so that only one reducer talks to the DB. Kind of a hack.

  33. def groupAndShuffleRandomly(reducers: Int, seed: Long)(gs: (GroupBuilder) ⇒ GroupBuilder): Pipe

    Permalink

    Like groupAndShuffleRandomly(reducers : Int) but with a fixed seed.

  34. def groupAndShuffleRandomly(reducers: Int)(gs: (GroupBuilder) ⇒ GroupBuilder): Pipe

    Permalink

    Like shard, except do some operation in the reducers.

  35. def groupBy(f: Fields)(builder: (GroupBuilder) ⇒ GroupBuilder): Pipe

    Permalink

    Group the Pipe based on the given fields.

    builder is typically a block that modifies the given GroupBuilder. The final output of the block is used to schedule the new pipe. Each method in GroupBuilder returns this, so it is recommended to chain them and use the default input:

    _.size.max('f1) etc...
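    A slightly fuller sketch (hypothetical 'user and 'score fields):

    // one output row per user, with the number of rows and the maximum score
    val perUser = pipe.groupBy('user) { _.size('count).max('score) }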
  36. def groupRandomly(n: Int, seed: Long)(gs: (GroupBuilder) ⇒ GroupBuilder): Pipe

    Permalink

    Like groupRandomly(n : Int), but with a given seed for the randomization.

  37. def groupRandomly(n: Int)(gs: (GroupBuilder) ⇒ GroupBuilder): Pipe

    Permalink

    Like groupAll, but randomly groups data into n reducers.

    You can provide a seed for the random number generator to get reproducible results.

  38. def groupRandomlyAux(n: Int, optSeed: Option[Long])(gs: (GroupBuilder) ⇒ GroupBuilder): Pipe

    Permalink
    Attributes
    protected
  39. def hashCode(): Int

    Permalink
    Definition Classes
    AnyRef → Any
  40. def insert[A](fs: Fields, value: A)(implicit setter: TupleSetter[A]): Pipe

    Permalink

    Adds a field with a constant value.

    Usage

    insert('a, 1)
  41. final def isInstanceOf[T0]: Boolean

    Permalink
    Definition Classes
    Any
  42. def joinWithLarger(fs: (Fields, Fields), that: Pipe, joiner: Joiner = new InnerJoin, reducers: Int = 1): Pipe

    Permalink

    The same as joinWithSmaller with the argument order reversed.

    Definition Classes
    JoinAlgorithms
  43. def joinWithSmaller(fs: (Fields, Fields), that: Pipe, joiner: Joiner = new InnerJoin, reducers: Int = 1): Pipe

    Permalink

    Joins the first set of keys in the first pipe to the second set of keys in the second pipe. All keys must be unique UNLESS it is an inner join, in which case duplicated join keys are allowed, but the second copy is deleted (as cascading does not allow duplicated field names).

    Smaller here means that the values per key are smaller than on the left.

    Avoid going crazy adding more explicit join modes. Instead, for some other join mode with a larger pipe, do:

    .then { pipe => other.
              joinWithSmaller(('other1, 'other2)->('this1, 'this2), pipe, new FancyJoin)
          }
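    For the common case, a minimal sketch (field names are illustrative; the right-hand pipe should be the one with fewer values per key):

    val withCity = people.joinWithSmaller('cityId -> 'id, cities)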
    Definition Classes
    JoinAlgorithms
  44. def joinWithTiny(fs: (Fields, Fields), that: Pipe): Pipe

    Permalink

    This does an asymmetric join, using cascading's "HashJoin". This only runs through this pipe once, and keeps the right-hand-side pipe in memory (but is spillable).

    Choose this when Left > max(mappers, reducers) * Right, or when the left side is three orders of magnitude larger.

    Joins the first set of keys in the first pipe to the second set of keys in the second pipe. Duplicated join keys are allowed, but the second copy is deleted (as cascading does not allow duplicated field names).

    Warning

    This does not work with outer joins, or right joins, only inner and left join versions are given.

    Definition Classes
    JoinAlgorithms
  45. def joinerToJoinModes(j: Joiner): (Product with Serializable with JoinMode, Product with Serializable with JoinMode)

    Permalink
    Definition Classes
    JoinAlgorithms
  46. def leftJoinWithLarger(fs: (Fields, Fields), that: Pipe, reducers: Int = 1): Pipe

    Permalink

    This is joinWithLarger with the joiner parameter fixed to LeftJoin. If the item is absent on the right, put null for the keys and values.

    Definition Classes
    JoinAlgorithms
  47. def leftJoinWithSmaller(fs: (Fields, Fields), that: Pipe, reducers: Int = 1): Pipe

    Permalink

    This is joinWithSmaller with the joiner parameter fixed to LeftJoin. If the item is absent on the right, put null for the keys and values.

    Definition Classes
    JoinAlgorithms
  48. def leftJoinWithTiny(fs: (Fields, Fields), that: Pipe): HashJoin

    Permalink
    Definition Classes
    JoinAlgorithms
  49. def limit(n: Long): Pipe

    Permalink

    Keep at most n elements. This is implemented by keeping approximately n/k elements on each of the k mappers or reducers (whichever we wind up being scheduled on).

  50. def map[A, T](fs: (Fields, Fields))(fn: (A) ⇒ T)(implicit conv: TupleConverter[A], setter: TupleSetter[T]): Pipe

    Permalink

    If you use a map function that does not accept TupleEntry args, which is the common case, an implicit conversion in GeneratedConversions will convert your function into a (TupleEntry => T). The result type T is converted to a cascading Tuple by an implicit TupleSetter[T]. Acceptable T types are primitive types, cascading Tuples of those types, or scala.Tuple(1-22) of those types.

    After the map, the input arguments will be set to the output of the map, so following with filter or map is fine without a new using statement if you mean to operate on the output.

    map('data -> 'stuff)

    * if output equals input, REPLACE is used
    * if output or input is a subset of the other, SWAP is used
    * otherwise, we append the new fields (cascading Fields.ALL is used)

    mapTo('data -> 'stuff)

    Only the results (stuff) are kept (cascading Fields.RESULTS)

    Note

    Using mapTo is the same as using map followed by a project for selecting just the output fields
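    For instance (hypothetical temperature fields):

    // appends 'fahrenheit next to 'celsius; mapTo would keep only 'fahrenheit
    val converted = pipe.map('celsius -> 'fahrenheit) { c: Double => c * 9.0 / 5.0 + 32.0 }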

  51. def mapTo[A, T](fs: (Fields, Fields))(fn: (A) ⇒ T)(implicit conv: TupleConverter[A], setter: TupleSetter[T]): Pipe

    Permalink
  52. def name(s: String): Pipe

    Permalink

    Rename the current pipe

  53. final def ne(arg0: AnyRef): Boolean

    Permalink
    Definition Classes
    AnyRef
  54. def normalize(f: Fields, useTiny: Boolean = true): Pipe

    Permalink

    Divides each value of this field by the sum of all values; assumes without checking that division is supported on this type and that the sum is not zero.

    If those assumptions do not hold, this will throw an exception -- consider checking the sum separately and/or using addTrap.

    In some cases crossWithTiny has been broken; the implementation supports a work-around.

  55. final def notify(): Unit

    Permalink
    Definition Classes
    AnyRef
  56. final def notifyAll(): Unit

    Permalink
    Definition Classes
    AnyRef
  57. def pack[T](fs: (Fields, Fields))(implicit packer: TuplePacker[T], setter: TupleSetter[T]): Pipe

    Permalink

    Maps the input fields into an output field of type T. For example:

    pipe.pack[(Int, Int)] (('field1, 'field2) -> 'field3)

    will pack fields 'field1 and 'field2 into field 'field3, as long as 'field1 and 'field2 can be cast into integers. The output field 'field3 will be a tuple of type (Int, Int).

  58. def packTo[T](fs: (Fields, Fields))(implicit packer: TuplePacker[T], setter: TupleSetter[T]): Pipe

    Permalink

    Same as pack but only the to fields are preserved.

  59. def partition[A, R](fs: (Fields, Fields))(fn: (A) ⇒ R)(builder: (GroupBuilder) ⇒ GroupBuilder)(implicit conv: TupleConverter[A], ord: Ordering[R], rset: TupleSetter[R]): Pipe

    Permalink

    Given a function, partitions the pipe into several groups based on the output of the function. Then applies a GroupBuilder function on each of the groups.

    Example:

    pipe
      .mapTo(() -> ('age, 'weight)) { ... }
      .partition('age -> 'isAdult) { _ > 18 } { _.average('weight) }

    pipe now contains the average weights of adults and minors.

  60. val pipe: Pipe

    Permalink
    Definition Classes
    RichPipe → JoinAlgorithms
  61. def project(fields: Fields): Pipe

    Permalink

    Keep only the given fields, and discard the rest. Takes any number of parameters, as long as we can convert them to a Fields object.

  62. def rename(fields: (Fields, Fields)): Pipe

    Permalink

    Rename some set of N fields as another set of N fields

    Usage

    rename('x -> 'z)
    rename(('x,'y) -> ('X,'Y))

    Warning

    rename('x,'y) is interpreted by scala as rename(Tuple2('x,'y)) which then does rename('x -> 'y). This is probably not what is intended but the compiler doesn't resolve the ambiguity. YOU MUST CALL THIS WITH A TUPLE2! If you don't, expect the unexpected.

  63. def sample(fraction: Double, seed: Long): Pipe

    Permalink
  64. def sample(fraction: Double): Pipe

    Permalink

    Sample a fraction of elements. fraction should be between 0.00 (0%) and 1.00 (100%). You can provide a seed to get reproducible results.

  65. def sampleWithReplacement(fraction: Double, seed: Int): Pipe

    Permalink
  66. def sampleWithReplacement(fraction: Double): Pipe

    Permalink

    Sample a fraction of elements, with replacement. fraction should be between 0.00 (0%) and 1.00 (100%). You can provide a seed to get reproducible results.

  67. def shard(n: Int, seed: Int): Pipe

    Permalink

    Force a random shuffle of all the data to exactly n reducers, with a given seed if you need repeatability.

  68. def shard(n: Int): Pipe

    Permalink

    Force a random shuffle of all the data to exactly n reducers

  69. def shuffle(shards: Int, seed: Long): Pipe

    Permalink
  70. def shuffle(shards: Int): Pipe

    Permalink

    Put all rows in random order.

    You can provide a seed for the random number generator to get reproducible results.

  71. def skewJoinWithLarger(fs: (Fields, Fields), otherPipe: Pipe, sampleRate: Double = 0.001, reducers: Int = 1, replicator: SkewReplication = SkewReplicationA()): Pipe

    Permalink
    Definition Classes
    JoinAlgorithms
  72. def skewJoinWithSmaller(fs: (Fields, Fields), otherPipe: Pipe, sampleRate: Double = 0.001, reducers: Int = 1, replicator: SkewReplication = SkewReplicationA()): Pipe

    Permalink

    Performs a skewed join, which is useful when the data has extreme skew.

    For example, imagine joining a pipe of Twitter's follow graph against a pipe of user genders, in order to find the gender distribution of the accounts every Twitter user follows. Since celebrities (e.g., Justin Bieber and Lady Gaga) have a much larger follower base than other users, and (under a standard join algorithm) all their followers get sent to the same reducer, the job will likely be stuck on a few reducers for a long time. A skewed join attempts to alleviate this problem.

    This works as follows:

    1. First, we sample from the left and right pipes with some small probability, in order to determine approximately how often each join key appears in each pipe.
    2. We use these estimated counts to replicate the join keys, according to the given replication strategy.
    3. Finally, we join the replicated pipes together.

    sampleRate

    This controls how often we sample from the left and right pipes when estimating key counts.

    replicator

    Algorithm for determining how much to replicate a join key in the left and right pipes. Note: since we do not set the replication counts, only inner joins are allowed. (Otherwise, replicated rows would stay replicated when there is no counterpart in the other pipe.)
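    A minimal sketch of the follow-graph example above (field names are illustrative; only inner joins are supported):

    // join the skewed follow graph against user genders, sampling keys at 0.1%
    val followedGenders = follows.skewJoinWithSmaller('followedId -> 'userId, genders, sampleRate = 0.001)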

    Definition Classes
    JoinAlgorithms
  73. final def synchronized[T0](arg0: ⇒ T0): T0

    Permalink
    Definition Classes
    AnyRef
  74. def thenDo[T, U](pfn: (T) ⇒ U)(implicit in: (RichPipe) ⇒ T): U

    Permalink

    Insert a function into the pipeline.
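    For instance (a minimal sketch inside a Job, where the Dsl implicit conversions are in scope; reusablePart is a hypothetical helper):

    // factor a reusable transformation into a function and splice it into the chain
    def reusablePart(p: Pipe): Pipe = p.filter('score) { s: Int => s > 0 }
    val cleaned = pipe.thenDo(reusablePart _)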

  75. def toString(): String

    Permalink
    Definition Classes
    AnyRef → Any
  76. def unique(f: Fields): Pipe

    Permalink

    Returns the set of unique tuples containing the specified fields. Same as distinct.

  77. def unpack[T](fs: (Fields, Fields))(implicit unpacker: TupleUnpacker[T], conv: TupleConverter[T]): Pipe

    Permalink

    The opposite of pack. Unpacks the input field of type T into the output fields. For example:

    pipe.unpack[(Int, Int)] ('field1 -> ('field2, 'field3))

    will unpack 'field1 into 'field2 and 'field3

  78. def unpackTo[T](fs: (Fields, Fields))(implicit unpacker: TupleUnpacker[T], conv: TupleConverter[T]): Pipe

    Permalink

    Same as unpack but only the to fields are preserved.

  79. def unpivot(fieldDef: (Fields, Fields)): Pipe

    Permalink

    This is an analog of the SQL/Excel unpivot function, which converts columns of data into rows of data. Only the columns given as input fields are expanded in this way. For this operation to be reversible, you need to keep some unique key on each row. See GroupBuilder.pivot to reverse this operation, assuming you leave behind a grouping key.

    Example

    pipe.unpivot(('w,'x,'y,'z) -> ('feature, 'value))

    takes rows like:

    key, w, x, y, z
    1, 2, 3, 4, 5
    2, 8, 7, 6, 5

    to:

    key, feature, value
    1, w, 2
    1, x, 3
    1, y, 4

    etc...

  80. def upstreamPipes: Set[Pipe]

    Permalink

    Set of pipes reachable from this pipe (transitive closure of 'Pipe.getPrevious')

  81. def using[C <: AnyRef { def release(): Unit }](bf: ⇒ C): AnyRef { ... /* 3 definitions in type refinement */ }

    Permalink

    Beginning of a block with access to expensive nonserializable state. The state object should contain a release() function for resource-management purposes.
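    A minimal sketch (ExpensiveParser is a hypothetical nonserializable resource; the refinement's map variant is assumed to pass the resource alongside each converted input value):

    class ExpensiveParser {
      def parse(s: String): Int = s.trim.toInt // stand-in for costly, nonserializable work
      def release(): Unit = ()                 // called to clean up the resource
    }
    val parsed = pipe.using(new ExpensiveParser)
      .map('raw -> 'value) { (parser: ExpensiveParser, raw: String) => parser.parse(raw) }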

  82. def verifyTypes[A](f: Fields)(implicit conv: TupleConverter[A]): Pipe

    Permalink

    Text files can have corrupted data. If you use this function together with a cascading trap, you can filter corrupted data out of your pipe.
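    For instance (a minimal sketch; the trap source and field names are illustrative):

    // rows whose ('id, 'name) fields cannot be converted to (Long, String) go to the trap
    val checked = pipe.addTrap(Tsv("corrupt-rows"))
      .verifyTypes[(Long, String)](('id, 'name))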

  83. final def wait(): Unit

    Permalink
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  84. final def wait(arg0: Long, arg1: Int): Unit

    Permalink
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  85. final def wait(arg0: Long): Unit

    Permalink
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  86. def write(outsource: Source)(implicit flowDef: FlowDef, mode: Mode): Pipe

    Permalink

    Write all the tuples to the given source and return this Pipe

Inherited from JoinAlgorithms

Inherited from Serializable

Inherited from AnyRef

Inherited from Any
