com.twitter.scalding.typed

TypedPipe

trait TypedPipe[+T] extends Serializable

Think of a TypedPipe as a distributed unordered list that may or may not yet have been materialized in memory or disk.

Represents a phase in a distributed computation on an input data source. Wraps a cascading Pipe object, and holds the transformations done up until that point.

Source
TypedPipe.scala
Linear Supertypes
Serializable, AnyRef, Any

Abstract Value Members

  1. abstract def cross[U](tiny: TypedPipe[U]): TypedPipe[(T, U)]

    Implements a cross product. The right side should be tiny. This gives the same results as for { l <- list1; l2 <- list2 } yield (l, l2).
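
    As a rough local illustration of that equivalence, the sketch below uses plain Scala Lists in place of TypedPipes. The object name and sample values are assumptions made for the example, and no Scalding API is called.

    ```scala
    object CrossSketch {
      // Plain Lists stand in for the two pipes (an assumption for illustration).
      val list1: List[Int] = List(1, 2)
      val list2: List[String] = List("a", "b")

      // The for-comprehension from the description:
      // every left item is paired with every right item.
      val crossed: List[(Int, String)] =
        for { l <- list1; l2 <- list2 } yield (l, l2)
    }
    ```

    A TypedPipe is unordered, so only the set of pairs is meaningful, not the order in which a List happens to produce them.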

  2. abstract def flatMap[U](f: (T) ⇒ TraversableOnce[U]): TypedPipe[U]

    This is the fundamental mapper operation. It behaves in a way similar to List.flatMap, which means that each item is fed to the input function, which can return 0, 1, or many outputs (as a TraversableOnce) per input. The returned results will be iterated through once and then flattened into a single TypedPipe which is passed to the next step in the pipeline.

    This behavior makes it a powerful operator -- it can be used to filter records (by returning 0 items for a given input), it can be used the way map is used (by returning 1 item per input), it can be used to explode 1 input into many outputs, or even a combination of all of the above at once.
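
    Since the description compares this to List.flatMap, the four uses can be sketched on a local List. This is only an analogy under that assumption: the object name and values are made up, and a real TypedPipe would not guarantee any ordering.

    ```scala
    object FlatMapSketch {
      val input: List[Int] = List(1, 2, 3, 4)

      // Filter: return 0 items for inputs to drop (keeps even numbers).
      val filtered: List[Int] =
        input.flatMap(x => if (x % 2 == 0) List(x) else Nil)

      // Map: return exactly 1 item per input.
      val mapped: List[Int] = input.flatMap(x => List(x * 10))

      // Explode: return many items for one input (x copies of x).
      val exploded: List[Int] = input.flatMap(x => List.fill(x)(x))

      // Combination: drop odd inputs, emit each even input and its double.
      val combined: List[Int] =
        input.flatMap(x => if (x % 2 == 0) List(x, x * 2) else Nil)
    }
    ```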

Concrete Value Members

  1. final def !=(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  2. final def ##(): Int

    Definition Classes
    AnyRef → Any
  3. def ++[U >: T](other: TypedPipe[U]): TypedPipe[U]

    Merge two TypedPipes (no order is guaranteed). This is only realized when a group (or join) is performed.

  4. final def ==(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  5. def addTrap[U >: T](trapSink: Source with TypedSink[T])(implicit conv: TupleConverter[U]): TypedPipe[U]

    If any errors happen below this line, but before a groupBy, write to a TypedSink

  6. def aggregate[B, C](agg: Aggregator[T, B, C]): ValuePipe[C]

    Aggregate all items in this pipe into a single ValuePipe.

    Aggregators are composable reductions that allow you to glue together several reductions and process them in one pass.

    Same as groupAll.aggregate.values

  7. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  8. def asKeys[U >: T](implicit ord: Ordering[U]): Grouped[U, Unit]

    Put the items in this pipe into the keys, with Unit as the value, in a Grouped. In some sense, this is the dual of groupAll.

    Annotations
    @implicitNotFound( ... )
  9. def clone(): AnyRef

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  10. def collect[U](fn: PartialFunction[T, U]): TypedPipe[U]

    Filter and map. See scala.collection.List.collect. For example: collect { case Some(x) => fn(x) }
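
    The pattern in the description can be tried on a local List, which is the collection the description points to. In this sketch fn is a hypothetical transformation and the object name is made up; nothing here calls Scalding.

    ```scala
    object CollectSketch {
      val input: List[Option[Int]] = List(Some(1), None, Some(3))

      // fn is a hypothetical transformation applied to the items that match.
      def fn(x: Int): Int = x + 1

      // Items not matched by the partial function (here, None) are dropped.
      val collected: List[Int] = input.collect { case Some(x) => fn(x) }
    }
    ```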

  11. def cross[V](p: ValuePipe[V]): TypedPipe[(T, V)]

    Attach a ValuePipe to each element of this TypedPipe.

  12. def debug: TypedPipe[T]

    Prints the current pipe to stdout.

  13. def distinct(implicit ord: Ordering[_ >: T]): TypedPipe[T]

    Returns the set of distinct elements in the TypedPipe. This is the same as: .map((_, ())).group.sum.keys

    If you want a distinct while joining, consider, instead of a.join(b.distinct.asKeys), manually doing the distinct: a.join(b.asKeys.sum). The latter creates 1 map/reduce phase rather than 2.
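
    To make the map, group, sum, keys chain concrete, here is a minimal local model using plain Scala collections. It assumes that summing Unit values collapses them to a single Unit; the object and value names are illustrative and this is not Scalding code.

    ```scala
    object DistinctSketch {
      val input: List[String] = List("a", "b", "a", "c", "b")

      // Step 1: .map((_, ())) pairs each item with Unit.
      val keyed: List[(String, Unit)] = input.map((_, ()))

      // Step 2: .group gathers the pairs by key (modeled with groupBy).
      val grouped: Map[String, List[Unit]] =
        keyed.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }

      // Step 3: .sum collapses each group's Unit values to a single Unit.
      val summed: Map[String, Unit] = grouped.map { case (k, _) => (k, ()) }

      // Step 4: .keys leaves one entry per distinct item.
      val distinctItems: Set[String] = summed.keySet
    }
    ```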

    Annotations
    @implicitNotFound( ... )
  14. def distinctBy[U](fn: (T) ⇒ U, numReducers: Option[Int] = None)(implicit ord: Ordering[_ >: U]): TypedPipe[T]

    Returns the set of distinct elements in the TypedPipe, as identified by a given lambda extractor.

    Annotations
    @implicitNotFound( ... )
  15. def either[R](that: TypedPipe[R]): TypedPipe[Either[T, R]]

    Merge two TypedPipes of different types by using Either

  16. def eitherValues[K, V, R](that: TypedPipe[(K, R)])(implicit ev: <:<[T, (K, V)]): TypedPipe[(K, Either[V, R])]

    Sometimes useful for implementing custom joins with groupBy + mapValueStream when you know that the value/key can fit in memory. Beware.

  17. final def eq(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  18. def equals(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  19. def filter(f: (T) ⇒ Boolean): TypedPipe[T]

    Keep only items that satisfy this predicate

  20. def filterKeys[K](fn: (K) ⇒ Boolean)(implicit ev: <:<[T, (K, Any)]): TypedPipe[T]

    If T is a (K, V) for some V, then we can use this function to filter. Prefer to use this if your filter only touches the key.

    This is here to match the function in KeyedListLike, where it is optimized

  21. def filterNot(f: (T) ⇒ Boolean): TypedPipe[T]

    Keep only items that don't satisfy the predicate. filterNot is the same as filter with a negated predicate.

  22. def filterWithValue[U](value: ValuePipe[U])(f: (T, Option[U]) ⇒ Boolean): TypedPipe[T]

    Common pattern of attaching a value and then filtering. Recommended style:

        filterWithValue(vpu) {
          case (t, Some(u)) => op(t, u)
          case (t, None) => // if you never expect this:
            sys.error("unexpected empty value pipe")
        }

  23. def finalize(): Unit

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  24. def flatMapValues[K, V, U](f: (V) ⇒ TraversableOnce[U])(implicit ev: <:<[T, (K, V)]): TypedPipe[(K, U)]

    Similar to mapValues, but allows returning a collection of outputs for each input value.

  25. def flatMapWithValue[U, V](value: ValuePipe[U])(f: (T, Option[U]) ⇒ TraversableOnce[V]): TypedPipe[V]

    Common pattern of attaching a value and then flatMapping. Recommended style:

        flatMapWithValue(vpu) {
          case (t, Some(u)) => op(t, u)
          case (t, None) => // if you never expect this:
            sys.error("unexpected empty value pipe")
        }

  26. def flatten[U](implicit ev: <:<[T, TraversableOnce[U]]): TypedPipe[U]

    Flatten an Iterable.

  27. def flattenValues[K, U](implicit ev: <:<[T, (K, TraversableOnce[U])]): TypedPipe[(K, U)]

    Flatten just the values. This is more useful on KeyedListLike, but added here to reduce asymmetry in the APIs.

  28. def forceToDisk: TypedPipe[T]

    Force a materialization of this pipe prior to the next operation. This is useful if you filter almost everything before a hashJoin, for instance. This is useful for experts who see some heuristic of the planner causing slower performance.

  29. def forceToDiskExecution: Execution[TypedPipe[T]]

    This is used when you are working with Execution[T] to create loops. You might do this to checkpoint and then flatMap Execution to continue from there. Probably only useful if you need to flatMap it twice to fan out the data into two children jobs.

    This writes the current TypedPipe into a temporary file and then opens it after the write completes, so that you can continue from that point.

  30. def fork: TypedPipe[T]

    If you are going to create two branches or forks, it may be more efficient to call this method first which will create a node in the cascading graph. Without this, both full branches of the fork will be put into separate cascading pipes, which can, in some cases, be slower.

    Ideally the planner would see this

  31. final def getClass(): Class[_]

    Definition Classes
    AnyRef → Any
  32. def group[K, V](implicit ev: <:<[T, (K, V)], ord: Ordering[K]): Grouped[K, V]

    This is the default means of grouping all pairs with the same key. Generally this triggers 1 Map/Reduce transition.

  33. def groupAll: Grouped[Unit, T]

    Send all items to a single reducer

  34. def groupBy[K](g: (T) ⇒ K)(implicit ord: Ordering[K]): Grouped[K, T]

    Given a key function, add the key, then call .group

  35. def groupRandomly(partitions: Int): Grouped[Int, T]

    Forces a shuffle by randomly assigning each item into one of the partitions.

    This is for the case where your mappers take a long time, and it is faster to shuffle the items to more reducers and then operate.

    You probably want shard if you are just forcing a shuffle.

  36. def groupWith[K, V](ord: Ordering[K])(implicit ev: <:<[T, (K, V)]): Grouped[K, V]

    Group using an explicit Ordering on the key.

  37. def hashCode(): Int

    Definition Classes
    AnyRef → Any
  38. def hashCogroup[K, V, W, R](smaller: HashJoinable[K, W])(joiner: (K, V, Iterable[W]) ⇒ Iterator[R])(implicit ev: <:<[TypedPipe[T], TypedPipe[(K, V)]]): TypedPipe[(K, R)]

    These operations look like joins, but they do not force any communication of the current TypedPipe. They are mapping operations where this pipe is streamed through one item at a time.

    WARNING: These behave semantically very differently than cogroup. This is because we handle (K, V) tuples on the left as we see them. The iterable on the right is over all elements with a matching key K, and it may be empty if there are no values for this key K.

  39. def hashJoin[K, V, W](smaller: HashJoinable[K, W])(implicit ev: <:<[TypedPipe[T], TypedPipe[(K, V)]]): TypedPipe[(K, (V, W))]

    Do an inner join without shuffling this TypedPipe, but replicating the argument to all tasks.

  40. def hashLeftJoin[K, V, W](smaller: HashJoinable[K, W])(implicit ev: <:<[TypedPipe[T], TypedPipe[(K, V)]]): TypedPipe[(K, (V, Option[W]))]

    Do a left join without shuffling this TypedPipe, but replicating the argument to all tasks.
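
    The idea behind the hash joins (stream the large side, keep the small side in memory) can be modeled locally. The sketch below is a plain-collections analogy, not the Scalding implementation: a Map stands in for the replicated smaller side, and all names and values are assumptions.

    ```scala
    object HashJoinSketch {
      // The larger, streamed side: stands in for this TypedPipe[(K, V)].
      val left: List[(String, Int)] = List(("a", 1), ("b", 2), ("c", 3))

      // The smaller side, assumed to fit in memory, indexed by key.
      val smaller: Map[String, List[String]] =
        List(("a", "x"), ("a", "y"), ("b", "z"))
          .groupBy(_._1)
          .map { case (k, kws) => (k, kws.map(_._2)) }

      // Inner join: a left item is emitted once per matching right value,
      // and dropped if no right value matches.
      val inner: List[(String, (Int, String))] =
        left.flatMap { case (k, v) =>
          smaller.getOrElse(k, Nil).map(w => (k, (v, w)))
        }

      // Left join: unmatched left items are kept, paired with None.
      val leftJoined: List[(String, (Int, Option[String]))] =
        left.flatMap { case (k, v) =>
          smaller.get(k) match {
            case Some(ws) => ws.map(w => (k, (v, Option(w))))
            case None     => List((k, (v, Option.empty[String])))
          }
        }
    }
    ```

    The left side is only ever traversed one item at a time, which is why no shuffle of that side is needed in this model.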

  41. def hashLookup[K >: T, V](grouped: HashJoinable[K, V]): TypedPipe[(K, Option[V])]

    For each element, do a map-side (hash) left join to look up a value

  42. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  43. def keys[K](implicit ev: <:<[T, (K, Any)]): TypedPipe[K]

    Just keep the keys, or ._1 (if this type is a Tuple2)

  44. def leftCross[V](thatPipe: TypedPipe[V]): TypedPipe[(T, Option[V])]

    Uses hashJoin, but attaches None if thatPipe is empty.

  45. def leftCross[V](p: ValuePipe[V]): TypedPipe[(T, Option[V])]

    The ValuePipe may be empty, so this attaches it as an Option. cross is the same as leftCross(p).collect { case (t, Some(v)) => (t, v) }
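
    The relationship between cross and leftCross can be checked on a small local model. Here an Option stands in for the ValuePipe and a List for the TypedPipe; both stand-ins, and the helper names, are assumptions made for illustration.

    ```scala
    object LeftCrossSketch {
      val items: List[Int] = List(1, 2, 3)

      // An Option models the ValuePipe: Some(v) if it holds a value, None if empty.
      def leftCross[T, V](ts: List[T], p: Option[V]): List[(T, Option[V])] =
        ts.map(t => (t, p))

      // cross expressed through leftCross, as in the description above.
      def cross[T, V](ts: List[T], p: Option[V]): List[(T, V)] =
        leftCross(ts, p).collect { case (t, Some(v)) => (t, v) }

      val withValue: List[(Int, String)] = cross(items, Some("v"))
      val withEmpty: List[(Int, String)] = cross(items, Option.empty[String])
    }
    ```

    In this model, crossing with an empty value yields no output at all, while leftCross keeps every item and pairs it with None.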

  46. def limit(count: Int): TypedPipe[T]

    Limit the output to at most count items, if at least count items exist.

  47. def make[U >: T](dest: Source with TypedSink[T] with TypedSource[U]): Execution[TypedPipe[U]]

    If you want to writeThrough to a specific file if it doesn't already exist, and otherwise just read from it going forward, use this.

  48. def map[U](f: (T) ⇒ U): TypedPipe[U]

    Transform each element via the function f

  49. def mapValues[K, V, U](f: (V) ⇒ U)(implicit ev: <:<[T, (K, V)]): TypedPipe[(K, U)]

    Transform only the values (sometimes requires giving the types explicitly due to Scala type inference).

  50. def mapWithValue[U, V](value: ValuePipe[U])(f: (T, Option[U]) ⇒ V): TypedPipe[V]

    Common pattern of attaching a value and then mapping. Recommended style:

        mapWithValue(vpu) {
          case (t, Some(u)) => op(t, u)
          case (t, None) => // if you never expect this:
            sys.error("unexpected empty value pipe")
        }

  51. final def ne(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  52. final def notify(): Unit

    Definition Classes
    AnyRef
  53. final def notifyAll(): Unit

    Definition Classes
    AnyRef
  54. def onComplete(fn: () ⇒ Unit): TypedPipe[T]

    This attaches a function that is called at the end of the map phase on EACH of the tasks that are executing. This is for expert use only. You probably won't ever need it. Try hard to avoid it. Execution also has onComplete that can run when an Execution has completed.

  55. def onRawSingle(onPipe: (Pipe) ⇒ Pipe): TypedPipe[T]

    Attributes
    protected
  56. def partition(p: (T) ⇒ Boolean): (TypedPipe[T], TypedPipe[T])

    Partitions this into two pipes according to a predicate.

    Sometimes what you really want is a groupBy in these cases.

  57. def raiseTo[U](implicit ev: <:<[T, U]): TypedPipe[U]

    If T <:< U, then this is safe to treat as TypedPipe[U] due to covariance

    Attributes
    protected
  58. def sample(fraction: Double, seed: Long): TypedPipe[T]

    Sample a fraction (between 0 and 1) of the pipe by choosing each element uniformly and independently at random, with a given seed. Does not require a reduce step.
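
    A minimal local sketch of independent per-element sampling with a seed follows. It uses scala.util.Random on a List; how Scalding actually seeds its generator across tasks is not stated here, so treat the helper and its names as assumptions.

    ```scala
    import scala.util.Random

    object SampleSketch {
      // Keep each element independently with probability `fraction`,
      // drawing from a generator created from the given seed.
      def sample[T](items: List[T], fraction: Double, seed: Long): List[T] = {
        val rng = new Random(seed)
        items.filter(_ => rng.nextDouble() < fraction)
      }

      val input: List[Int] = (1 to 1000).toList

      // The same seed over the same input selects the same elements.
      val first: List[Int] = sample(input, 0.1, 42L)
      val second: List[Int] = sample(input, 0.1, 42L)
    }
    ```

    Each element is decided on its own, so the size of the result only approximates fraction times the input size; no grouping or reduce step is involved.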

  59. def sample(fraction: Double): TypedPipe[T]

    Sample a fraction (between 0 and 1) of the pipe by choosing each element uniformly and independently at random. Does not require a reduce step.

  60. def shard(partitions: Int): TypedPipe[T]

    Used to force a shuffle into a given number of partitions. Only use this if your mappers are taking far longer than the time to shuffle.

  61. def sketch[K, V](reducers: Int, eps: Double = 1.0E-5, delta: Double = 0.01, seed: Int = 12345)(implicit ev: <:<[TypedPipe[T], TypedPipe[(K, V)]], serialization: (K) ⇒ Array[Byte], ordering: Ordering[K]): Sketched[K, V]

    Enables joining when this TypedPipe has some keys with very many values but many keys with very few values. For instance, a graph where some nodes have millions of neighbors, but most have only a few.

    We build a (count-min) sketch of each key's frequency, and we use that to shard the heavy keys across many reducers. This increases communication cost in order to reduce the maximum time needed to complete the join.

    pipe.sketch(100).join(thatPipe) will add an extra map/reduce job over a standard join to create the count-min-sketch. This will generally only be beneficial if you have really heavy skew, where without this you have 1 or 2 reducers taking hours longer than the rest.

  62. def sum[U >: T](implicit plus: Semigroup[U]): ValuePipe[U]

    Reasonably common shortcut for cases of total associative/commutative reduction. Returns a ValuePipe with only one element if there is any input, otherwise EmptyValue.

  63. def sumByKey[K, V](implicit ev: <:<[T, (K, V)], ord: Ordering[K], plus: Semigroup[V]): UnsortedGrouped[K, V]

    Reasonably common shortcut for cases of associative/commutative reduction by Key

  64. def sumByLocalKeys[K, V](implicit ev: <:<[T, (K, V)], sg: Semigroup[V]): TypedPipe[(K, V)]

    This does a sum of values WITHOUT triggering a shuffle. The contract is: if followed by a group.sum, the result is the same with or without this present, and it never increases the number of items. BUT due to the cost of caching, it might not be faster if there is poor key locality.

    It is only useful for expert tuning, and best avoided unless you are struggling with performance problems. If you are not sure you need this, you probably don't.

    The main use case is to reduce the values down before a key expansion such as is often done in a data cube.
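
    The contract can be illustrated with a small local model. In the sketch below each fixed-size chunk of a List plays the role of one mapper's input; the chunking, the helper names and the Int values are all assumptions, and this is not how Scalding implements the operation.

    ```scala
    object LocalSumSketch {
      val input: List[(String, Int)] =
        List(("a", 1), ("b", 2), ("a", 3), ("a", 4), ("b", 5), ("c", 6))

      // Stand-in for group.sum: sum all values per key.
      def groupSum(kvs: List[(String, Int)]): Map[String, Int] =
        kvs.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }

      // Hypothetical model of a local sum: each chunk is pre-summed on its own,
      // with no data moving between chunks.
      def localSum(kvs: List[(String, Int)], chunkSize: Int): List[(String, Int)] =
        kvs.grouped(chunkSize).flatMap(chunk => groupSum(chunk).toList).toList

      val preSummed: List[(String, Int)] = localSum(input, chunkSize = 3)
    }
    ```

    Summing by key after the local step gives the same totals as summing the raw input, and the pre-summed list is never longer than the input. How much it shrinks depends on how often a key repeats within a chunk, which is the key-locality caveat above.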

  65. def swap[K, V](implicit ev: <:<[T, (K, V)]): TypedPipe[(V, K)]

    Swap the keys with the values.

  66. final def synchronized[T0](arg0: ⇒ T0): T0

    Definition Classes
    AnyRef
  67. def toIterableExecution: Execution[Iterable[T]]

    This gives an Execution that, when run, evaluates the TypedPipe, writes it to disk, and then gives you an Iterable that reads from disk on the submit node each time .iterator is called. Because of how Scala Iterables work, mapping/flatMapping/filtering the Iterable forces a read of the entire thing. If you need it to be lazy, call .iterator and use the Iterator inside instead.
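
    The strict-versus-lazy difference can be seen with standard collections alone. In this sketch a List stands in for the Iterable returned by the Execution (an assumption), and a counter records how many elements have been visited.

    ```scala
    object LazyReadSketch {
      // A List stands in for the Iterable handed back by the Execution.
      val data: Iterable[Int] = List(1, 2, 3)

      // Counts how many elements have been visited so far.
      var reads = 0

      // Mapping the Iterable is strict: every element is visited immediately.
      val strict: Iterable[Int] = data.map { x => reads += 1; x * 2 }
      val readsAfterStrictMap: Int = reads

      // Mapping the Iterator is lazy: nothing is visited until it is consumed.
      val lazyIt: Iterator[Int] = data.iterator.map { x => reads += 1; x * 2 }
      val readsAfterLazyMap: Int = reads

      // Pulling one element visits exactly one more item.
      val firstLazy: Int = lazyIt.next()
      val readsAfterOneNext: Int = reads
    }
    ```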

  68. final def toPipe[U >: T](fieldNames: Fields)(implicit flowDef: FlowDef, mode: Mode, setter: TupleSetter[U]): Pipe

    Export back to a raw cascading Pipe. Useful for interop with the scalding Fields API or with Cascading code. Avoid this if possible. Prefer to write to TypedSink.

  69. def toString(): String

    Definition Classes
    AnyRef → Any
  70. def unpackToPipe[U >: T](fieldNames: Fields)(implicit fd: FlowDef, mode: Mode, up: TupleUnpacker[U]): Pipe

    Use a TupleUnpacker to flatten U out into a cascading Tuple.

  71. def values[V](implicit ev: <:<[T, (Any, V)]): TypedPipe[V]

    Just keep the values, or ._2 (if this type is a Tuple2)

  72. final def wait(): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  73. final def wait(arg0: Long, arg1: Int): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  74. final def wait(arg0: Long): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  75. def withDescription(description: String): TypedPipe[T]

    Adds a description to the pipe.

  76. def withFilter(f: (T) ⇒ Boolean): TypedPipe[T]

  77. def write(dest: TypedSink[T])(implicit flowDef: FlowDef, mode: Mode): TypedPipe[T]

    Safely write to a TypedSink[T]. If you want to write to a Source (not a Sink) you need to do something like: toPipe(fieldNames).write(dest)

    Returns: a pipe equivalent to the current pipe.

  78. def writeExecution(dest: TypedSink[T]): Execution[Unit]

    This is the functionally pure approach to building jobs. Note that you have to call run on the result, or flatMap/zip it into an Execution that is run, for anything to happen here.

  79. def writeThrough[U >: T](dest: TypedSink[T] with TypedSource[U]): Execution[TypedPipe[U]]

    If you want to write to a specific location, and then read from that location going forward, use this.
