Trait/Object

com.twitter.scalding

JoinAlgorithms

Related Docs: object JoinAlgorithms | package scalding

Permalink

trait JoinAlgorithms extends AnyRef

Source
JoinAlgorithms.scala
Linear Supertypes
AnyRef, Any
Known Subclasses
Type Hierarchy
Ordering
  1. Alphabetic
  2. By Inheritance
Inherited
  1. JoinAlgorithms
  2. AnyRef
  3. Any
  1. Hide All
  2. Show All
Visibility
  1. Public
  2. All

Abstract Value Members

  1. abstract def pipe: Pipe

    Permalink

Concrete Value Members

  1. final def !=(arg0: Any): Boolean

    Permalink
    Definition Classes
    AnyRef → Any
  2. final def ##(): Int

    Permalink
    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean

    Permalink
    Definition Classes
    AnyRef → Any
  4. final def asInstanceOf[T0]: T0

    Permalink
    Definition Classes
    Any
  5. def blockJoinWithSmaller(fs: (Fields, Fields), otherPipe: Pipe, rightReplication: Int = 1, leftReplication: Int = 1, joiner: Joiner = new InnerJoin, reducers: Int = 1): Pipe

    Permalink

    Performs a block join, otherwise known as a replicate fragment join (RF join).

    Performs a block join, otherwise known as a replicate fragment join (RF join). The input params leftReplication and rightReplication control the replication of the left and right pipes respectively.

    This is useful in cases where the data has extreme skew. A symptom of this is that we may see a job stuck for a very long time on a small number of reducers.

    A block join is way to get around this: we add a random integer field and a replica field to every tuple in the left and right pipes. We then join on the original keys and on these new dummy fields. These dummy fields make it less likely that the skewed keys will be hashed to the same reducer.

    The final data size is right * rightReplication + left * leftReplication but because of the fragmentation, we are guaranteed the same number of hits as the original join.

    If the right pipe is really small then you are probably better off with a joinWithTiny. If however the right pipe is medium sized, then you are better off with a blockJoinWithSmaller, and a good rule of thumb is to set rightReplication = left.size / right.size and leftReplication = 1

    Finally, if both pipes are of similar size, e.g. in case of a self join with a high data skew, then it makes sense to set leftReplication and rightReplication to be approximately equal.

    Note

    You can only use an InnerJoin or a LeftJoin with a leftReplication of 1 (or a RightJoin with a rightReplication of 1) when doing a blockJoin.

  6. def clone(): AnyRef

    Permalink
    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  7. def coGroupBy(f: Fields, j: JoinMode = InnerJoinMode)(builder: (CoGroupBuilder) ⇒ GroupBuilder): Pipe

    Permalink

    This method is used internally to implement all joins.

    This method is used internally to implement all joins. You can use this directly if you want to implement something like a star join, e.g., when joining a single pipe to multiple other pipes. Make sure that you call this method on the larger pipe to make the grouping as efficient as possible.

    If you are only joining two pipes, then you are better off using joinWithSmaller/joinWithLarger/joinWithTiny/leftJoinWithTiny.

  8. def crossWithSmaller(p: Pipe, replication: Int = 20): Pipe

    Permalink

    Does a cross-product by doing a blockJoin.

    Does a cross-product by doing a blockJoin. Useful when doing a large cross, if your cluster can take it. Prefer crossWithTiny

  9. def crossWithTiny(tiny: Pipe): Pipe

    Permalink

    Doing a cross product with even a moderate sized pipe can create ENORMOUS output.

    WARNING

    Doing a cross product with even a moderate sized pipe can create ENORMOUS output. The use-case here is attaching a constant (e.g. a number or a dictionary or set) to each row in another pipe. A common use-case comes from a groupAll and reduction to one row, then you want to send the results back out to every element in a pipe

    This uses joinWithTiny, so tiny pipe is replicated to all Mappers. If it is large, this will blow up. Get it: be foolish here and LOSE IT ALL!

    Use at your own risk.

  10. final def eq(arg0: AnyRef): Boolean

    Permalink
    Definition Classes
    AnyRef
  11. def equals(arg0: Any): Boolean

    Permalink
    Definition Classes
    AnyRef → Any
  12. def finalize(): Unit

    Permalink
    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  13. final def getClass(): Class[_]

    Permalink
    Definition Classes
    AnyRef → Any
  14. def hashCode(): Int

    Permalink
    Definition Classes
    AnyRef → Any
  15. final def isInstanceOf[T0]: Boolean

    Permalink
    Definition Classes
    Any
  16. def joinWithLarger(fs: (Fields, Fields), that: Pipe, joiner: Joiner = new InnerJoin, reducers: Int = 1): Pipe

    Permalink

    same as reversing the order on joinWithSmaller

  17. def joinWithSmaller(fs: (Fields, Fields), that: Pipe, joiner: Joiner = new InnerJoin, reducers: Int = 1): Pipe

    Permalink

    Joins the first set of keys in the first pipe to the second set of keys in the second pipe.

    Joins the first set of keys in the first pipe to the second set of keys in the second pipe. All keys must be unique UNLESS it is an inner join, then duplicated join keys are allowed, but the second copy is deleted (as cascading does not allow duplicated field names).

    Smaller here means that the values/key is smaller than the left.

    Avoid going crazy adding more explicit join modes. Instead do for some other join mode with a larger pipe:

    .then { pipe => other.
              joinWithSmaller(('other1, 'other2)->('this1, 'this2), pipe, new FancyJoin)
          }
  18. def joinWithTiny(fs: (Fields, Fields), that: Pipe): Pipe

    Permalink

    This does an assymmetric join, using cascading's "HashJoin".

    This does an assymmetric join, using cascading's "HashJoin". This only runs through this pipe once, and keeps the right hand side pipe in memory (but is spillable).

    Choose this when Left > max(mappers,reducers) * Right, or when the left side is three orders of magnitude larger.

    joins the first set of keys in the first pipe to the second set of keys in the second pipe. Duplicated join keys are allowed, but the second copy is deleted (as cascading does not allow duplicated field names).

    Warning

    This does not work with outer joins, or right joins, only inner and left join versions are given.

  19. def joinerToJoinModes(j: Joiner): (Product with Serializable with JoinMode, Product with Serializable with JoinMode)

    Permalink
  20. def leftJoinWithLarger(fs: (Fields, Fields), that: Pipe, reducers: Int = 1): Pipe

    Permalink

    This is joinWithLarger with joiner parameter fixed to LeftJoin.

    This is joinWithLarger with joiner parameter fixed to LeftJoin. If the item is absent on the right put null for the keys and values

  21. def leftJoinWithSmaller(fs: (Fields, Fields), that: Pipe, reducers: Int = 1): Pipe

    Permalink

    This is joinWithSmaller with joiner parameter fixed to LeftJoin.

    This is joinWithSmaller with joiner parameter fixed to LeftJoin. If the item is absent on the right put null for the keys and values

  22. def leftJoinWithTiny(fs: (Fields, Fields), that: Pipe): HashJoin

    Permalink
  23. final def ne(arg0: AnyRef): Boolean

    Permalink
    Definition Classes
    AnyRef
  24. final def notify(): Unit

    Permalink
    Definition Classes
    AnyRef
  25. final def notifyAll(): Unit

    Permalink
    Definition Classes
    AnyRef
  26. def skewJoinWithLarger(fs: (Fields, Fields), otherPipe: Pipe, sampleRate: Double = 0.001, reducers: Int = 1, replicator: SkewReplication = SkewReplicationA()): Pipe

    Permalink
  27. def skewJoinWithSmaller(fs: (Fields, Fields), otherPipe: Pipe, sampleRate: Double = 0.001, reducers: Int = 1, replicator: SkewReplication = SkewReplicationA()): Pipe

    Permalink

    Performs a skewed join, which is useful when the data has extreme skew.

    Performs a skewed join, which is useful when the data has extreme skew.

    For example, imagine joining a pipe of Twitter's follow graph against a pipe of user genders, in order to find the gender distribution of the accounts every Twitter user follows. Since celebrities (e.g., Justin Bieber and Lady Gaga) have a much larger follower base than other users, and (under a standard join algorithm) all their followers get sent to the same reducer, the job will likely be stuck on a few reducers for a long time. A skewed join attempts to alleviate this problem.

    This works as follows:

    1. First, we sample from the left and right pipes with some small probability, in order to determine approximately how often each join key appears in each pipe. 2. We use these estimated counts to replicate the join keys, according to the given replication strategy. 3. Finally, we join the replicated pipes together.

    sampleRate

    This controls how often we sample from the left and right pipes when estimating key counts.

    replicator

    Algorithm for determining how much to replicate a join key in the left and right pipes. Note: since we do not set the replication counts, only inner joins are allowed. (Otherwise, replicated rows would stay replicated when there is no counterpart in the other pipe.)

  28. final def synchronized[T0](arg0: ⇒ T0): T0

    Permalink
    Definition Classes
    AnyRef
  29. def toString(): String

    Permalink
    Definition Classes
    AnyRef → Any
  30. final def wait(): Unit

    Permalink
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  31. final def wait(arg0: Long, arg1: Int): Unit

    Permalink
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  32. final def wait(arg0: Long): Unit

    Permalink
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )

Inherited from AnyRef

Inherited from Any

Ungrouped