Represents something than can be CoGrouped with another CoGroupable
Closures are difficult for serialization.
Closures are difficult for serialization. This class avoids that.
This encodes the rules that 1) sorting is only possible before doing any reduce, 2) reversing is only possible after sorting.
This encodes the rules that 1) sorting is only possible before doing any reduce, 2) reversing is only possible after sorting. 3) unsorted Groups can be CoGrouped or HashJoined
This may appear a complex type, but it makes sure that code won't compile if it breaks the rule
Used for objects that may have a description set to be used in .dot and MR step names.
used for types that may know how many reducers they need e.g.
used for types that may know how many reducers they need e.g. CoGrouped, Grouped, SortedGrouped, UnsortedGrouped
If we can HashJoin, then we can CoGroup, but not vice-versa i.e., HashJoinable is a strict subset of CoGroupable (CoGrouped, for instance is CoGroupable, but not HashJoinable).
Only intended to be use to implement the hashCogroup on TypedPipe/Grouped
Creates a TypedPipe from an Iterable[T].
Creates a TypedPipe from an Iterable[T]. Prefer TypedPipe.from.
If you avoid toPipe, this class is more efficient than IterableSource.
This is for the case where you don't want to expose any structure but the ability to operate on an iterator of the values
Represents sharded lists of items of type T There are exactly two fundamental operations: toTypedPipe: marks the end of the grouped-on-key operations.
Represents sharded lists of items of type T There are exactly two fundamental operations: toTypedPipe: marks the end of the grouped-on-key operations. mapValueStream: further transforms all values, in order, one at a time, with a function from Iterator to another Iterator
Represents anything that starts as a TypedPipe of Key Value, where the value type has been erased.
Represents anything that starts as a TypedPipe of Key Value, where the value type has been erased. Acts as proof that the K in the tuple has an Ordering
This class is for the syntax enrichment enabling .joinBy on TypedPipes.
This class is for the syntax enrichment enabling .joinBy on TypedPipes. To access this, do import Syntax.joinOnMappablePipe
used for types that must know how many reducers they need e.g.
used for types that must know how many reducers they need e.g. Sketched
This type is used to implement .andThen on a function in a way that will never blow up the stack.
This type is used to implement .andThen on a function in a way that will never blow up the stack. This is done to prevent deep scalding TypedPipe pipelines from blowing the stack
This may be slow, but is used in scalding at planning time
Trait to assist with creating partitioned sources.
Trait to assist with creating partitioned sources.
Apart from the abstract members below, hdfsScheme
and localScheme
also need to be set.
Note that for both of them the sink fields need to be set to only include the actual fields
that should be written to file and not the partition fields.
Trait to assist with creating objects such as PartitionedTsv to read from separated files.
Trait to assist with creating objects such as PartitionedTsv to read from separated files. Override separator, skipHeader, writeHeader as needed.
Scalding source to read or write partitioned delimited text.
Scalding source to read or write partitioned delimited text.
For writing it expects a pair of (P, T)
, where P
is the data used for partitioning and
T
is the output to write out. Below is an example.
val data = List( (("a", "x"), ("i", 1)), (("a", "y"), ("j", 2)), (("b", "z"), ("k", 3)) ) IterablePipe(data, flowDef, mode) .write(PartitionedDelimited[(String, String), (String, Int)](args("out"), "col1=%s/col2=%s"))
For reading it produces a pair (P, T)
where P
is the partition data and T
is data in the
files. Below is an example.
val in: TypedPipe[((String, String), (String, Int))] = PartitionedDelimited[(String, String), (String, Int)](args("in"), "col1=%s/col2=%s")
Scalding source to read or write partitioned text.
Scalding source to read or write partitioned text.
For writing it expects a pair of (P, String)
, where P
is the data used for partitioning and
String
is the output to write out. Below is an example.
val data = List( (("a", "x"), "line1"), (("a", "y"), "line2"), (("b", "z"), "line3") ) IterablePipe(data, flowDef, mode) .write(PartitionTextLine[(String, String)](args("out"), "col1=%s/col2=%s"))
For reading it produces a pair (P, (Long, String))
where P
is the partition data, Long
is the offset into the file and String
is a line from the file. Below is an example.
val in: TypedPipe[((String, String), (Long, String))] = PartitionTextLine[(String, String)](args("in"), "col1=%s/col2=%s")
Base path of the partitioned directory
Template for the partitioned path
Text encoding of the file content
This is a class that models the logical portion of the reduce step.
This is a class that models the logical portion of the reduce step. details like where this occurs, the number of reducers, etc... are left in the Grouped class
This class is generally only created by users with the TypedPipe.sketch method
All sorting methods defined here trigger Hadoop secondary sort on key + value.
All sorting methods defined here trigger Hadoop secondary sort on key + value. Hadoop secondary sort is external sorting. i.e. it won't materialize all values of each key in memory on the reducer.
After sorting, we are no longer CoGroupable, and we can only call reverse in the initial SortedGrouped created from the Sortable: .sortBy(_._2).reverse for instance
After sorting, we are no longer CoGroupable, and we can only call reverse in the initial SortedGrouped created from the Sortable: .sortBy(_._2).reverse for instance
Once we have sorted, we cannot do a HashJoin or a CoGrouping
Creates a partition using the given template string.
Creates a partition using the given template string.
The template string needs to have %s as placeholder for a given field.
Think of a TypedPipe as a distributed unordered list that may or may not yet have been materialized in memory or disk.
Think of a TypedPipe as a distributed unordered list that may or may not yet have been materialized in memory or disk.
Represents a phase in a distributed computation on an input data source Wraps a cascading Pipe object, and holds the transformation done up until that point
This is a TypedPipe that delays having access to the FlowDef and Mode until toPipe is called
This is an instance of a TypedPipe that wraps a cascading Pipe
Opposite of TypedSource, used for writing into
This is the state after we have done some reducing.
This is the state after we have done some reducing. It is not possible to sort at this phase, but it is possible to do a CoGrouping or a HashJoin.
ValuePipe is special case of a TypedPipe of just a optional single element.
ValuePipe is special case of a TypedPipe of just a optional single element. It is like a distribute Option type It allows to perform scalar based operations on pipes like normalization.
Used for objects that may _set_ a description to be used in .dot and MR step names.
used for objects that may _set_ how many reducers they need e.g.
used for objects that may _set_ how many reducers they need e.g. CoGrouped, Grouped, SortedGrouped, UnsortedGrouped
Extension for TypedPipe to add a cumulativeSum method.
Extension for TypedPipe to add a cumulativeSum method. Given a TypedPipe with T = (GroupField, (SortField, SummableField)) cumulaitiveSum will return a SortedGrouped with the SummableField accumulated according to the sort field. eg: ('San Francisco', (100, 100)), ('San Francisco', (101, 50)), ('San Francisco', (200, 200)), ('Vancouver', (100, 50)), ('Vancouver', (101, 300)), ('Vancouver', (200, 100)) becomes ('San Francisco', (100, 100)), ('San Francisco', (101, 150)), ('San Francisco', (200, 300)), ('Vancouver', (100, 50)), ('Vancouver', (101, 350)), ('Vancouver', (200, 450))
If you provide cumulativeSum a partition function you get the same result but you allow for more than one reducer per group. This is useful for when you have a single group that has a very large number of entries. For example in the previous example if you gave a partition function of the form { _ / 100 } then you would never have any one reducer deal with more than 2 entries.
This object is the EmptyTypedPipe.
This object is the EmptyTypedPipe. Prefer to create it with TypedPipe.empty
Autogenerated methods for flattening the nested value tuples that result after joining many pipes together.
Autogenerated methods for flattening the nested value tuples that result after joining many pipes together. These methods can be used directly, or via the the joins available in MultiJoin.
lookupJoin simulates the behavior of a realtime system attempting to leftJoin (K, V) pairs against some other value type (JoinedV) by performing realtime lookups on a key-value Store.
lookupJoin simulates the behavior of a realtime system attempting to leftJoin (K, V) pairs against some other value type (JoinedV) by performing realtime lookups on a key-value Store.
An example would join (K, V) pairs of (URL, Username) against a service of (URL, ImpressionCount). The result of this join would be a pipe of (ShortenedURL, (Username, Option[ImpressionCount])).
To simulate this behavior, lookupJoin accepts pipes of key-value pairs with an explicit time value T attached. T must have some sensible ordering. The semantics are, if one were to hit the right pipe's simulated realtime service at any time between T(tuple) T(tuple + 1), one would receive Some((K, JoinedV)(tuple)).
The entries in the left pipe's tuples have the following meaning:
T: The time at which the (K, W) lookup occurred. K: the join key. W: the current value for the join key.
The right pipe's entries have the following meaning:
T: The time at which the "service" was fed an update K: the join K. V: value of the key at time T
Before the time T in the right pipe's very first entry, the simulated "service" will return None. After this time T, the right side will return None only if the key is absent, else, the service will return Some(joinedV).
This is an autogenerated object which gives you easy access to doing N-way joins so the types are cleaner.
This is an autogenerated object which gives you easy access to doing N-way joins so the types are cleaner. However, it just calls the underlying methods on CoGroupable and flattens the resulting tuple
Utility functions to assist with creating partitioned sourced.
Partitioned typed commma separated source.
Partitioned typed \1
separated source (commonly used by Pig).
Partitioned typed pipe separated source.
Partitioned typed tab separated source.
These are named syntax extensions that users can optionally import.
These are named syntax extensions that users can optionally import. Avoid import Syntax._
implicits for the type-safe DSL import TDsl._ to get the implicit conversions from Grouping/CoGrouping to Pipe, to get the .toTypedPipe method on standard cascading Pipes.
implicits for the type-safe DSL import TDsl._ to get the implicit conversions from Grouping/CoGrouping to Pipe, to get the .toTypedPipe method on standard cascading Pipes. to get automatic conversion of Mappable[T] to TypedPipe[T]
factory methods for TypedPipe, which is the typed representation of distributed lists in scalding.
factory methods for TypedPipe, which is the typed representation of distributed lists in scalding. This object is here rather than in the typed package because a lot of code was written using the functions in the object, which we do not see how to hide with package object tricks.
Some methods for comparing two typed pipes and finding out the difference between them.
Some methods for comparing two typed pipes and finding out the difference between them.
Has support for the normal case where the typed pipes are pipes of objects usable as keys in scalding (have an ordering, proper equals and hashCode), as well as some special cases for dealing with Arrays and thrift objects.
See diffByHashCode for comparing typed pipes of objects that have no ordering but a stable hash code (such as Scrooge thrift).
See diffByGroup for comparing typed pipes of objects that have no ordering *and* an unstable hash code.
This is an implementation detail (and should be marked private)