Package

com.twitter.scalding

filecache

package filecache

Type Members

  1. sealed abstract class CachedFile extends AnyRef

  2. final case class HadoopCachedFile extends CachedFile with Product with Serializable

  3. final case class LocallyCachedFile extends CachedFile with Product with Serializable

  4. final case class UncachedFile extends Product with Serializable

Value Members

  1. object DistributedCacheFile

    The distributed cache is Hadoop's mechanism for giving each node local access to a specific file. The file must be registered with the job's Configuration during job setup, not from within a mapper or reducer. Additionally, the node-local access path must have a unique name to prevent collisions in the cluster. This object provides both.

    In the configuration phase, the file URI is used to construct an UncachedFile instance. The name of the symlink to use on the mappers is available only after calling the add() method, which registers the file, computes the unique symlink name, and returns a CachedFile instance. The CachedFile instance is Serializable; it is designed to be assigned to a val and accessed later.

    The local symlink is available through .file or .path, depending on which type you need.

    Example:

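    import com.twitter.scalding.{Args, Job}
    import com.twitter.scalding.filecache.DistributedCacheFile

    class YourJob(args: Args) extends Job(args) {
      // Register the file while the job is being configured.
      val theCachedFile = DistributedCacheFile("hdfs://ur-namenode/path/to/your/file.txt")

      // doSomethingWith is a placeholder for your own code.
      def somethingThatUsesTheCachedFile(): Unit = {
        doSomethingWith(theCachedFile.path) // or theCachedFile.file
      }
    }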
  2. object URIHasher
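
The DistributedCacheFile description above requires a unique node-local name to prevent collisions in the cluster. The following is a minimal sketch of one way such a name could be derived, not Scalding's implementation: it assumes an MD5 hash of the URI's string form, and `SymlinkNameSketch`, `hexHash`, and `symlinkNameFor` are hypothetical names introduced for illustration. The library's own hash function and naming scheme may differ.

```scala
import java.net.URI
import java.security.MessageDigest

// Hypothetical helper, not part of Scalding: derives a link name that is
// stable for a given URI and differs between URIs with the same file name.
object SymlinkNameSketch {
  // Hex-encode the MD5 digest of the URI's string form (assumed hash choice).
  def hexHash(uri: URI): String =
    MessageDigest.getInstance("MD5")
      .digest(uri.toString.getBytes("UTF-8"))
      .map(b => f"${b & 0xff}%02x")
      .mkString

  // Keep the base file name for readability and append the hash for uniqueness.
  // Assumes a hierarchical URI whose path ends in a file name.
  def symlinkNameFor(uri: URI): String = {
    val fileName = uri.getPath.split('/').last
    s"$fileName-${hexHash(uri)}"
  }

  def main(args: Array[String]): Unit = {
    val first  = symlinkNameFor(new URI("hdfs://ur-namenode/path/to/your/file.txt"))
    val second = symlinkNameFor(new URI("hdfs://ur-namenode/other/path/file.txt"))
    // Both names start with "file.txt-" but carry different hash suffixes.
    println(first)
    println(second)
  }
}
```

Because the hash covers the whole URI, two files that share a base name but live at different paths get distinct link names, while the same URI always maps to the same name.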
