com.redhat.et.silex.sample.split

SplitSampleRDDFunctions

class SplitSampleRDDFunctions[T] extends Serializable

Enhances RDDs with methods for split-sampling

T

The row type of the RDD

// import conversions to enhance RDDs with split sampling
import com.redhat.et.silex.sample.split.implicits._

// obtain a sequence of 5 RDDs randomly split from RDD 'data', where each element
// has probability 1/5 of being assigned to each output.
val splits = data.splitSample(5)

// randomly split data with relative weights 1:2:3, so that the three outputs receive
// each data element with probabilities 1/6, 2/6 and 3/6 respectively.
val splitsW = data.weightedSplitSample(Seq(1.0, 2.0, 3.0))

Linear Supertypes
Serializable, java.io.Serializable, AnyRef, Any

Instance Constructors

  1. new SplitSampleRDDFunctions(self: RDD[T])(implicit arg0: ClassTag[T])

Value Members

  1. final def !=(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  2. final def !=(arg0: Any): Boolean

    Definition Classes
    Any
  3. final def ##(): Int

    Definition Classes
    AnyRef → Any
  4. final def ==(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  5. final def ==(arg0: Any): Boolean

    Definition Classes
    Any
  6. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  7. def clone(): AnyRef

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  8. final def eq(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  9. def equals(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  10. def finalize(): Unit

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  11. final def getClass(): Class[_]

    Definition Classes
    AnyRef → Any
  12. def hashCode(): Int

    Definition Classes
    AnyRef → Any
  13. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  14. final def ne(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  15. final def notify(): Unit

    Definition Classes
    AnyRef
  16. final def notifyAll(): Unit

    Definition Classes
    AnyRef
  17. def splitSample(n: Int, persist: StorageLevel = defaultSL, seed: Long = scala.util.Random.nextLong): Seq[RDD[T]]

    Split an RDD into n random subsets, where each row is assigned to an output with equal probability 1/n.

    n

    The number of output RDDs to split into

    persist

    The storage level to use for persisting the intermediate result.

    seed

    A random seed to use for sampling. The seed is modified deterministically by partition id. If omitted, a fresh random seed is drawn on each call.
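
To make the equal-probability assignment concrete, the following is a minimal sketch in plain Scala collections, without Spark. It is not the library's implementation: `uniformSplit` is a hypothetical helper introduced only for illustration, and it uses a single seeded generator rather than a seed adjusted per partition.

```scala
import scala.util.Random

// Hypothetical helper, not part of silex: assigns each element to one of n
// buckets, each with probability 1/n, using a single seeded generator.
def uniformSplit[A](data: Seq[A], n: Int, seed: Long): Seq[Seq[A]] = {
  require(n > 0, "n must be positive")
  val rng = new Random(seed)
  // Tag every element with a bucket index drawn uniformly from 0 until n.
  val tagged = data.map(x => (rng.nextInt(n), x))
  // Collect the elements of each bucket, preserving input order.
  (0 until n).map(j => tagged.filter(_._1 == j).map(_._2))
}

// Every input element lands in exactly one of the 5 outputs.
val parts = uniformSplit(1 to 1000, 5, seed = 42L)
```

Because the generator is seeded, calling the sketch again with the same arguments yields the same split, which mirrors the purpose of passing an explicit `seed` to `splitSample`.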

  18. final def synchronized[T0](arg0: ⇒ T0): T0

    Definition Classes
    AnyRef
  19. def toString(): String

    Definition Classes
    AnyRef → Any
  20. final def wait(): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  21. final def wait(arg0: Long, arg1: Int): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  22. final def wait(arg0: Long): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  23. def weightedSplitSample(weights: Seq[Double], persist: StorageLevel = defaultSL, seed: Long = scala.util.Random.nextLong): Seq[RDD[T]]

    Split an RDD into weighted random subsets, where each row is assigned to an output (j) with probability proportional to the corresponding jth weight.

    weights

    A sequence of weights that determine the relative probabilities of sampling into the corresponding RDD outputs. Weights are normalized so that they sum to 1. Each weight must be strictly greater than 0.

    persist

    The storage level to use for persisting the intermediate result.

    seed

    A random seed to use for sampling. The seed is modified deterministically by partition id. If omitted, a fresh random seed is drawn on each call.
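
The weight normalization described above can be sketched in plain Scala collections, without Spark. This is an illustration under stated assumptions, not the library's implementation: `cumulativeBounds` and `weightedSplit` are hypothetical helpers, and assigning by cumulative probability intervals is one common way to realize weighted assignment.

```scala
import scala.util.Random

// Hypothetical helper, not part of silex: normalizes weights to sum to 1 and
// returns the cumulative upper bound of each output's probability interval.
def cumulativeBounds(weights: Seq[Double]): Seq[Double] = {
  require(weights.nonEmpty && weights.forall(_ > 0.0), "weights must be strictly > 0")
  val total = weights.sum
  weights.map(_ / total).scanLeft(0.0)(_ + _).tail
}

// Hypothetical helper: draws one uniform number per element and assigns the
// element to the first output whose cumulative bound exceeds the draw.
def weightedSplit[A](data: Seq[A], weights: Seq[Double], seed: Long): Seq[Seq[A]] = {
  val bounds = cumulativeBounds(weights)
  val rng = new Random(seed)
  val tagged = data.map { x =>
    val u = rng.nextDouble()
    val j = bounds.indexWhere(u < _)
    // Guard against rounding that leaves the last bound slightly below 1.0.
    (if (j >= 0) j else bounds.length - 1, x)
  }
  weights.indices.map(j => tagged.filter(_._1 == j).map(_._2))
}

// Weights 1:2:3 normalize to 1/6, 2/6, 3/6, giving bounds near 1/6, 1/2, 1.
val exampleBounds = cumulativeBounds(Seq(1.0, 2.0, 3.0))
val partsW = weightedSplit(1 to 6000, Seq(1.0, 2.0, 3.0), seed = 7L)
```

With weights `Seq(1.0, 2.0, 3.0)`, the third output is expected to receive about half of the elements and the first about one sixth, though the exact counts vary with the seed.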
