ApproximateQuantiles.ApproximateQuantilesCombineFn (Google Cloud Dataflow SDK 1.9.1 API)

Google Cloud Dataflow SDK for Java, version 1.9.1

com.google.cloud.dataflow.sdk.transforms

Class ApproximateQuantiles.ApproximateQuantilesCombineFn<T,ComparatorT extends Comparator<T> & Serializable>

  • Type Parameters:
    T - the type of the values being combined
    All Implemented Interfaces:
    CombineFnBase.GlobalCombineFn<T,com.google.cloud.dataflow.sdk.transforms.ApproximateQuantiles.QuantileState<T,ComparatorT>,List<T>>, HasDisplayData, Serializable
    Enclosing class:
    ApproximateQuantiles


    public static class ApproximateQuantiles.ApproximateQuantilesCombineFn<T,ComparatorT extends Comparator<T> & Serializable>
    extends Combine.AccumulatingCombineFn<T,com.google.cloud.dataflow.sdk.transforms.ApproximateQuantiles.QuantileState<T,ComparatorT>,List<T>>
    The ApproximateQuantilesCombineFn combiner gives an idea of the distribution of a collection of values using approximate N-tiles. The output of this combiner is a List of size numQuantiles, containing the input values' minimum value, numQuantiles-2 intermediate values, and maximum value, in sorted order, so for traditional N-tiles, one should use ApproximateQuantilesCombineFn#create(N+1).

    If there are fewer values to combine than numQuantiles, then the result List will contain all the values being combined, in sorted order.

    Values are ordered using either a specified Comparator or the values' natural ordering.

    To evaluate the quantiles we use the "New Algorithm" described here:

       [MRL98] Manku, Rajagopalan & Lindsay, "Approximate Medians and other
       Quantiles in One Pass and with Limited Memory", Proc. 1998 ACM
       SIGMOD, Vol 27, No 2, p 426-435, June 1998.
       http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.6.6513&rep=rep1&type=pdf
     

    The default error bound is 1 / N, though in practice the accuracy tends to be much better. See create(int, Comparator, long, double) for more information about the meaning of epsilon, and withEpsilon(double) for a convenient way to adjust it.

    See Also:
    Serialized Form
    • Field Detail

      • DEFAULT_MAX_NUM_ELEMENTS

        public static final long DEFAULT_MAX_NUM_ELEMENTS
        The cost (in time and space) to compute quantiles to a given accuracy is a function of the total number of elements in the data set. If an estimate is not known or specified, we use this as an upper bound. If this is too low, errors may exceed the requested tolerance; if too high, efficiency may be non-optimal. The impact is logarithmic with respect to this value, so this default should be fine for most uses.
        See Also:
        Constant Field Values
    • Method Detail

      • create

        public static <T,ComparatorT extends Comparator<T> & SerializableApproximateQuantiles.ApproximateQuantilesCombineFn<T,ComparatorT> create(int numQuantiles,
                                                                                                                                                    ComparatorT compareFn)
        Returns an approximate quantiles combiner with the given compareFn and desired number of quantiles. A total of numQuantiles elements will appear in the output list, including the minimum and maximum.

        The Comparator must be Serializable.

        The default error bound is 1 / numQuantiles, which holds as long as the number of elements is less than DEFAULT_MAX_NUM_ELEMENTS.

      • create

        public static <T,ComparatorT extends Comparator<T> & SerializableApproximateQuantiles.ApproximateQuantilesCombineFn<T,ComparatorT> create(int numQuantiles,
                                                                                                                                                    ComparatorT compareFn,
                                                                                                                                                    long maxNumElements,
                                                                                                                                                    double epsilon)
        Creates an approximate quantiles combiner with the given compareFn and desired number of quantiles. A total of numQuantiles elements will appear in the output list, including the minimum and maximum.

        The Comparator must be Serializable.

        The default error bound is epsilon, which holds as long as the number of elements is less than maxNumElements. Specifically, if one considers the input as a sorted list x_1, ..., x_N, then the distance between the each exact quantile x_c and its approximation x_c' is bounded by |c - c'| < epsilon * N. Note that these errors are worst-case scenarios; in practice the accuracy tends to be much better.

      • getAccumulatorCoder

        public Coder<com.google.cloud.dataflow.sdk.transforms.ApproximateQuantiles.QuantileState<T,ComparatorT>> getAccumulatorCoder(CoderRegistry registry,
                                                                                                                                     Coder<T> elementCoder)
        Description copied from interface: CombineFnBase.GlobalCombineFn
        Returns the Coder to use for accumulator AccumT values, or null if it is not able to be inferred.

        By default, uses the knowledge of the Coder being used for InputT values and the enclosing Pipeline's CoderRegistry to try to infer the Coder for AccumT values.

        This is the Coder used to send data through a communication-intensive shuffle step, so a compact and efficient representation may have significant performance benefits.

        Specified by:
        getAccumulatorCoder in interface CombineFnBase.GlobalCombineFn<T,com.google.cloud.dataflow.sdk.transforms.ApproximateQuantiles.QuantileState<T,ComparatorT extends Comparator<T> & Serializable>,List<T>>
      • populateDisplayData

        public void populateDisplayData(DisplayData.Builder builder)
        Register display data for the given transform or component.

        populateDisplayData(DisplayData.Builder) is invoked by Pipeline runners to collect display data via DisplayData.from(HasDisplayData). Implementations may call super.populateDisplayData(builder) in order to register display data in the current namespace, but should otherwise use subcomponent.populateDisplayData(builder) to use the namespace of the subcomponent.

        By default, does not register any display data. Implementors may override this method to provide their own display data.

        Specified by:
        populateDisplayData in interface HasDisplayData
        Parameters:
        builder - The builder to populate with display data.
        See Also:
        HasDisplayData


Was this page helpful? Let us know how we did:

Send feedback about...

Cloud Dataflow