Which Partitioning Method Should We Use for the Aggregator Stage in Parallel Jobs?

The Aggregator stage is a processing stage in DataStage that is used for grouping and summary operations. By default, the Aggregator stage executes in parallel mode in parallel jobs. This article discusses which partitioning method to use for the Aggregator stage in parallel jobs.

In a parallel environment, the way the data is partitioned before grouping and summarizing affects the results. If you partition the data using the round-robin method, records with the same key values are distributed across different partitions. Each partition then produces its own partial result for the same group, which gives incorrect results.
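The effect can be reproduced outside DataStage with a small simulation. The sketch below is plain Python, not DataStage code; the function names, the sample rows, and the use of `zlib.crc32` as the hash function are all assumptions made for illustration. Each partition is aggregated independently and the outputs are simply concatenated, which is roughly what happens when a stage runs in parallel.

```python
from collections import defaultdict
import zlib

def round_robin_partition(rows, n):
    """Deal rows out to n partitions in turn, ignoring key values."""
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts

def hash_partition(rows, n, key):
    """Send every row with the same key value to the same partition."""
    parts = [[] for _ in range(n)]
    for row in rows:
        # crc32 is used only because it is stable across runs (illustrative choice)
        idx = zlib.crc32(str(row[key]).encode()) % n
        parts[idx].append(row)
    return parts

def aggregate_partition(rows, key, value):
    """Sum `value` per `key` within a single partition."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)

def run(parts, key, value):
    """Aggregate each partition independently and concatenate the output rows."""
    out = []
    for part in parts:
        out.extend(sorted(aggregate_partition(part, key, value).items()))
    return out

# Hypothetical sample data: group key "dept", measure "amt"
rows = [
    {"dept": "A", "amt": 10},
    {"dept": "A", "amt": 30},
    {"dept": "B", "amt": 20},
    {"dept": "B", "amt": 40},
    {"dept": "A", "amt": 50},
]

rr_result = run(round_robin_partition(rows, 2), "dept", "amt")
hash_result = run(hash_partition(rows, 2, "dept"), "dept", "amt")

print("round-robin:", rr_result)    # each dept appears once per partition
print("hash:       ", hash_result)  # each dept appears exactly once
```

With round-robin partitioning, each department shows up twice with partial sums. With hash partitioning on the grouping key, each department shows up once with its full total.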

Aggregation Method:

There are two aggregation methods:

Hash: Use hash mode for a relatively small number of groups; generally, fewer than about 1000 groups per megabyte of memory.

Sort: Sort mode requires the input data set to have been partition sorted with all of the grouping keys specified as hashing and sorting keys. Unlike the Hash Aggregator, the Sort Aggregator requires presorted data, but only maintains the calculations for the current group in memory.
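The difference in memory use between the two methods can be sketched in plain Python. This is a conceptual illustration only, not how DataStage implements either method; the function names and sample rows are hypothetical. The hash variant keeps a running total for every group at once, while the sort variant walks over input that is already sorted on the key and holds only the current group.

```python
from itertools import groupby
from operator import itemgetter

def hash_aggregate(rows):
    """Hash-style: one running total per group is held in memory at once."""
    totals = {}
    for key, value in rows:
        totals[key] = totals.get(key, 0) + value
    return sorted(totals.items())

def sort_aggregate(sorted_rows):
    """Sort-style: input must be sorted on the key; only the current
    group's total is held, and a result is emitted when the key changes."""
    for key, group in groupby(sorted_rows, key=itemgetter(0)):
        yield key, sum(value for _, value in group)

# Hypothetical (key, value) rows in arbitrary arrival order
rows = [("B", 20), ("A", 10), ("B", 40), ("A", 30)]

hash_out = hash_aggregate(rows)
sort_out = list(sort_aggregate(sorted(rows, key=itemgetter(0))))
unsorted_out = list(sort_aggregate(rows))  # same logic fed unsorted input

print(hash_out)
print(sort_out)
print(unsorted_out)
```

Feeding unsorted rows to the sort-style function splits each group into several output rows, which is why the Sort method depends on presorted input.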

By default, the output column of an Aggregator stage calculation has the double data type. If you want decimal output, add the following property, as shown in the figure below.
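One reason to prefer decimal output for sums of monetary or other fixed-scale values is that a double is a binary floating-point number. The sketch below uses Python's `float` and `decimal.Decimal` as stand-ins for the two data types; it does not reproduce DataStage's own arithmetic, and the sample values are made up.

```python
from decimal import Decimal

# Ten hypothetical amounts of 0.10 each
float_total = sum([0.1] * 10)
decimal_total = sum([Decimal("0.10")] * 10, Decimal("0"))

print(float_total == 1.0)                # False: binary rounding error accumulates
print(decimal_total == Decimal("1.00"))  # True: decimal arithmetic is exact here
```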

If you use a single key column as the grouping key, there is no need to sort or hash partition the incoming data.

Important Notes to Consider

Choose a partitioning method that keeps the number of rows per partition close to equal. This balances the processing workload across partitions and thereby improves the overall run time.
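A simple way to put a number on "close to equal" is the ratio of the largest partition to the average partition size. The sketch below is a hypothetical helper in Python, not a DataStage feature, and the row counts are invented. A ratio of 1.0 means perfectly balanced partitions; a larger ratio means one partition has that many times the average amount of work.

```python
def partition_skew(partition_sizes):
    """Return max partition size divided by mean partition size."""
    mean = sum(partition_sizes) / len(partition_sizes)
    return max(partition_sizes) / mean

# Hypothetical row counts per partition for a 4-node configuration
balanced = [250, 250, 250, 250]
skewed = [700, 100, 100, 100]

balanced_skew = partition_skew(balanced)
skewed_skew = partition_skew(skewed)

print(balanced_skew)  # 1.0
print(skewed_skew)    # the largest partition holds 2.8 times the average
```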

Any stage that processes groups of related records must have its input partitioned using a keyed partitioning method. Examples include the Aggregator, Remove Duplicates, Change Capture, Change Apply, Join, and Merge stages, as well as Transformers that process groups of related records.

Minimize repartitioning, because it degrades performance, unless the partition distribution is highly skewed. Repartitioning incurs network transport overhead, and it can also disturb an even distribution of data among partitions.

We hope this article has made clear which partitioning method to use for the Aggregator stage in DataStage.
