Approximate NDV based on hash algorithm in Oracle 11g

Date: 2011-09-15 04:39:58

Share to

Column statistics data is an important factor for CBO to estimate cost of execution plan. There are generally two kinds of statistics data on columns, one is the synopses, such as NDV, ACL and maximum/minimum data; the other is histogram. A single scanning table can get those synopses. To analyze the large tables with low cost, we normally sampling the data. However, sampling is a random process, which will cause large gap between statistics and actual data. Especially, for those huge tables, we normally adopt low sampling size, which will lead to unstable result of each analyzing. And those unstable result will result in a poor execution plan. Take below set of data as example,

    [0,1,1,1...(100*1)...1,2,3,4,5,6,7,8,9]

The actual NDV is 10. We may get below sample data according to randomness,

    [1...(10*1)...,2,6]

And sampling NDV we got is 3. Considering the sampling size, we will get NDV = 3/10*100 = 30. This result is quite different from the actual data.

To increase accuracy, we should increase sampling size. However, when calculate NDV, oracle should count distinct column data, which require the sorting algorithm that will consume PGA. For those large, PGA consumption will be to huge that data may be swapped to disk. Therefore, cause more serious performance issue.

in 11g, Oracle introduced a new algorithm based on hash to compute the NDV. It will not require additional PGA with scaning data size increased. Hence, it will be possible for oracle to analyze the statistics with full table scanning. In 11g, Oracle will scan full table if it's set to automatical sampling（AUTO_SAMPLE_SIZE）. It's also the default activity. We can turn it off by seting parameter APPROXIMATE_NDV.

SQL代码

HELLODBA.COM>exec dbms_stats.set_param('APPROXIMATE_NDV','FALSE');



PL/SQL procedure successfully completed.

Note: Oracle also adopts this algorithm to compute global statistics for partitioning table incrementally.

This algorithm utilized the balanced distribution of hash value. Herr is its basic procedure,

It convert a scanned data to a fixed-length binary hash data, and put it into a data structure which is called synopsis;

Scan the next data and get its binary hash data, compare it to the existing data in synopsis, discard it if it already exists, otherwise, insert it;

The size of synopsis is limited. Once it reached the limitation when inserting, oracle will split it basing on special rule. For example, discard those existing elements with leading 1. Correspodingly, the level of synopsis (start wiht 0) increase 1;

For those new scanned data, if it fit to the discarding rule, it wll not be inserted to synopsis;

For the next spliting, the rule is transitive. For example, discard data with leading 11.

Remaining elements number in synopsis is S, level of synopsis is I, then, the formular of NDV is

      NDV = S*2^I

With this algorithm, Oracle just keep one synopsis for each column, which means it will not consume addtional PGA along with increaing number of scaning data.

Here is a perl script to demostrate the new algorithm, you can download it at here.

SQL代码

D:\Documents\Hellodba.com>perl appr_ndv.pl 16 32 100

0 Scaning 8 (Hash: 0010000000000000) inserted.

1 Scaning 78 (Hash: 0011100000000010) inserted.

2 Scaning 80 (Hash: 0100000000000010) inserted.

...

36 Scaning 97 (Hash: 1000010000000011)  splitting (current level: 0) ( 0010000000000000 0011100000000010 010000000000001

0 1101010000000001 0001110000000000 1010000000000001 0000110000000010 0100010000000000 0000000000000000 0110000000000000

1100110000000001 0000110000000000 0010000000000010 1000100000000011 0110000000000010 0110110000000000 0101100000000010

1110000000000001 1000000000000011 0100110000000000 0101000000000010 0111000000000000 0111100000000000 0001110000000010 1

100100000000001 0101010000000000 0111010000000010 1011100000000001 1111110000000001 0101010000000010 1001000000000001 00

00000000000010 )

discarded.

...

99 Scaning 22 (Hash: 0101100000000000) discarded.

Approximated NDV:72( element num: 18, level: 2); Actual NDV:65

It gets the NDV with 72 computed by the new algorithm, while it's actual NDV is 65. They are so close.

--- Fuyuncat ---

Previous：Distinct Aggregate Transformation

Next：Oracle Sorting Algorithm

Return To this Category