HelloDBA [中文]
Search Internet Search HelloDBA
  Oracle Technology Site. email: fuyuncat@gmail.com  MSN: fuyuncat@hotmail.com Add to circle  acoug  acoug 

Approximate NDV based on hash algorithm in Oracle 11g

[中文]

Author:  fuyuncat

Source:  www.HelloDBA.com

Date:  2011-09-15 04:39:58

Share to  Twitter Facebook GMail Blogger Orkut Google Bookmarks Sina Weibo QQ Renren Tieba Kaixin Douban Taobao

Column statistics data is an important factor for CBO to estimate cost of execution plan. There are generally two kinds of statistics data on columns, one is the synopses, such as NDV, ACL and maximum/minimum data; the other is histogram. A single scanning table can get those synopses. To analyze the large tables with low cost, we normally sampling the data. However, sampling is a random process, which will cause large gap between statistics and actual data. Especially, for those huge tables, we normally adopt low sampling size, which will lead to unstable result of each analyzing. And those unstable result will result in a poor execution plan. Take below set of data as example,

    [0,1,1,1...(100*1)...1,2,3,4,5,6,7,8,9]

The actual NDV is 10. We may get below sample data according to randomness,

    [1...(10*1)...,2,6]

And sampling NDV we got is 3. Considering the sampling size, we will get NDV = 3/10*100 = 30. This result is quite different from the actual data.

To increase accuracy, we should increase sampling size. However, when calculate NDV, oracle should count distinct column data, which require the sorting algorithm that will consume PGA. For those large, PGA consumption will be to huge that data may be swapped to disk. Therefore, cause more serious performance issue.

in 11g, Oracle introduced a new algorithm based on hash to compute the NDV. It will not require additional PGA with scaning data size increased. Hence, it will be possible for oracle to analyze the statistics with full table scanning. In 11g, Oracle will scan full table if it's set to automatical sampling(AUTO_SAMPLE_SIZE). It's also the default activity. We can turn it off by seting parameter APPROXIMATE_NDV.

SQL代码
  1. HELLODBA.COM>exec dbms_stats.set_param('APPROXIMATE_NDV','FALSE');  
  2.   
  3. PL/SQL procedure successfully completed.  

Note: Oracle also adopts this algorithm to compute global statistics for partitioning table incrementally.

This algorithm utilized the balanced distribution of hash value. Herr is its basic procedure,

  • It convert a scanned data to a fixed-length binary hash data, and put it into a data structure which is called synopsis;
  • Scan the next data and get its binary hash data, compare it to the existing data in synopsis, discard it if it already exists, otherwise, insert it;
  • The size of synopsis is limited. Once it reached the limitation when inserting, oracle will split it basing on special rule. For example, discard those existing elements with leading 1. Correspodingly, the level of synopsis (start wiht 0) increase 1;
  • For those new scanned data, if it fit to the discarding rule, it wll not be inserted to synopsis;
  • For the next spliting, the rule is transitive. For example, discard data with leading 11.
  • Remaining elements number in synopsis is S, level of synopsis is I, then, the formular of NDV is

          NDV = S*2^I

With this algorithm, Oracle just keep one synopsis for each column, which means it will not consume addtional PGA along with increaing number of  scaning data.

Here is a perl script to demostrate the new algorithm, you can download it at here.

SQL代码
  1. D:\Documents\Hellodba.com>perl appr_ndv.pl 16 32 100  
  2. 0 Scaning 8 (Hash: 0010000000000000) inserted.  
  3. 1 Scaning 78 (Hash: 0011100000000010) inserted.  
  4. 2 Scaning 80 (Hash: 0100000000000010) inserted.  
  5. ...  
  6. 36 Scaning 97 (Hash: 1000010000000011)  splitting (current level: 0) ( 0010000000000000 0011100000000010 010000000000001  
  7. 0 1101010000000001 0001110000000000 1010000000000001 0000110000000010 0100010000000000 0000000000000000 0110000000000000  
  8.  1100110000000001 0000110000000000 0010000000000010 1000100000000011 0110000000000010 0110110000000000 0101100000000010  
  9. 1110000000000001 1000000000000011 0100110000000000 0101000000000010 0111000000000000 0111100000000000 0001110000000010 1  
  10. 100100000000001 0101010000000000 0111010000000010 1011100000000001 1111110000000001 0101010000000010 1001000000000001 00  
  11. 00000000000010 )  
  12. discarded.  
  13. ...  
  14. 99 Scaning 22 (Hash: 0101100000000000) discarded.  
  15. Approximated NDV:72( element num: 18, level: 2); Actual NDV:65  

It gets the NDV with 72 computed by the new algorithm, while it's actual NDV is 65. They are so close.

--- Fuyuncat ---

Top

Copyright ©2005, HelloDBA.Com All reseverd.

Declaration
by fuyuncat