[中文]
Author:
fuyuncat
Source:
www.HelloDBA.com
Date:
2009-08-01 15:03:34
3.4.11. Max filter column position (with mixed filter column data type)
When I using a mixed filter column data type in the test, I found the effect of MAXFILPOS in the formula is not linear, but changes is still regular:
date
NUMBER
MAXFILPOS
5
EFFCOLNUMS
5
FLTCOLNUM
resc_cpu
CPU_B
CPU_A
TYPFAC
1
387121440
7121440
380000000
380
2
402121440
7121440
395000000
395
3
402871440
7121440
395750000
395.75
4
402908940
7121440
395787500
395.7875
5
402910815
7121440
395789375
395.78938
date
NUMBER
MAXFILPOS
4
EFFCOLNUMS
5
FLTCOLNUM
resc_cpu
CPU_B
CPU_A
TYPFAC
1
368121440
7121440
361000000
361
19
2
382171440
7121440
375050000
375.05
19.95
3
382873940
7121440
375752500
375.7525
19.9975
4
382909060
7121440
375787620
375.78762
19.99988
date
NUMBER
MAXFILPOS
3
EFFCOLNUMS
5
FLTCOLNUM
resc_cpu
CPU_B
CPU_A
TYPFAC
1
349121440
7121440
342000000
342
19
2
362221440
7121440
355100000
355.1
19.95
3
362876440
7121440
355755000
355.755
19.9975
Observe the data, I finally get the rule:
F(MAXFILPOS)–F(MAXFILPOS-1) = 20–20^(1-FLTCOLNUM)
Put this equation into formula 12, we get the new formula:
Formula 13: TYPFAC = 130 + EFFCOLNUMS + 5*19 - (5-MAXFLTCOLPOS)*(20–20^(1-FLTCOLNUM)) + COLTYPEFAC1*(1/20)^0 + COLTYPEFAC2*(1/20)^1 + … + COLTYPEFACn*(1/20)^(FLTCOLNUM-1)
= 225 + EFFCOLNUMS - (5-MAXFLTCOLPOS)*(20–20^(1-FLTCOLNUM)) + COLTYPEFAC1*(1/20)^0 + COLTYPEFAC2*(1/20)^1 + … + COLTYPEFACn*(1/20)^(FLTCOLNUM-1)
= 225 + EFFCOLNUMS -100 + MAXFLTCOLPOS*20 + (5-MAXFLTCOLPOS)*20^(1-FLTCOLNUM) + COLTYPEFAC1*(1/20)^0 + COLTYPEFAC2*(1/20)^1 + … + COLTYPEFACn*(1/20)^(FLTCOLNUM-1)
= 125 + EFFCOLNUMS + MAXFLTCOLPOS*20 + (5-MAXFLTCOLPOS)*20^(1-FLTCOLNUM) + COLTYPEFAC1*(1/20)^0 + COLTYPEFAC2*(1/20)^1 + … + COLTYPEFACn*(1/20)^(FLTCOLNUM-1)
3.4.12. Effective column number (with mixed filter column data type)
Besides MAXFLTCOLPOS, I also found other factor violated the formula we ever got in the query with mixed filter column data type, which is EFFCOLNUM:
date
NUMBER
MAXFILPOS
3
EFFCOLNUMS
5
FLTCOLNUM
resc_cpu
CPU_B
CPU_A
TYPFAC
1
349121440
7121440
342000000
342
2
362221440
7121440
355100000
355.1
3
362876440
7121440
355755000
355.755
date
NUMBER
MAXFILPOS
3
EFFCOLNUMS
4
FLTCOLNUM
resc_cpu
CPU_B
CPU_A
TYPFAC
1
348121440
7121440
341000000
341
1
2
362171440
7121440
355050000
355.05
0.05
3
362873940
7121440
355752500
355.7525
0.0025
date
NUMBER
MAXFILPOS
3
EFFCOLNUMS
3
FLTCOLNUM
resc_cpu
CPU_B
CPU_A
TYPFAC
1
347121440
7121440
340000000
340
1
2
362121440
7121440
355000000
355
0.05
3
362871440
7121440
355750000
355.75
0.0025
Fortunately, the rule is not hard to be found:
F(EFFCOLNUMS)–F(EFFCOLNUMS-1) = 20^(1-FLTCOLNUM)
Modify formula 13 base on this rule:
Formula 14: TYPFAC = 130+(20*EFFCOLNUM-100)*(20^(-FILTCLNUM))+MAXFLTCOLPOS*20+(5-MAXFLTCOLPOS)*20^(1-FLTCOLNUM)+ COLTYPEFAC1*(1/20)^0 + COLTYPEFAC2*(1/20)^1 + … + COLTYPEFACn*(1/20)^(FLTCOLNUM-1)
= 130 + EFFCOLNUM*20^(1-FILTCOLNUM) - 5*20^(1-FILTCOLNUM) + MAXFLTCOLPOS*20 + (5-MAXFLTCOLPOS)*20^(1-FLTCOLNUM) + COLTYPEFAC1*(1/20)^0 + COLTYPEFAC2*(1/20)^1 + … + COLTYPEFACn*(1/20)^(FLTCOLNUM-1)
= 130 + EFFCOLNUM*20^(1-FILTCOLNUM) + MAXFLTCOLPOS*(20 - 20^(1-FLTCOLNUM))) + COLTYPEFAC1*(1/20)^0 + COLTYPEFAC2*(1/20)^1 + … + COLTYPEFACn*(1/20)^(FLTCOLNUM-1)
= 130 + EFFCOLNUM*20^(1-FILTCOLNUM) + MAXFLTCOLPOS*(20 - 20^(1-FLTCOLNUM))) + COLTYPEFAC1*(1/20)^0 + COLTYPEFAC2*(1/20)^1 + … + COLTYPEFACn*(1/20)^(FLTCOLNUM-1)
= 130 + EFFCOLNUM*20^(1-FILTCOLNUM) + MAXFLTCOLPOS*(20 - 20^(1-FLTCOLNUM))) + COLTYPEFAC1*(1/20)^0 + COLTYPEFAC2*(1/20)^1 + … + COLTYPEFACn*(1/20)^(FLTCOLNUM-1)
Consider formula 8-1, it can be transformed as:
Formula 15: TYPFAC = 130 + (EFFQRYCOLNUM + MAXFLTCOLPOS)*20^(1-FILTCOLNUM) + MAXFLTCOLPOS*(20 - 20^(1-FLTCOLNUM))) + COLTYPEFAC1*(1/20)^0 + COLTYPEFAC2*(1/20)^1 + … + COLTYPEFACn*(1/20)^(FLTCOLNUM-1)
= 130 + EFFQRYCOLNUM*(1/20)^(FILTCOLNUM-1) + MAXFLTCOLPOS*20+ COLTYPEFAC1*(1/20)^0 + COLTYPEFAC2*(1/20)^1 + … + COLTYPEFACn*(1/20)^(FLTCOLNUM-1)
3.4.13. Filter columns computing sequence (with mixed filter column data type)
We mentioned before, if the data types of 2 columns be changed in turn, it is difficult to see the rule similar to previous data. So I guess the optimizer will compute the filter columns in a special sequence (it’s the truth, can be controlled by hints, which is out of range of our article), to get the value of the power (the n in COLTYPEFAC2 * (1 / 20) ^ n) above.
A closer observation on the test data with mixed data types, we can still find many other rules (two columns in this example):
COL1
COL2
MAXFLTPOS
COST_CPU
varchar2
date
3
262221440
number
char
4
274671440
date
varchar2
5
302121440
char
varchar2
1
269671440
varchar2
number
2
294621440
… …
… …
Then put the above data into the formula 12, we found such a rule: when mix CHAR/VARCHAR2 with NUMBER/DATE, the CHAR/VARCHAR2 column has the smaller power value; when fix NUMBER with DATE, NUMBER column will have the smaller power value. Considering their COLTYPEFAC, we can draw such a assumption 9: Optimizer will calculate the AND filter columns in their COLTYPEFAC sequence, from small to big, increasing power value start from 1.
I adopted a more complex combination of mixed data type columns to prove this assumption:
COLA
COLB
COLC
COLD
COLE
MAXFLTPOS
COST_CPU
VARCAHR2
number
date
varchar2
date
5
290035815
Number
varchar2
date
date
varchar2
5
290035815
In these 2 copies of data, in spite of the physical location of the column, they have the same set of data types: (VARCHAR2, NUMBER, DATE, VARCHAR2, DATE). According to assumption 9, their uniform sequence is (DATE, DATE, NUMBER, VARCHAR2, VARCHAR2), therefore calculated the same TYPEFAC:
130 +20 * 5 +50 * (1 / 20) ^ 0 +50 * (1 / 20) ^ 1 +150 * (1 / 20) ^ 2 +300 * (1 / 20) ^ 3 +300 * (1 / 20) ^ 4 = 282.9144 = 282.914375
The results is same the results calculate from formula 5:
(290035815 - (4500/8192 +0.32) * 1000 * 81929) / 1000000 = 282.914375
Through other random data, we can find that the assumption 9 is valid.Note: In fact, this sequence rule is not the finally rule, it’s the one we can find under current circumstance. We will find the real rule when I involved more complex conditions.
3.4.14. Filter operator (=)
Prior to that, our filter operator is > or <. How about to change it to other operator? We first try on =.
3.4.14.1. Distinct value number of the column
Use the same method and process as > operator, it comes to the formula (here, I did not take query column whose position is greater than the max filter column position column into account, because it’s relatively complexity, I will analysis it in further later):
130 + MAXFLTCOLPOS * 20 + COLTYPEFAC1 * (1 / 1600) ^ 0 + COLTYPEFAC2 * (1 / 1600) ^ 1 + ... + COLTYPEFACn * (1 / 1600) ^ (FLTCOLNUM-1)
Compare to the formula of >, the obvious difference is filter expression in the sub-base, form 1/20 to 1/1600. By reference to the data of baseline table, the result is easily let me associate another test I eve did: column selectivity (see the article 查询计划中Cardinality的计算). In that test, there is a preliminary conclusion: when use bind variables, the selectivity of >, <, <= and >= is 1/20, and the selectivity of = is 1/NDV. Here 1600 is the uniform number of distinct values (NDV) of all of the columns.
By modifying the NDV, it proves that this assumption is valid. Formula can therefore be changed as:Formula 16: 130 + MAXFLTCOLPOS*20+ COLTYPEFAC1*(1/NDV)^0 + COLTYPEFAC2*(1/NDV)^1 + … + COLTYPEFACn*(1/NDV)^(FLTCOLNUM-1)
3.4.14.2. Mixed NDV
Because the all of the columns of the base table have been set to the same NDV, after come out the formula 16, I naturally think of another question: If the filters have the different NDV, is sub-expression also different? Still look at data with various columns NDV:
A:160
B:80
C:100
D:200
E:400
FILTER
CPU_B
CPU_A
TYPFAC
CACULATED
a = :v1 b = :v2
478996440
7121440
471.875
471.875
c = :v3 b = :v2
500121440
7121440
493
493
a = :v1 b = :v2 c = :v3
499015190
7121440
491.89375
491.89375
b = :v2 d = :v4 e = :v5
537875190
7121440
530.75375
530.75375
a = v1 b = :v2 c = :v4 e = :v5
537876174
7121440
530.75473
530.7547344
c = v3 b = :v2 d = :v4 e = :v5
537875228
7121440
530.75379
530.7537875
a = :v1 c = v3 b = :v2 d = :v4 e = :v5
537875214
7121440
530.75377
530.7537737
… …
From these data, the rule can be found: the effective NDV on each filter column is product of NDV of the columns prior to it. For example, under the assumption 9, the fifth data above: the effective NDV of the first column of is 1 (none previous column), the effective NDV of the second column is 400, the effective NDV of the third column is 400 * 200 ... ..., and its computing expression as follows:
30 + 5 * 20 + 300 * (1 / 1) + 300 * (1 / 400) + 300 * (1 / 400) * (1 / 200) + 300 * (1 / 400) * (1 / 200) * (1 / 160) + 300 * (1 / 400) * (1 / 200) * (1 / 160) * (1 / 100) = 530.7537875
According to this rule, the formula changes as follows:
Formula 16: 130 + MAXFLTCOLPOS*20+ COLTYPEFAC1*(1/1) + COLTYPEFAC2*(1/MAXNDV1) + … + COLTYPEFACn*(1/MAXNDV1)*…*(1/MAXNDVn-1)
Note: the formula can also be applied to <, >, while its NDV is 20.
3.4.14.3. Mixed data types
In the derivation before, we have confirmed columns have different data types, it will calculate the filter column by the sequence based on data type factors, from small to big. However, if there are the mixed data types with mixed NDV, what will be the result? Modify the data type of the base table above; get 2 sets of test data:
A:VARCHAR2
B:DATE
C:NUMBER
D:NUMBER
E:VARCHAR2
FILTER
CPU_B
CPU_A
TYPFAC
CACULATED
a = :v1 b = :v2
228996440
7121440
221.875
221.875
c = :v3 b = :v2
350121440
7121440
343
343
a = :v1 b = :v2 c = :v3
248077690
7121440
240.95625
240.95625
b = :v2 d = :v4 e = :v5
287500190
7121440
280.37875
280.37875
a = v1 b = :v2 c = :v4 e = :v5
287248831
7121440
280.12739
280.1273906
c = v3 b = :v2 d = :v4 e = :v5
287498353
7121440
280.37691
280.3769125
a = :v1 c = v3 b = :v2 d = :v4 e = :v5
287248796
7121440
280.12736
280.1273557
A:VARCHAR2
B:NUMBER
C:DATE
D:DATE
E:NUMBER
FILTER
CPU_B
CPU_A
TYPFAC
CACULATED
a = :v1 b = :v2
228058940
7121440
220.9375
220.9375
c = :v3 b = :v2
350871440
7121440
343.75
343.75
a = :v1 b = :v2 c = :v3
248082378
7121440
240.96094
240.9609375
b = :v2 d = :v4 e = :v5
387505815
7121440
380.38438
380.384375
a = v1 b = :v2 c = :v4 e = :v5
288061342
7121440
280.9399
280.9399023
c = v3 b = :v2 d = :v4 e = :v5
387505862
7121440
380.38442
380.3844219
a = :v1 c = v3 b = :v2 d = :v4 e = :v5
288061343
7121440
280.9399
280.9399026
From these data, a rule can be get: the compute sequence is first considering COLTYPEFAC, from small to large, and then in accordance with selectivity (1/NDV), from small to large, the effective NDV on each filter column is product of NDV of the columns prior to it. For example, the fifth of the second set data above, the expression is:
130+5*20+150*1+150*(1/400)+300*(1/400)*(1/80)+300*(1/400)*(1/80)*(1/200) = 280.9399026
3.4.14.4. Effective query column
We have said that the column whose location position larger than the max filter column position is effective query column. Operated by =, effective query column will be more complex (in fact, it’s also the case in the < formula, but can not easily be found). Look back to the effective column in the < query formula first:
TYPFAC = 130 + MAXFLTCOLPOS*20+ COLTYPEFAC1*(1/20)^0 + COLTYPEFAC2*(1/20)^1 + … + COLTYPEFACn*(1/20)^(FLTCOLNUM-1) + 20*EFFQRYCOLNUM*(1/20)^FILTCOLNUM
By comparing the = formula, we can image the (1/20) in the expression 20*EFFQRYCOLNUM*(1/20)^FILTCOLNUM is the selectivity of the effective query columns. The testing data prove it’s right:
FLTCOLUMNS
2
MAXFLTCOLPOS
3
Density: 1/40
EFFQRYCOLNUM
COST_CPU
CPU_B
CPU_A
TYPFAC
CACULATED
1
504633940
7121440
497512500
497.5125
0.0125
2
504646440
7121440
497525000
497.525
0.025
3
504658940
7121440
497537500
497.5375
0.0375
4
504671440
7121440
497550000
497.55
0.05
However, we find some special case soon,
MAXFLTPOS: 3
Density
1/200
FLTCOLUMNS
3
EFFQRYCOLNUM
COST_CPU
CPU_B
CPU_A
TYPFAC
CACULATED
1
498628960
7121440
491507520
491.50752
2E-05
2
498628980
7121440
491507540
491.50754
2E-05
3
498629000
7121440
491507560
491.50756
2E-05
4
498629020
7121440
491507580
491.50758
2E-05
MAXFLTPOS
3
Density
1/800
FLTCOLUMNS
2
EFFQRYCOLNUM
COST_CPU
CPU_B
CPU_A
TYPFAC
1
497496480
7121440
490375040
490.37504
4E-05
2
497496520
7121440
490375080
490.37508
4E-05
Studied these cases, I found another rule: the increasing will stop at a threshold. How can we get this threshold? Take it easy, ignore it first, we look at another factor first.
3.4.14.5. Table records number
In the derivation of </> formula, we have come to a conclusion: NDV and the records number of the table will not affect COST_CPU. However, in the = formula, NDV has become proved to be one of the factors, then, we should re-inspected the number records of table.
Note: I have ever tested lots of cases to identify the factor, and will just list a few of them as example:
rows
1000
tabblks
100
Density
1/80
FLTCOLUMNS
1
EFFQRYCOLNUM
COST_CPU
CPU_B
CPU_A
TYPFAC
CACULATED
1
1202404
712144
490260
490.26
0.26
2
1202664
712144
490520
490.52
0.26
3
1202924
712144
490780
490.78
0.26
Base on these, I finally get the expression of effective query column in the formula:
20*ROUND((1/NDV)^FLTCOLNUM*TABROWS,0)/TABROWS
After this, we can continue the job on the threshold; it can be drawn as follow,
20*MAX(1,ROUND((1/NDV)^FLTCOLNUM*TABROWS,0))/TABROWS
Formula 16 is transformed as,
Formula 17: TYPFAC = 130 + MAXFLTCOLPOS*20+ COLTYPEFAC1*(1/1) + COLTYPEFAC2*(1/MAXNDV1) + … + COLTYPEFACn*(1/MAXNDV1)*…*(1/MAXNDVn-1) + 20*EFFQRYCOLNUM*MAX(1,ROUND((1/NDV)^FLTCOLNUM*TABROWS,0))/TABROWS
However, the NDV of all columns in these cases are set as same, we should consider their difference now.
A:160
B:80
C:100
D:200
E:400
A:VARCHAR2
B:NUMBER
C:DATE
D:DATE
E:NUMBER
MAXFILPOS
4
EFFCOLNUMS
4
EFFQRYCOLNUM
COST_CPU
CPU_B
TYPFAC
CACULATED
1
275409060
7121440
268.28762
0.00012
2
275409180
7121440
268.28774
0.00012
MAXFILPOS
5
EFFCOLNUMS
5
1
295035835
7121440
287.914395
0.00002
2
295035855
7121440
287.914415
0.00002
The rule is not hard to be found: the effective NDV of effective query columns is the product of the NDV of all filter columns. Base on it, the formula become,
Formula 18: TYPFAC = 130 + MAXFLTCOLPOS*20+ COLTYPEFAC1*(1/1) + COLTYPEFAC2*(1/MAXNDV1) + … + COLTYPEFACn*(1/MAXNDV1)*…*(1/MAXNDVn-1) + 20*EFFQRYCOLNUM*MAX(1,ROUND((1/MAXNDV1)*…*(1/MAXNDVn)*TABROWS,0))/TABROWS
Note: this formula can also be applied to </> case. Base table records number I used before is large (1000000), it quire more than 5 filter columns (ROUND ((1 / 20) ^ 5 * 1000000,0) = 0 <1) to get the threshold, so the threshold had not been noted. We can confirm it also exists in </> case by testing.