Full Table Scan cost formula cracking(4)

Date: 2009-08-01 15:03:34

Share to

3.4.11.           Max filter column position (with mixed filter column data type)

When I using a mixed filter column data type in the test, I found the effect of MAXFILPOS in the formula is not linear, but changes is still regular:

date

NUMBER

MAXFILPOS

5

EFFCOLNUMS

5

FLTCOLNUM

resc_cpu

CPU_B

CPU_A

TYPFAC

1

387121440

7121440

380000000

380

2

402121440

7121440

395000000

395

3

402871440

7121440

395750000

395.75

4

402908940

7121440

395787500

395.7875

5

402910815

7121440

395789375

395.78938

date

NUMBER

MAXFILPOS

4

EFFCOLNUMS

5

FLTCOLNUM

resc_cpu

CPU_B

CPU_A

TYPFAC

1

368121440

7121440

361000000

361

19

2

382171440

7121440

375050000

375.05

19.95

3

382873940

7121440

375752500

375.7525

19.9975

4

382909060

7121440

375787620

375.78762

19.99988

date

NUMBER

MAXFILPOS

3

EFFCOLNUMS

5

FLTCOLNUM

resc_cpu

CPU_B

CPU_A

TYPFAC

1

349121440

7121440

342000000

342

19

2

362221440

7121440

355100000

355.1

19.95

3

362876440

7121440

355755000

355.755

19.9975

Observe the data, I finally get the rule:

F(MAXFILPOS)–F(MAXFILPOS-1) = 20–20^(1-FLTCOLNUM)

Put this equation into formula 12, we get the new formula:

Formula 13: TYPFAC = 130 + EFFCOLNUMS + 5*19 - (5-MAXFLTCOLPOS)*(20–20^(1-FLTCOLNUM)) + COLTYPEFAC1*(1/20)^0 + COLTYPEFAC2*(1/20)^1 + … + COLTYPEFACn*(1/20)^(FLTCOLNUM-1)

        = 225 + EFFCOLNUMS - (5-MAXFLTCOLPOS)*(20–20^(1-FLTCOLNUM)) + COLTYPEFAC1*(1/20)^0 + COLTYPEFAC2*(1/20)^1 + … + COLTYPEFACn*(1/20)^(FLTCOLNUM-1)

        = 225 + EFFCOLNUMS -100 + MAXFLTCOLPOS*20 + (5-MAXFLTCOLPOS)*20^(1-FLTCOLNUM) + COLTYPEFAC1*(1/20)^0 + COLTYPEFAC2*(1/20)^1 + … + COLTYPEFACn*(1/20)^(FLTCOLNUM-1)

        = 125 + EFFCOLNUMS + MAXFLTCOLPOS*20 + (5-MAXFLTCOLPOS)*20^(1-FLTCOLNUM) + COLTYPEFAC1*(1/20)^0 + COLTYPEFAC2*(1/20)^1 + … + COLTYPEFACn*(1/20)^(FLTCOLNUM-1)

3.4.12.           Effective column number (with mixed filter column data type)

Besides MAXFLTCOLPOS, I also found other factor violated the formula we ever got in the query with mixed filter column data type, which is EFFCOLNUM:

date

NUMBER

MAXFILPOS

3

EFFCOLNUMS

5

FLTCOLNUM

resc_cpu

CPU_B

CPU_A

TYPFAC

1

349121440

7121440

342000000

342

2

362221440

7121440

355100000

355.1

3

362876440

7121440

355755000

355.755

date

NUMBER

MAXFILPOS

3

EFFCOLNUMS

4

FLTCOLNUM

resc_cpu

CPU_B

CPU_A

TYPFAC

1

348121440

7121440

341000000

341

1

2

362171440

7121440

355050000

355.05

0.05

3

362873940

7121440

355752500

355.7525

0.0025

date

NUMBER

MAXFILPOS

3

EFFCOLNUMS

3

FLTCOLNUM

resc_cpu

CPU_B

CPU_A

TYPFAC

1

347121440

7121440

340000000

340

1

2

362121440

7121440

355000000

355

0.05

3

362871440

7121440

355750000

355.75

0.0025

Fortunately, the rule is not hard to be found:

F(EFFCOLNUMS)–F(EFFCOLNUMS-1) = 20^(1-FLTCOLNUM)

Modify formula 13 base on this rule:

Formula 14: TYPFAC = 130+(20*EFFCOLNUM-100)*(20^(-FILTCLNUM))+MAXFLTCOLPOS*20+(5-MAXFLTCOLPOS)*20^(1-FLTCOLNUM)+ COLTYPEFAC1*(1/20)^0 + COLTYPEFAC2*(1/20)^1 + … + COLTYPEFACn*(1/20)^(FLTCOLNUM-1)

       = 130 + EFFCOLNUM*20^(1-FILTCOLNUM) - 5*20^(1-FILTCOLNUM) + MAXFLTCOLPOS*20 + (5-MAXFLTCOLPOS)*20^(1-FLTCOLNUM) + COLTYPEFAC1*(1/20)^0 + COLTYPEFAC2*(1/20)^1 + … + COLTYPEFACn*(1/20)^(FLTCOLNUM-1)

       = 130 + EFFCOLNUM*20^(1-FILTCOLNUM) + MAXFLTCOLPOS*(20 - 20^(1-FLTCOLNUM))) + COLTYPEFAC1*(1/20)^0 + COLTYPEFAC2*(1/20)^1 + … + COLTYPEFACn*(1/20)^(FLTCOLNUM-1)

       = 130 + EFFCOLNUM*20^(1-FILTCOLNUM) + MAXFLTCOLPOS*(20 - 20^(1-FLTCOLNUM))) + COLTYPEFAC1*(1/20)^0 + COLTYPEFAC2*(1/20)^1 + … + COLTYPEFACn*(1/20)^(FLTCOLNUM-1)

       = 130 + EFFCOLNUM*20^(1-FILTCOLNUM) + MAXFLTCOLPOS*(20 - 20^(1-FLTCOLNUM))) + COLTYPEFAC1*(1/20)^0 + COLTYPEFAC2*(1/20)^1 + … + COLTYPEFACn*(1/20)^(FLTCOLNUM-1)

Consider formula 8-1, it can be transformed as:

Formula 15: TYPFAC = 130 + (EFFQRYCOLNUM + MAXFLTCOLPOS)*20^(1-FILTCOLNUM) + MAXFLTCOLPOS*(20 - 20^(1-FLTCOLNUM))) + COLTYPEFAC1*(1/20)^0 + COLTYPEFAC2*(1/20)^1 + … + COLTYPEFACn*(1/20)^(FLTCOLNUM-1)

       = 130 + EFFQRYCOLNUM*(1/20)^(FILTCOLNUM-1) + MAXFLTCOLPOS*20+ COLTYPEFAC1*(1/20)^0 + COLTYPEFAC2*(1/20)^1 + … + COLTYPEFACn*(1/20)^(FLTCOLNUM-1)

       = 130 + MAXFLTCOLPOS*20+ COLTYPEFAC1*(1/20)^0 + COLTYPEFAC2*(1/20)^1 + … + COLTYPEFACn*(1/20)^(FLTCOLNUM-1) + 20*EFFQRYCOLNUM*(1/20)^FILTCOLNUM

3.4.13.           Filter columns computing sequence (with mixed filter column data type)

We mentioned before, if the data types of 2 columns be changed in turn, it is difficult to see the rule similar to previous data. So I guess the optimizer will compute the filter columns in a special sequence (it’s the truth, can be controlled by hints, which is out of range of our article), to get the value of the power (the n in COLTYPEFAC2 * (1 / 20) ^ n) above.
A closer observation on the test data with mixed data types, we can still find many other rules (two columns in this example):

COL1

COL2

MAXFLTPOS

COST_CPU

varchar2

date

3

262221440

number

char

4

274671440

date

varchar2

5

302121440

char

varchar2

1

269671440

varchar2

number

2

294621440

… …

… …

Then put the above data into the formula 12, we found such a rule: when mix CHAR/VARCHAR2 with NUMBER/DATE, the CHAR/VARCHAR2 column has the smaller power value; when fix NUMBER with DATE, NUMBER column will have the smaller power value. Considering their COLTYPEFAC, we can draw such a assumption 9: Optimizer will calculate the AND filter columns in their COLTYPEFAC sequence, from small to big, increasing power value start from 1.

I adopted a more complex combination of mixed data type columns to prove this assumption:

COLA

COLB

COLC

COLD

COLE

MAXFLTPOS

COST_CPU

VARCAHR2

number

date

varchar2

date

5

290035815

Number

varchar2

date

date

varchar2

5

290035815

In these 2 copies of data, in spite of the physical location of the column, they have the same set of data types: (VARCHAR2, NUMBER, DATE, VARCHAR2, DATE). According to assumption 9, their uniform sequence is (DATE, DATE, NUMBER, VARCHAR2, VARCHAR2), therefore calculated the same TYPEFAC:
130 +20 * 5 +50 * (1 / 20) ^ 0 +50 * (1 / 20) ^ 1 +150 * (1 / 20) ^ 2 +300 * (1 / 20) ^ 3 +300 * (1 / 20) ^ 4 = 282.9144 = 282.914375
The results is same the results calculate from formula 5:
(290035815 - (4500/8192 +0.32) * 1000 * 81929) / 1000000 = 282.914375
Through other random data, we can find that the assumption 9 is valid.

Note: In fact, this sequence rule is not the finally rule, it’s the one we can find under current circumstance. We will find the real rule when I involved more complex conditions.

3.4.14.           Filter operator (=)

Prior to that, our filter operator is > or <. How about to change it to other operator? We first try on =.

3.4.14.1.             Distinct value number of the column

Use the same method and process as > operator, it comes to the formula (here, I did not take query column whose position is greater than the max filter column position column into account, because it’s relatively complexity, I will analysis it in further later):
130 + MAXFLTCOLPOS * 20 + COLTYPEFAC1 * (1 / 1600) ^ 0 + COLTYPEFAC2 * (1 / 1600) ^ 1 + ... + COLTYPEFACn * (1 / 1600) ^ (FLTCOLNUM-1)
Compare to the formula of >, the obvious difference is filter expression in the sub-base, form 1/20 to 1/1600. By reference to the data of baseline table, the result is easily let me associate another test I eve did: column selectivity (see the article 查询计划中Cardinality的计算). In that test, there is a preliminary conclusion: when use bind variables, the selectivity of >, <, <= and >= is 1/20, and the selectivity of = is 1/NDV. Here 1600 is the uniform number of distinct values (NDV) of all of the columns.
By modifying the NDV, it proves that this assumption is valid. Formula can therefore be changed as:

Formula 16: 130 + MAXFLTCOLPOS*20+ COLTYPEFAC1*(1/NDV)^0 + COLTYPEFAC2*(1/NDV)^1 + … + COLTYPEFACn*(1/NDV)^(FLTCOLNUM-1)

3.4.14.2.             Mixed NDV

Because the all of the columns of the base table have been set to the same NDV, after come out the formula 16, I naturally think of another question: If the filters have the different NDV, is sub-expression also different? Still look at data with various columns NDV:

A:160

B:80

C:100

D:200

E:400

FILTER

CPU_B

CPU_A

TYPFAC

CACULATED

a = :v1 b = :v2

478996440

7121440

471.875

471.875

c = :v3 b = :v2

500121440

7121440

493

493

a = :v1 b = :v2 c = :v3

499015190

7121440

491.89375

491.89375

b = :v2 d = :v4 e = :v5

537875190

7121440

530.75375

530.75375

a = v1 b = :v2 c = :v4 e = :v5

537876174

7121440

530.75473

530.7547344

c = v3 b = :v2 d = :v4 e = :v5

537875228

7121440

530.75379

530.7537875

a = :v1 c = v3 b = :v2 d = :v4 e = :v5

537875214

7121440

530.75377

530.7537737

… …

From these data, the rule can be found: the effective NDV on each filter column is product of NDV of the columns prior to it. For example, under the assumption 9, the fifth data above: the effective NDV of the first column of is 1 (none previous column), the effective NDV of the second column is 400, the effective NDV of the third column is 400 * 200 ... ..., and its computing expression as follows:

30 + 5 * 20 + 300 * (1 / 1) + 300 * (1 / 400) + 300 * (1 / 400) * (1 / 200) + 300 * (1 / 400) * (1 / 200) * (1 / 160) + 300 * (1 / 400) * (1 / 200) * (1 / 160) * (1 / 100) = 530.7537875

According to this rule, the formula changes as follows:

Formula 16: 130 + MAXFLTCOLPOS*20+ COLTYPEFAC1*(1/1) + COLTYPEFAC2*(1/MAXNDV1) + … + COLTYPEFACn*(1/MAXNDV1)*…*(1/MAXNDVn-1)

Note: the formula can also be applied to <, >, while its NDV is 20.

3.4.14.3.             Mixed data types

In the derivation before, we have confirmed columns have different data types, it will calculate the filter column by the sequence based on data type factors, from small to big. However, if there are the mixed data types with mixed NDV, what will be the result? Modify the data type of the base table above; get 2 sets of test data:

A:VARCHAR2

B:DATE

C:NUMBER

D:NUMBER

E:VARCHAR2

FILTER

CPU_B

CPU_A

TYPFAC

CACULATED

a = :v1 b = :v2

228996440

7121440

221.875

221.875

c = :v3 b = :v2

350121440

7121440

343

343

a = :v1 b = :v2 c = :v3

248077690

7121440

240.95625

240.95625

b = :v2 d = :v4 e = :v5

287500190

7121440

280.37875

280.37875

a = v1 b = :v2 c = :v4 e = :v5

287248831

7121440

280.12739

280.1273906

c = v3 b = :v2 d = :v4 e = :v5

287498353

7121440

280.37691

280.3769125

a = :v1 c = v3 b = :v2 d = :v4 e = :v5

287248796

7121440

280.12736

280.1273557

A:VARCHAR2

B:NUMBER

C:DATE

D:DATE

E:NUMBER

FILTER

CPU_B

CPU_A

TYPFAC

CACULATED

a = :v1 b = :v2

228058940

7121440

220.9375

220.9375

c = :v3 b = :v2

350871440

7121440

343.75

343.75

a = :v1 b = :v2 c = :v3

248082378

7121440

240.96094

240.9609375

b = :v2 d = :v4 e = :v5

387505815

7121440

380.38438

380.384375

a = v1 b = :v2 c = :v4 e = :v5

288061342

7121440

280.9399

280.9399023

c = v3 b = :v2 d = :v4 e = :v5

387505862

7121440

380.38442

380.3844219

a = :v1 c = v3 b = :v2 d = :v4 e = :v5

288061343

7121440

280.9399

280.9399026

From these data, a rule can be get: the compute sequence is first considering COLTYPEFAC, from small to large, and then in accordance with selectivity (1/NDV), from small to large, the effective NDV on each filter column is product of NDV of the columns prior to it. For example, the fifth of the second set data above, the expression is:

130+5*20+150*1+150*(1/400)+300*(1/400)*(1/80)+300*(1/400)*(1/80)*(1/200) = 280.9399026

3.4.14.4.             Effective query column

We have said that the column whose location position larger than the max filter column position is effective query column. Operated by =, effective query column will be more complex (in fact, it’s also the case in the < formula, but can not easily be found). Look back to the effective column in the < query formula first:

TYPFAC = 130 + MAXFLTCOLPOS*20+ COLTYPEFAC1*(1/20)^0 + COLTYPEFAC2*(1/20)^1 + … + COLTYPEFACn*(1/20)^(FLTCOLNUM-1) + 20*EFFQRYCOLNUM*(1/20)^FILTCOLNUM

By comparing the = formula, we can image the (1/20) in the expression 20*EFFQRYCOLNUM*(1/20)^FILTCOLNUM is the selectivity of the effective query columns. The testing data prove it’s right:

FLTCOLUMNS

2

MAXFLTCOLPOS

3

Density: 1/40

EFFQRYCOLNUM

COST_CPU

CPU_B

CPU_A

TYPFAC

CACULATED

1

504633940

7121440

497512500

497.5125

0.0125

2

504646440

7121440

497525000

497.525

0.025

3

504658940

7121440

497537500

497.5375

0.0375

4

504671440

7121440

497550000

497.55

0.05

However, we find some special case soon,

MAXFLTPOS: 3

Density

1/200

FLTCOLUMNS

3

EFFQRYCOLNUM

COST_CPU

CPU_B

CPU_A

TYPFAC

CACULATED

1

498628960

7121440

491507520

491.50752

2E-05

2

498628980

7121440

491507540

491.50754

2E-05

3

498629000

7121440

491507560

491.50756

2E-05

4

498629020

7121440

491507580

491.50758

2E-05

MAXFLTPOS

3

Density

1/800

FLTCOLUMNS

2

EFFQRYCOLNUM

COST_CPU

CPU_B

CPU_A

TYPFAC

1

497496480

7121440

490375040

490.37504

4E-05

2

497496520

7121440

490375080

490.37508

4E-05

Studied these cases, I found another rule: the increasing will stop at a threshold. How can we get this threshold? Take it easy, ignore it first, we look at another factor first.

3.4.14.5.             Table records number

In the derivation of </> formula, we have come to a conclusion: NDV and the records number of the table will not affect COST_CPU. However, in the = formula, NDV has become proved to be one of the factors, then, we should re-inspected the number records of table.

Note: I have ever tested lots of cases to identify the factor, and will just list a few of them as example:

rows

1000

tabblks

100

Density

1/80

FLTCOLUMNS

1

EFFQRYCOLNUM

COST_CPU

CPU_B

CPU_A

TYPFAC

CACULATED

1

1202404

712144

490260

490.26

0.26

2

1202664

712144

490520

490.52

0.26

3

1202924

712144

490780

490.78

0.26

Base on these, I finally get the expression of effective query column in the formula:

20*ROUND((1/NDV)^FLTCOLNUM*TABROWS,0)/TABROWS

After this, we can continue the job on the threshold; it can be drawn as follow,

20*MAX(1,ROUND((1/NDV)^FLTCOLNUM*TABROWS,0))/TABROWS

Formula 16 is transformed as,

Formula 17: TYPFAC = 130 + MAXFLTCOLPOS*20+ COLTYPEFAC1*(1/1) + COLTYPEFAC2*(1/MAXNDV1) + … + COLTYPEFACn*(1/MAXNDV1)*…*(1/MAXNDVn-1) + 20*EFFQRYCOLNUM*MAX(1,ROUND((1/NDV)^FLTCOLNUM*TABROWS,0))/TABROWS

However, the NDV of all columns in these cases are set as same, we should consider their difference now.

A:160

B:80

C:100

D:200

E:400

A:VARCHAR2

B:NUMBER

C:DATE

D:DATE

E:NUMBER

MAXFILPOS

4

EFFCOLNUMS

4

EFFQRYCOLNUM

COST_CPU

CPU_B

TYPFAC

CACULATED

1

275409060

7121440

268.28762

0.00012

2

275409180

7121440

268.28774

0.00012

MAXFILPOS

5

EFFCOLNUMS

5

1

295035835

7121440

287.914395

0.00002

2

295035855

7121440

287.914415

0.00002

The rule is not hard to be found: the effective NDV of effective query columns is the product of the NDV of all filter columns. Base on it, the formula become,

Formula 18: TYPFAC = 130 + MAXFLTCOLPOS*20+ COLTYPEFAC1*(1/1) + COLTYPEFAC2*(1/MAXNDV1) + … + COLTYPEFACn*(1/MAXNDV1)*…*(1/MAXNDVn-1) + 20*EFFQRYCOLNUM*MAX(1,ROUND((1/MAXNDV1)*…*(1/MAXNDVn)*TABROWS,0))/TABROWS

Note: this formula can also be applied to </> case. Base table records number I used before is large (1000000), it quire more than 5 filter columns (ROUND ((1 / 20) ^ 5 * 1000000,0) = 0 <1) to get the threshold, so the threshold had not been noted. We can confirm it also exists in </> case by testing.

Previous：Full Table Scan cost formula cracking(5)

Next：Full Table Scan cost formula cracking(3)

Return To this Category