About Clusters

A cluster provides an optional method of storing table data. A cluster is made up of a group of tables that share the same data blocks. The tables are grouped together because they share common columns and are often used together. For example, the emp and dept table share the deptno column. When you cluster the emp and dept tables (see Figure 18-1), Oracle Database physically stores all rows for each department from both the emp and dept tables in the same data blocks.

Because clusters store related rows of different tables together in the same data blocks, properly used clusters offer two primary benefits:

Disk I/O is reduced and access time improves for joins of clustered tables.
The cluster key is the column, or group of columns, that the clustered tables have in common. You specify the columns of the cluster key when creating the cluster. You subsequently specify the same columns when creating every table added to the cluster. Each cluster key value is stored only once each in the cluster and the cluster index, no matter how many rows of different tables contain the value.

Therefore, less storage might be required to store related table and index data in a cluster than is necessary in non-clustered table format. For example, in Figure 18-1, notice how each cluster key (each deptno) is stored just once for many rows that contain the same value in both the emp and dept tables.

After creating a cluster, you can create tables in the cluster. However, before any rows can be inserted into the clustered tables, a cluster index must be created. Using clusters does not affect the creation of additional indexes on the clustered tables; they can be created and dropped as usual.

You should not use clusters for tables that are frequently accessed individually.

　索引簇的工作原理：

聚簇：如果一组表有一些共同的列，则将这样一组表存储在相同的数据库块中;聚簇还表示把相关的数据存储在同一个块上。利用聚簇，一个块可能包含多个表的数据。概念上就是如果两个或多个表经常做连接操作，那么可以把需要的数据预先存储在一起。聚簇还可以用于单个表，可以按某个列将数据分组存储。

　　更加简单的说，比如说，EMP表和DEPT表，这两个表存储在不同的segment中，甚至有可能存储在不同的TABLESPACE中，因此，他们的数据一定不会在同一个BLOCK里。而我们有会经常对这两个表做关联查询，比如说：select * from emp,dept where emp.deptno = dept.deptno .仔细想想，查询主要是对BLOCK的操作，查询的BLOCK越多，系统IO就消耗越大。如果我把这两个表的数据聚集在少量的BLOCK里，查询效率一定会提高不少。

　　比如我现在将值deptno=10的所有员工抽取出来，并且把对应的部门信息也存储在这个BLOCK里(如果存不下了，可以为原来的块串联另外的块)。这就是索引聚簇表的工作原理。

案例分析：

创建簇的 格式

    CREATE CLUSTER cluster_name
    (column date_type [,column datatype]...)
    [PCTUSED 40 | integer] [PCTFREE 10 | integer]
    [SIZE integer]
    [INITRANS 1 | integer] [MAXTRANS 255 | integer]
    [TABLESPACE tablespace]
    [STORAGE storage]

SIZE：指定估计平均簇键，以及与其相关的行所需的字节数。

1) 普通表连接查询

		10:06:37 SYS@ test1>conn scott/tiger

Connected.

11:47:08 SCOTT@ test1 >select e.ename,e.sal,e.deptno,d.dname,d.loc from emp e ,dept d

11:48:21   2   where e.deptno=d.deptno and d.deptno=10;

ENAME             SAL     DEPTNO DNAME          LOC

---------- ---------- ---------- -------------- -------------

CLARK            2450         10 ACCOUNTING     NEW YORK

KING             5000         10 ACCOUNTING     NEW YORK

MILLER           1300         10 ACCOUNTING     NEW YORK

执行计划：

Execution Plan

----------------------------------------------------------

Plan hash value: 568005898

----------------------------------------------------------------------------------------

| Id  | Operation                    | Name    | Rows  | Bytes | Cost (%CPU)| Time     |

----------------------------------------------------------------------------------------

|   0 | SELECT STATEMENT             |         |     3 |    99 |     4   (0)| 00:00:01 |

|   1 |  NESTED LOOPS                |         |     3 |    99 |     4   (0)| 00:00:01 |

|   2 |   TABLE ACCESS BY INDEX ROWID| DEPT    |     1 |    20 |     1   (0)| 00:00:01 |

|*  3 |    INDEX UNIQUE SCAN         | PK_DEPT |     1 |       |     0   (0)| 00:00:01 |

|*  4 |   TABLE ACCESS FULL          | EMP     |     3 |    39 |     3   (0)| 00:00:01 |

----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):

---------------------------------------------------

   3 - access("D"."DEPTNO"=10)

   4 - filter("E"."DEPTNO"=10)

Statistics

----------------------------------------------------------

          1  recursive calls

          0  db block gets

         10  consistent gets

          0  physical reads

          0  redo size

        766  bytes sent via SQL*Net to client

        419  bytes received via SQL*Net from client

          2  SQL*Net roundtrips to/from client

          0  sorts (memory)

          0  sorts (disk)

          3  rows processed

查看emp和dept表所在的块：                    

17:46:11 SCOTT@ test1 >select dbms_rowid.ROWID_OBJECT(rowid) "OBJ",DBMS_ROWID.ROWID_BLOCK_NUMBER(ROWID) "BLOCK_ID" FROM EMP WHERE ROWNUM=1;

       OBJ   BLOCK_ID

---------- ----------

     17571        149

Elapsed: 00:00:00.01

17:46:40 SCOTT@ test1 >select dbms_rowid.ROWID_OBJECT(rowid) "OBJ",DBMS_ROWID.ROWID_BLOCK_NUMBER(ROWID) "BLOCK_ID" FROM DEPT WHERE ROWNUM=1;

       OBJ   BLOCK_ID

---------- ----------

     17569        133

对于emp和dept表，属于不同的object，数据存储在不同的数据块上。

2）簇表查询

		------建立簇

create cluster dept_emp_clu (deptno number(3))

  pctfree 20 pctused 60

  size 500 tablespace users;

Size选项

  ：是用来告诉Oracle预计有500字节数据和每个cluser key相关。Oracle将使用这个信息来计算每个block能容纳的最大cluster key数目。因此size太高，在每一block将得到很少的key，并且将使用比需要的更多的空间；设置容量太低，将得到过多的数据连接，这将偏离使用cluster的目的。Size是cluster的重要参数。

---- 建立簇表

  create table department(

  id number(3) primary key,

  dname varchar(14) ,loc varchar2(13))

  cluster dept_emp_clu(id);

  create table employee(

  eno number(4) primary key ,

  ename varchar2(10),

  job varchar2(9),

  mgr number(4),

  hiredate date,

  sal number(7,2),

  comm number(7,2),

  dept_id number(3) references department

  ) cluster dept_emp_clu(dept_id);

-----在簇上建立索引

  create index dept_emp_idx on cluster dept_emp_clu

   tablespace indx;

11:49:43 SCOTT@ test1 >analyze table department compute statistics;

Table analyzed.

11:50:15 SCOTT@ test1 >analyze table employee compute statistics;

Table analyzed.

11:50:31 SCOTT@ test1 >select e.ename,e.sal,e.DEPT_ID,d.dname,d.loc from employee e,department d

11:50:39   2  where e.dept_id=d.id and d.id=10; 

ENAME             SAL    DEPT_ID DNAME          LOC

---------- ---------- ---------- -------------- -------------

CLARK            2450         10 ACCOUNTING     NEW YORK

KING             5000         10 ACCOUNTING     NEW YORK

MILLER           1300         10 ACCOUNTING     NEW YORK

执行计划：

Execution Plan

----------------------------------------------------------

Plan hash value: 2165989181

---------------------------------------------------------------------------------------------

| Id  | Operation                    | Name         | Rows  | Bytes | Cost (%CPU)| Time     |

---------------------------------------------------------------------------------------------

|   0 | SELECT STATEMENT             |              |     5 |   140 |     2   (0)| 00:00:01 |

|   1 |  NESTED LOOPS                |              |     5 |   140 |     2   (0)| 00:00:01 |

|   2 |   TABLE ACCESS BY INDEX ROWID| DEPARTMENT   |     1 |    18 |     1   (0)| 00:00:01 |

|*  3 |    INDEX UNIQUE SCAN         | SYS_C005404  |     1 |       |     0   (0)| 00:00:01 |

|   4 |   TABLE ACCESS CLUSTER       | EMPLOYEE     |     5 |    50 |     1   (0)| 00:00:01 |

|*  5 |    INDEX UNIQUE SCAN         | DEPT_EMP_IDX |     1 |       |     0   (0)| 00:00:01 |

---------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):

---------------------------------------------------

   3 - access("D"."ID"=10)

   5 - access("E"."DEPT_ID"=10)

Statistics

----------------------------------------------------------

          1  recursive calls

          0  db block gets

          5  consistent gets

          0  physical reads

          0  redo size

        767  bytes sent via SQL*Net to client

        419  bytes received via SQL*Net from client

          2  SQL*Net roundtrips to/from client

          0  sorts (memory)

          0  sorts (disk)

          3  rows processed

   对于Cluster table连接访问，可以看到“5  consistent gets“，比普通表”10  consistent gets“减少了一半的一致性读。

查询簇表所在的块：

10:16:53 SYS@ test1>conn scott/tiger

Connected.

10:19:04 SCOTT@ test1>select rowid,ename from employee where rownum =1;

ROWID              ENAME

------------------ ----------

AAACiCAAEAAAACEAAA SMITH

10:19:05 SCOTT@ test1>select rowid,dname from department where rownum=1;

ROWID              DNAME

------------------ --------------

AAACiCAAEAAAACEAAA ACCOUNTING

17:49:17 SCOTT@ test1 >select dbms_rowid.ROWID_OBJECT('AAACiCAAEAAAACEAAA') "OBJ",DBMS_ROWID.ROWID_BLOCK_NUMBER('AAACiCAAEAAAACEAAA') "BLOCK_ID" FROM DUAL;

       OBJ   BLOCK_ID

---------- ----------

     10370        132

------因为簇表的数据放在相同的块上，所以在表连接查询时，减少了I/O

散列聚簇表

		在簇表中，Oracle使用存储在索引中的键值来定位表中的行，而在散列聚簇表中，使用了散列函数代替了簇索引，先通过内部函数或者自定义的函数进行散列计算，然后再将计算得到的码值用于定位表中的行。创建散列簇需要用到HASHKEYS子句。     

1、创建散列簇     

create cluster my_clu_two(empno number(10) )    

pctused 70    

pctfree 10    

tablespace users    

hash is empno    

hashkeys 150 ;     

说明：    

* hash is 子句指明了进行散列的列，如果列是唯一的标示行，就可以将列指定为散列值    

* hashkeys 指定和限制散列函数可以产生的唯一的散列值的数量     

2、创建散列表     

create table t2_emp (      

empno number ( 10 ),      

ename varchar2 ( 20 ),      

birth_date date ,      

deptno number )    

cluster my_clu_two(empno);     

注意：    

* 必须设置数值的精度(具体原因不详)    

* 散列簇不能也不用创建索引    

* 散列簇不能ALTER:size、hashkeys、hash is参数

案例分析：

当数据存储在一个普通表中的时候，这些记录将以插入到数据库时的顺序物理地保存到分配的块中。例如，如果有一个用于存储员工信息的表，那么员工姓名将会按照插入到表的顺序存储在表中。　　

　　如果员工记录非常多的话，那么数据表的响应速度就会逐渐变慢。你可以通过选择值相对等分布的一列（如员工的部门编号）并建立一个簇表来提高查询员工的速度。

　　在簇表中，如果员工属于同一个部门，那么它们的记录将物理地存储在同一系列的块中。这样就可以提高查找员工信息的速度，这是因为在检索某个特定部门的员工时，需要读取数据库块的数量减少了。而在非簇表中查找员工，就可能需要对每个数据库块进行访问。

　　当表中存在大量键值的时候，你就会开始发现由于存在许多簇块而导致的性能问题。避免这个问题的一个方法就是使用一个哈希函数来约束簇块的数量。哈希函数将会给定一个数值用来限定簇块数量的预计范围，但它得到的值是相对等分布的。例如你可以创建一个哈希函数,只比较部门编号的最后两位。

　　哈希函数中存在的一个问题就是函数值会打乱记录原本的顺序。你可以通过 ORDER BY来解决这个问题；但是，在很多情况下，记录数量是非常庞大的。在Oracle 10g 中，你可以将一个数据定义为“natural order” ，那么就可以不用经过排序而以你所希望的顺序来检索哈希簇的数据，从而解决了上面的提出问题。

　　例如，假设你有一个信用卡业务的数据库。你决定以信用卡号作为簇主键将有利于数据的存储分布。但是，由于存在大量的信用卡号，所以可以使用一个哈希函数来约束簇块的数量。而且你希望在你的大部分报表中数据是按照时间顺序排列的，那么在进行每个查询操作时使用排序哈希簇，而不要使用ORDER BY。

下面给出了相关语句：

		10:54:06 SCOTT@ test1 >create cluster credit_cluster(

card_no varchar2(16),

transdate date sort) 

hashkeys 10000 

hash is ora_hash(card_no) 

size 256;

Cluster created.

10:55:25 SCOTT@ test1 >create table credit_orders

(card_no varchar2(16),

transdate date,

amount number) 

cluster credit_cluster(card_no,transdate);

Table created.

10:56:10 SCOTT@ test1 >select * from tab;

TNAME                          TABTYPE  CLUSTERID

------------------------------ ------- ----------

BONUS                          TABLE

CREDIT_CLUSTER                 CLUSTER

CREDIT_ORDERS                  TABLE            1

10:56:15 SCOTT@ test1 >alter session set nls_date_format = "YYYYMMDDHH24MISS";

Session altered.

10:56:33 SCOTT@ test1 >insert into credit_orders (card_no,transdate,amount) values ('4111111111111111','20050131000123',57.99);

1 row created.

10:57:24 SCOTT@ test1 >insert into credit_orders (card_no,transdate,amount) values ('4111111111111111','20050130071216',16.59);

1 row created.

10:57:33 SCOTT@ test1 >insert into credit_orders (card_no,transdate,amount) values ('4111111111111111','20050131111111',39.00);

1 row created.

10:57:45 SCOTT@ test1 >insert into credit_orders (card_no,transdate,amount) values ('4111111111111111','20050130081001',25.16);

1 row created.

11:12:24 SCOTT@ test1 >insert into credit_orders (card_no,transdate,amount) values ('5111111111111111','20050131000123',57.99);

1 row created.

11:16:16 SCOTT@ test1 >insert into credit_orders (card_no,transdate,amount) values ('5111111111111111','20050130071216',16.59);

1 row created.

11:16:36 SCOTT@ test1 >insert into credit_orders (card_no,transdate,amount) values('5111111111111111','20050131111111',39.00);

1 row created.

11:16:57 SCOTT@ test1 >commit;

Commit complete.

11:17:02 SCOTT@ test1 >select * from credit_orders;

CARD_NO          TRANSDATE          AMOUNT

---------------- -------------- ----------

4111111111111111 20050131000123      57.99

4111111111111111 20050130071216      16.59

4111111111111111 20050131111111         39

4111111111111111 20050130081001      25.16

5111111111111111 20050131000123      57.99

5111111111111111 20050130071216      16.59

5111111111111111 20050131111111         39

7 rows selected.

10:58:14 SCOTT@ test1 >SET AUTOTRACE ON

10:59:53 SCOTT@ test1 >select * from credit_orders where card_no=4111111111111111;

CARD_NO          TRANSDATE          AMOUNT

---------------- -------------- ----------

4111111111111111 20050131000123      57.99

4111111111111111 20050130071216      16.59

4111111111111111 20050131111111         39

4111111111111111 20050130081001      25.16

Execution Plan

----------------------------------------------------------

Plan hash value: 1026124641

-----------------------------------------------------------------------------------

| Id  | Operation         | Name          | Rows  | Bytes | Cost (%CPU)| Time     |

-----------------------------------------------------------------------------------

|   0 | SELECT STATEMENT  |               |     1 |    32 |    87   (0)| 00:00:02 |

|*  1 |  TABLE ACCESS FULL| CREDIT_ORDERS |     1 |    32 |    87   (0)| 00:00:02 |

-----------------------------------------------------------------------------------

Predicate Information (identified by operation id):

---------------------------------------------------

   1 - filter(TO_NUMBER("CARD_NO")=4111111111111111)

Note

-----

   - dynamic sampling used for this statement (level=2)

Statistics

----------------------------------------------------------

         27  recursive calls

          0  db block gets

        418  consistent gets

          0  physical reads

          0  redo size

        667  bytes sent via SQL*Net to client

        415  bytes received via SQL*Net from client

          2  SQL*Net roundtrips to/from client

          2  sorts (memory)

          0  sorts (disk)

          4  rows processed

11:00:32 SCOTT@ test1 >analyze

11:01:10   2  table credit_orders compute statistics;

Table analyzed.

11:01:30 SCOTT@ test1 >select * from credit_orders where card_no=4111111111111111;

CARD_NO          TRANSDATE          AMOUNT

---------------- -------------- ----------

4111111111111111 20050131000123      57.99

4111111111111111 20050130071216      16.59

4111111111111111 20050131111111         39

4111111111111111 20050130081001      25.16

Elapsed: 00:00:00.02

Execution Plan

----------------------------------------------------------

Plan hash value: 1026124641

-----------------------------------------------------------------------------------

| Id  | Operation         | Name          | Rows  | Bytes | Cost (%CPU)| Time     |

-----------------------------------------------------------------------------------

|   0 | SELECT STATEMENT  |               |     3 |    72 |    87   (0)| 00:00:02 |

|*  1 |  TABLE ACCESS FULL| CREDIT_ORDERS |     3 |    72 |    87   (0)| 00:00:02 |

-----------------------------------------------------------------------------------

Predicate Information (identified by operation id):

---------------------------------------------------

   1 - filter(TO_NUMBER("CARD_NO")=4111111111111111)

Statistics

----------------------------------------------------------

         12  recursive calls

          0  db block gets

        352  consistent gets

          0  physical reads

          0  redo size

        667  bytes sent via SQL*Net to client

        415  bytes received via SQL*Net from client

          2  SQL*Net roundtrips to/from client

          1  sorts (memory)

          0  sorts (disk)

          4  rows processed

我们可以通过创建一个单表的hash cluster，将相同键值的数据物理存放在一起，达到提高性能的目的。创建cluster有两个最重要的参数：hashkeys和size，前者表示cluster中有多少个不同的键值，后者表示每个键值需要分配的空间。因为hash cluster的空间是预先分配的，这两个值的正确设置对cluster的性能影响非常大。hashkeys设置过大，会造成空间浪费，而如果设置过小，则会产生大量的hash碰撞，极大影响性能。size也是一样，设置过大会浪费空间，而设置过小，数据超过预先分配的空间时，会通过链接方式存放在溢出段中，影响性能。会导致块溢出（同一个hash key分布在不同的block中），从而导致IO增大。而这两个值一旦设置，就无法更改，除非重建cluster。

hash cluster简单的说就是通过预先分配空间的方式，将相同key的数据存放在一起，以提高查询性能的一种手段，所以准确的设置hashkeys和size参数是使用hash cluster的关键，使用的前提是key的数量是可以估算的，而且每个key的数据是基本平均的。但是，在实际使用的环境中，数据量的变化往往是不可预知的，这也造成hash cluster的应用场景非常有限。

簇表的应用：

什么时候不应该使用聚簇：

　　1) 如果预料到聚簇中的表会大量修改：必须知道，索引聚簇会对DML的性能产生某种负面影响(特别是INSERT语句)。管理聚簇中的数据需要做更多的工作。

　　2) 如果需要对聚簇中的表执行全表扫描：不只是必须对你的表中的数据执行全面扫描，还必须对(可能的)多个表中的数据进行全面扫描。由于需要扫描更多的数据，所以全表扫描耗时更久。

　　3) 如果你认为需要频繁地TRUNCATE和加载表：聚簇中的表不能截除。这是显然的，因为聚簇在一个块上存储了多个表，必须删除聚簇表中的行。

　　因此，如果数据主要用于读(这并不表示“从来不写”;聚簇表完全可以修改)，而且要通过索引来读(可以是聚簇键索引，也可以是聚簇表上的其他索引)，另外会频繁地把这些信息联结在一起，此时聚簇就很适合。

补充关于需要使用聚簇表的情况：

考虑对经常在连接语句中访问的表建立聚簇。

如果表只是偶尔被连接或者它们的公共列经常被修改，则不要聚簇表。（修改记录的聚簇键值比在非聚簇的表中修改此值要花费更多的时间，因为Oracle 必须将修改的记录移植到其他的块中以维护聚簇）。

如果经常需要在一个表上进行完全搜索，则不要聚簇这个表（对一个聚簇表进行完全搜索比在非聚簇表上进行完全搜索的时间长，Oracle 可能要读更多的块，因为表是被一起存储的。）

如果经常从一个父表和相应的子表中查询记录，则考虑给1 对多（1:* ）关系创建聚簇表。（子表记录存储在与父表记录相同的数据块中，因此当检索它们时可以同时在内存中，因此需要Oracle 完成较少的I/O ）。

如果经常查询同一个父表中的多个子记录，则考虑单独将子表聚簇。（这样提高了从相同的父表查询子表记录的性能，而且也没有降低对父表进行完全搜索的性能）。

如果从所有有相同聚簇键值的表查询的数据超过一个或两个Oracle 块，则不要聚簇表。（要访问在一个聚簇表中的记录，Oracle 读取所有包含那个记录值的全部数据块，如果记录占据了多个数据块，则访问一个记录需要读的次数比一个非聚簇的表中访问相同的记录读的次数要多）。

使用哈希聚簇指南：

当经常使用有相同列的包含相等条件的查询子句访问表时，考虑使用哈希聚簇来存储表。使用这些列作为聚簇键。

如果可以确定存放具有给定聚簇键值的所有记录所需的空间（包括现在的和将来的），则将此表以哈希聚簇存储。

如果空间不够，并且不能为将要插入的新记录分配额外的空间，那么不要使用哈希聚簇。

如果偶尔创建一个新的、很大的哈希聚簇来保存这样的表是不切实际的，那么不要用哈希聚簇存储经常增长的表。

如果经常需要进行全表搜索，并且必须要为表的预期增长中的哈希聚簇分配足够的空间，则不要将此表以哈希聚簇存储。（这样的完全检索必须要读分配给哈希聚簇的全部块，即使有些块可能只包含很少的记录。单独地存储表将减少由完全的表检索读取的块的数量。）

如果你的应用程序经常修改聚簇键的值，则不要将表以哈希聚簇方式存储。

不管这个表是否经常与其他表连接，只要进行哈希对于基于以前的指南的表是合适的，那么在哈希聚簇中存储一个表可能是有用的。

通过案例学调优之--Oracle Cluster Table

About Clusters