Oracle7 Server Administrator's Guide

Creating Hash Clusters

After a hash cluster is created, tables can be created in the cluster. A hash cluster is created using the SQL command CREATE CLUSTER. For example, the following statement creates a cluster named TRIAL_CLUSTER that stores the TRIAL table, clustered by the TRIALNO column:

CREATE CLUSTER trial_cluster (trialno NUMBER(5,0))

   PCTUSED 80

   PCTFREE 5

   TABLESPACE users

   STORAGE (INITIAL 250K     NEXT 50K

      MINEXTENTS 1     MAXEXTENTS 3

      PCTINCREASE 0)

   SIZE 2K

   HASH IS trialno HASHKEYS 150;

CREATE TABLE trial (

   trialno         NUMBER(5,0) PRIMARY KEY,

   ...)

   CLUSTER trial_cluster (trialno);

The following sections explain setting the parameters of the CREATE CLUSTER command specific to hash clusters.

See Also: For additional information about creating tables in a cluster, guidelines for setting other parameters of the CREATE CLUSTER command, and the privileges required to create a hash cluster, see "Creating Clusters" .

Controlling Space Use Within a Hash Cluster

When creating a hash cluster, it is important to choose the cluster key correctly and set the HASH IS, SIZE, and HASHKEYS parameters so that performance and space use are optimal. The following guidelines describe how to set these parameters.

Choosing the Key

Choosing the correct cluster key is dependent on the most common types of queries issued against the clustered tables. For example, consider the EMP table in a hash cluster. If queries often select rows by employee number, the EMPNO column should be the cluster key; if queries often select rows by department number, the DEPTNO column should be the cluster key. For hash clusters that contain a single table, the cluster key is typically the entire primary key of the contained table.

The key of a hash cluster, like that of an index cluster, can be a single column or a composite key (multiple column key). A hash cluster with a composite key must use Oracle's internal hash function.

Setting HASH IS

Only specify the HASH IS parameter if the cluster key is a single column of the NUMBER datatype, and contains uniformly distributed integers. If the above conditions apply, you can distribute rows in the cluster so that each unique cluster key value hashes, with no collisions, to a unique hash value. If these conditions do not apply, omit this option so that you use the internal hash function.

Setting SIZE

SIZE should be set to the average amount of space required to hold all rows for any given hash key. Therefore, to properly determine SIZE, you must be aware of the characteristics of your data:

If the hash cluster is to contain only a single table and the hash key values of the rows in that table are unique (one row per value), SIZE can be set to the average row size in the cluster.

If the hash cluster is to contain multiple tables, SIZE can be set to the average amount of space required to hold all rows associated with a representative hash value.

See Also: To estimate a preliminary value for SIZE, follow the procedures given in "Estimating Space Required by Hash Clusters"

. If the preliminary value for SIZE is small (more than four hash keys can be assigned per data block), you can use this value for SIZE in the CREATE CLUSTER command.

However, if the value of SIZE is large (fewer than five hash keys can be assigned per data block), you should also consider the expected frequency of collisions and whether performance of data retrieval or efficiency of space usage is more important to you:

If the hash cluster does not use the internal hash function (if you specified HASH IS) and you expect little or no collisions, you can set SIZE as estimated; no collisions occur and space is used as efficiently as possible.

If you expect frequent collisions on inserts, the likelihood of overflow blocks being allocated to store rows is high. To reduce the possibility of overflow blocks and maximize performance when collisions are frequent, you should increase SIZE according to Table 15 - 1.

Available Space per Block/Calc`d SIZE	Setting for SIZE
1	Calculated SIZE
2	Calculated SIZE + 15%
3	Calculated SIZE + 12%
4	Calculated SIZE + 8%
>4	Calculated SIZE

Table 15 - 1. SIZE Increase Chart

Overestimating the value of SIZE increases the amount of unused space in the cluster. If space efficiency is more important than the performance of data retrieval, disregard the above adjustments and use the estimated value for SIZE.

Setting HASHKEYS

For maximum distribution of rows in a hash cluster, HASHKEYS should always be a prime number.

For example, suppose you cluster the EMP table by DEPTNO, and there are 100 DEPTNOs, with values 10, 20, . . ., 1000. Assuming you bypass the internal hash function and you create a cluster with HASHKEYS of 100, then department 10 will hash to 10, department 20 to 20, . . ., department 110 to 10 (110 mod 100), department 120 to 20, and so on. Notice that there are 10 entries for hash values of 10, 20, . . ., but none for 1, 2, . . ., and so on. As a result, there is a lot of wasted space and possibly a lot of overflow blocks because of collisions. Alternatively, if HASHKEYS is set to 101, then each department number hashes to a unique hash key value.

Controlling Space in Hash Clusters: Examples

The following examples show how to correctly choose the cluster key and set the HASH IS, SIZE, and HASHKEYS parameters. For all examples, assume that the data block size is 2K and that on average, 1950 bytes of each block is available data space (block size minus overhead).

Example 1

You decide to load the EMP table into a hash cluster. Most queries retrieve employee records by their employee number. You estimate that the maximum number of rows in the EMP table at any given time is 10000 and that the average row size is 55 bytes.

In this case, EMPNO should be the cluster key. Since this column contains integers that are unique, the internal hash function can be bypassed. SIZE can be set to the average row size, 55 bytes; note that 34 hash keys are assigned per data block. HASHKEYS can be set to the number of rows in the table, 10000, rounded up to the next highest prime number, 10001:

CREATE CLUSTER emp_cluster (empno NUMBER)

. . .

SIZE 55

HASH IS empno HASHKEYS 10001;

Example 2

Conditions similar to the previous example exist. In this case, however, rows are usually retrieved by department number. At most, there are 1000 departments with an average of 10 employees per department. Note that department numbers increment by 10 (0, 10, 20, 30, . . . ).

In this case, DEPTNO should be the cluster key. Since this column contains integers that are uniformly distributed, the internal hash function can be bypassed. A pre-estimated SIZE (the average amount of space required to hold all rows per department) is 55 bytes * 10, or 550 bytes. Using this value for SIZE, only three hash keys can be assigned per data block. If you expect some collisions and want maximum performance of data retrieval, slightly alter your estimated SIZE to prevent collisions from requiring overflow blocks. By adjusting SIZE by 12%, to 620 bytes (see previous section about setting SIZE for clarification), only three hash keys are assigned per data block, leaving more space for rows from expected collisions.

HASHKEYS can be set to the number of unique department numbers, 1000, rounded up to the next highest prime number, 1009:

CREATE CLUSTER emp_cluster (deptno NUMBER)

. . .

SIZE 620

HASH IS deptno HASHKEYS 1009;