Choosing the right partitioning key in a DB2 DPF environment


This article assumes the following:

  • You have a DB2 Database Partitioning Feature (DPF) environment and are familiar with DB2 DPF concepts.
  • You are either designing a new table that will be hash-partitioned, or you have an existing hash-partitioned table that might have a data skew problem.

This article helps you to accomplish the following tasks:

  • Choose the right initial partitioning key (PK) prior to defining and populating a table
  • Evaluate the quality of the existing PK on a table
  • Evaluate the quality of candidate replacement PKs on an existing table
  • Change the PK while keeping the table online

This article provides the following type of help:

  • Review of concepts and considerations
  • Design guidelines
  • New routines to estimate data skews for existing and new partitioning keys

Quick review of hash partitioning

In DPF environments, large tables are partitioned across multiple database partitions. There are several ways to partition a table, but the focus of this article is distribution by hash. For other ways to partition the table, refer to the article in the Related topics section.

Distribution by hash is based on partitioning keys. A partitioning key consists of one or more columns defined at table creation. For each newly inserted record, the partitioning key determines on which database partition this record should be stored. The placement is determined by an internal hashing function that takes the values in the column or columns defined as a partitioning key and returns the database partition number. A hashing function is a deterministic function, which means that for the same partitioning key values it always generates the same partitioning placement, assuming that there are no changes to the database partition group definition.

The following syntax examples demonstrate the steps necessary to create a hash-partitioned table:

  1. Create a database partition group that specifies the database partitions that will participate in the partitioning. The following example illustrates how to create a database partition group PDPG on database partitions 1, 2, 3, and 4 (step 1 in the sketch that follows this list).

    According to IBM Smart Analytics System and IBM InfoSphere™ Balanced Warehouse best practices, hash-partitioned tables should not be created on the coordinator or administration partition (database partition 0). Database partition 0 is typically used for storing small, non-partitioned lookup tables.

  2. Create the table space in the database partition group. All objects created in this table space will be partitioned across the database partitions specified in the database partition group definition (step 2 in the sketch that follows this list).
  3. Create the table in the table space. At this point, the definition of the table is tied to the definition of the database partition group. The only way to change this relationship is to drop the table and recreate it in a different table space that is tied to a different database partition group.

    In the following example, Table1 is created on database partitions 1, 2, 3, and 4, and is distributed based on a partitioning key on column COL1 (step 3 in the sketch that follows this list).
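
The original listings are not reproduced here. The following sketch illustrates the three steps; the object names (PDPG, TS_PDPG, TABLE1) and the column definitions are illustrative:

    -- Step 1: create a database partition group on partitions 1 through 4
    CREATE DATABASE PARTITION GROUP PDPG
       ON DBPARTITIONNUMS (1 TO 4);

    -- Step 2: create a table space in that database partition group
    CREATE TABLESPACE TS_PDPG
       IN DATABASE PARTITION GROUP PDPG
       MANAGED BY AUTOMATIC STORAGE;

    -- Step 3: create the table, distributed by hash on COL1
    CREATE TABLE TABLE1 (
       COL1 INTEGER NOT NULL,
       COL2 VARCHAR(100)
    ) IN TS_PDPG
      DISTRIBUTE BY HASH (COL1);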


Keep in mind that the database partition group definition can change. For example, new database partitions can be added. If this happens, a hash-partitioned table defined prior to the modification will not take advantage of the new partitions until the data is redistributed using the REDISTRIBUTE DATABASE PARTITION GROUP command.
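
For example, after new partitions are added to the group, the data can be spread evenly over all partitions with the following command (using the PDPG group from the earlier sketch):

    REDISTRIBUTE DATABASE PARTITION GROUP PDPG UNIFORM;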

Define the partitioning key

The partitioning key is defined using the DISTRIBUTE BY HASH clause of the CREATE TABLE statement. After the partitioning key is defined, it cannot be altered. The only way to change it is to recreate the table.

The following rules and recommendations apply to the partitioning key definition:

  • The primary key and any unique index of the table must each be a superset of the partitioning key. In other words, all columns that are part of the partitioning key must be present in the primary key or unique index definition. The order of the columns does not matter.
  • A partitioning key should include one to three columns. Typically, the fewer the columns, the better.
  • An integer partitioning key is more efficient than a character key, which is more efficient than a decimal key.
  • If there is no partitioning key provided explicitly in the CREATE TABLE command, the following defaults are used:
    • If a primary key is specified in the CREATE TABLE statement, the first column of the primary key is used as the distribution key.
    • If there is no primary key, the first column that is not a long field is used.

Why choosing the right partitioning key is important

Choosing the right partitioning key is critical for two reasons:

  • It improves the performance of queries against hash-partitioned tables
  • It balances the storage requirements for all partitions

Data balancing

Data balancing refers to the relative number of records stored on each individual database partition. Ideally, each database partition in a hash-partitioned table should hold the same number of records. If records are stored unequally across the database partitions, it can result in disproportional storage requirements and performance problems. The performance problems in this scenario result from the fact that the query work is done independently on each database partition, but the results are consolidated by the coordinating agent, which must wait until all database partitions return a result set. In other words, the total performance is tied to the performance of the slowest database partition.

Table data skew refers to a difference between the number of records in a table on particular database partitions and the average number of records across all database partitions for this table. So, for example, if the table data skew on database partition 1 is 60% for a particular table, it means that this database partition contains 60% more rows from this table than the average database partition.

From the best practices perspective, the table data skew on every individual database partition should be no more than 10%. To achieve this goal, the partitioning key should be selected on the columns that have high cardinality, or in other words, that contain a large number of distinct values.

If your table statistics are up to date, you can quickly and inexpensively check the cardinality of the columns in your existing table by issuing the following statement:

Listing 1. Checking the cardinality of the columns in an existing table
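
The original listing is not reproduced here. A minimal equivalent query against the SYSCAT.COLUMNS catalog view follows; the schema and table names (TPCD, SUPPLIER) are placeholders:

    -- COLCARD is -1 if statistics have not been collected
    SELECT COLNAME, COLCARD
    FROM SYSCAT.COLUMNS
    WHERE TABSCHEMA = 'TPCD'
      AND TABNAME = 'SUPPLIER'
    ORDER BY COLCARD DESC;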

Collocation

Collocation between two joined tables in a query means that the matching rows of the two tables always reside in the same database partition. If the join is not collocated, the database manager must ship the records from one database partition to another over the network, which results in sub-optimal performance. The following requirements must be met for the database manager to use a collocated join (an illustrative sketch follows the list):

  • The joined tables must be defined in the same database partition group.
  • The partitioning keys of the joined tables must match. In other words, they must contain the same number of columns, and the corresponding columns must be partition compatible.
  • For each column in the partitioning key of the joined tables, an equijoin predicate must exist.
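
As an illustration, the following sketch satisfies all three requirements, so the join can be collocated. The table and column names are hypothetical, and TS_PDPG is the table space from the earlier sketch:

    -- Both tables reside in the same database partition group (via TS_PDPG),
    -- and both are distributed on a single, type-compatible customer key
    CREATE TABLE CUSTOMER (
       C_CUSTKEY INTEGER NOT NULL PRIMARY KEY,
       C_NAME    VARCHAR(25)
    ) IN TS_PDPG DISTRIBUTE BY HASH (C_CUSTKEY);

    CREATE TABLE ORDERS (
       O_ORDERKEY INTEGER NOT NULL,
       O_CUSTKEY  INTEGER NOT NULL
    ) IN TS_PDPG DISTRIBUTE BY HASH (O_CUSTKEY);

    -- The equijoin predicate covers the full partitioning key of both tables
    SELECT O.O_ORDERKEY, C.C_NAME
    FROM ORDERS O
    JOIN CUSTOMER C ON O.O_CUSTKEY = C.C_CUSTKEY;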

If you choose a partitioning key based on your query workload, the partitioning key should typically consist of a join column, or a set of join columns, that is frequently used in many queries.

Although collocated tables typically achieve the strongest performance, it is not possible in practice to collocate all tables. In addition, it is not a good idea to select partitioning keys based on a handful of SQL statements. In decision-support environments, queries can often be unpredictable. In this kind of environment, you should examine your data model to determine the best choice for partitioning keys. The data model and the business relationship between tables can provide a more stable way of selecting a partitioning key than specific SQL statements.

When choosing partitioning keys, draw a data model that shows the relationships among the tables that are in your database. Identify frequent joins and high-use tables. Based on your data model, you should select partitioning keys that favor frequent joins and that are based on a primary key. Ideally, you should collocate frequently joined tables. Another strategy to improve the collocation of the join is to replicate smaller dimensional tables on each database partition.
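
A minimal sketch of such a replicated materialized query table, assuming a small TPCD.NATION dimension table and the TS_PDPG table space from earlier, might look like this:

    -- Replicate a small dimension table to every database partition
    CREATE TABLE TPCD.NATION_REP AS
       (SELECT * FROM TPCD.NATION)
       DATA INITIALLY DEFERRED REFRESH DEFERRED
       IN TS_PDPG REPLICATED;

    -- Populate the replica
    REFRESH TABLE TPCD.NATION_REP;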

Collocation compared to data balancing

In some cases, you might find that guidelines for choosing the proper partitioning key based on collocation and data balancing contradict one another. In such cases, it is recommended that you choose a partitioning key based on even data balancing.

Validate the partitioning keys on existing tables

If you want to validate how good your partitioning keys are, you should check to see if queries in your workload are collocated and if the data is balanced properly. It is also possible that over time, as your data changes, old partitioning keys become less optimal than they were previously. You can check the collocation in the query joins by looking at the access plan generated by DB2 Explain. If the query is not collocated, you typically will see the TQUEUE (table queue) operator feeding the join, as shown in Figure 1:

Figure 1. Explain graph that includes a TQUEUE operator

To check if the data in the table is balanced properly across the database partitions, you can run a simple count on your table grouped by the database partition ID with the help of the DBPARTITIONNUM function.
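
For example, for a hypothetical TPCD.SUPPLIER table, such a count might look like this:

    SELECT DBPARTITIONNUM(S_SUPPKEY) AS DB_PARTITION,
           COUNT(*) AS ROW_COUNT
    FROM TPCD.SUPPLIER
    GROUP BY DBPARTITIONNUM(S_SUPPKEY)
    ORDER BY DB_PARTITION;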

You can also use the custom ESTIMATE_EXISTING_DATA_SKEW stored procedure (available in the Download section), which provides more user-friendly output, including a list of database partitions, the skew percentage as compared to the average, and more. This routine can be run on a sample of the original data for faster performance. (See the Appendix for a full routine description.)

If you are planning to run this routine in a production environment, consider running it during a maintenance window or when the system is under a light load. You may also want to try it on one of the smaller tables with the sample value of 1% to get an estimate of how long it takes to return results. The total execution time is included at the bottom of the report.

Example 1

This example tests the data skew in a scenario in which the partitioning key was changed to S_NATIONKEY. This example uses only 25% of the data in the sampling. As you can see from the output, the data is extensively skewed: data volumes on some database partitions deviate from the average by 60%.

Listing 2. Measuring the existing data skew for a single table
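
The original listing is not reproduced here. Based on the parameter descriptions in the Appendix, and assuming the table under test is TPCD.SUPPLIER, the call might look like this:

    -- Regular output format, 25% page-level sample
    CALL ESTIMATE_EXISTING_DATA_SKEW('N', 'TPCD', 'SUPPLIER', 25);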

Example 2

This example demonstrates the usage of the wildcard character in the ESTIMATE_EXISTING_DATA_SKEW routine. Listing 3 shows a report for the existing data skew on all tables that have schema TPCD and a table name that starts with 'PART.' Since the tables are relatively large, the sample is built on 1% of the data to reduce the performance cost.

Listing 3. Measuring the existing data skew for multiple tables
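
The original listing is not reproduced here. Based on the parameter descriptions in the Appendix, the call might look like this:

    -- All TPCD tables whose names start with PART, 1% sample
    CALL ESTIMATE_EXISTING_DATA_SKEW('N', 'TPCD', 'PART%', 1);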

Evaluate the quality of candidate replacement PKs on the existing table

If you decide to change an existing partitioning key, it is important to determine if the new partitioning key that you are considering will result in good query collocation and evenly distributed data.

To check for query collocation, it is recommended that you collect the queries that characterize your workload, place them in a file, and then run a db2advis report to get recommendations on the new partitioning keys:
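
For example, using the -m P option to request repartitioning recommendations (the database name TPCD and the workload file name are placeholders):

    db2advis -d TPCD -i workload.sql -m P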

You can also run a report based on the recently executed queries that still reside in the package cache using the following form of the db2advis utility:
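
For example, using the -g option to retrieve the dynamic SQL statements from the package cache (the database name is again a placeholder):

    db2advis -d TPCD -g -m P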

Listing 4 provides an example db2advis output:

Listing 4. db2advis output

To check if the data would be properly balanced using the new partitioning key, you can use the ESTIMATE_NEW_DATA_SKEW routine that is also provided in the Download section. This routine creates a copy of your existing table with the new partitioning key and loads it partially or fully with the data from the original table. It then runs the same report as the existing data skew estimation and, at the end, drops the copy table. Note that the table space containing the original table must have enough free space to hold the sampled copy, which is at minimum 1% of the original data, since the copy is created in the same table space.

Example 3

This example tests the data skew in a scenario in which the partitioning key was changed from S_NATIONKEY to S_ID. This example uses 100% of the data in the sampling. As this example demonstrates, the new partitioning key causes minimal data skew and is a much better choice than the original S_NATIONKEY key from Example 1.

Listing 5. Estimating the data skew for a new partitioning key
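
The original listing is not reproduced here. Based on the parameter descriptions in the Appendix, and assuming the same TPCD.SUPPLIER table as in Example 1, the call might look like this:

    -- New candidate partitioning key S_ID, 100% sampling
    CALL ESTIMATE_NEW_DATA_SKEW('N', 'TPCD', 'SUPPLIER', 'S_ID', 100);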

Change the PK while keeping the table online

A new routine in DB2 9.7 named ADMIN_MOVE_TABLE allows you to automatically change the partitioning key of a table while keeping the table fully accessible for reads and writes. In addition to the partitioning key change, this procedure can move the table to a different table space, change column definitions, and more.

Example 4

This example changes the partitioning key of the TPCD.PART table from COL1 to (COL2, COL3). It uses the LOAD option to improve the performance of the ADMIN_MOVE_TABLE routine.

Listing 6. Changing partitioning keys
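
The original listing is not reproduced here. The following sketch is based on the documented 11-parameter form of the procedure; only the schema, the table, the new partitioning key columns, and the options are filled in, and empty strings leave the remaining characteristics unchanged:

    CALL SYSPROC.ADMIN_MOVE_TABLE(
       'TPCD',           -- source table schema
       'PART',           -- source table name
       '',               -- data table space (unchanged)
       '',               -- index table space (unchanged)
       '',               -- LOB table space (unchanged)
       '',               -- MDC columns (unchanged)
       'COL2,COL3',      -- new partitioning key columns
       '',               -- range partitioning definition (unchanged)
       '',               -- column definitions (unchanged)
       'COPY_USE_LOAD',  -- use LOAD during the copy phase
       'MOVE');          -- perform all move phases in one call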

While the ADMIN_MOVE_TABLE procedure is running, the TPCD.PART table is fully accessible and the change to the partitioning key is transparent to the end users.

Conclusion

Choosing appropriate partitioning keys is essential for optimizing database performance in a partitioned environment based on DB2 software. This article provided guidance and tooling for choosing the best partitioning keys based on your needs.

This article described:

  • The concepts related to partitioning keys, and the rules and recommendations for creating partitioning keys
  • Routines that can help you estimate the data skew for new and existing partitioning keys
  • How to change partitioning keys while keeping the table fully accessible

Appendix: Routine reference documentation

Prerequisites

Both the ESTIMATE_EXISTING_DATA_SKEW and the ESTIMATE_NEW_DATA_SKEW procedures are supported in DB2 9.7 or later. The routine ADMIN_MOVE_TABLE that is used for the actual movement of the table is shipped with the core DB2 9.7 product or later. For the ESTIMATE_NEW_DATA_SKEW routine, there must be enough free space in the table space that contains the original table to store the sampling data.

Deployment instructions

  1. Download and save the estimate_data_skew.sql file found in the Download section.
  2. Connect to the database from the command line and deploy the routines using the following command:
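
    The original command is not reproduced here; assuming the script uses @ as its statement terminator (a common convention for scripts that create routines), it might be:

        db2 -td@ -vf estimate_data_skew.sql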

ESTIMATE_NEW_DATA_SKEW procedure

The ESTIMATE_NEW_DATA_SKEW routine estimates the data skew of individual database partitions on an existing table with a new partitioning key. To improve the performance and lower the storage requirements of this routine, the estimation can be based on a subset of the data using extremely fast sampling on the page level.

Syntax
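
The original syntax diagram is not reproduced here. Reconstructed from the parameter descriptions below, the call takes approximately the following form:

    CALL ESTIMATE_NEW_DATA_SKEW(
       csv_format,             -- optional; 'Y' for CSV output, 'N' (default) for the regular report
       in_tabschema,           -- schema name (case-sensitive, no wildcards)
       in_tabname,             -- table name (no wildcards)
       new_partitioning_keys,  -- candidate partitioning key column(s)
       sampling_percentage)    -- 1 to 100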

Procedure parameters

csv_format (optional)
This optional input parameter specifies the format in which the data is returned. A value of 'Y' requests CSV output with the headings SCHEMA, TABLE, PARTITION, SAMPLE%, TABLEROWCOUNT, PARTAVG, PARTROWCOUNT, and SKEW. The parameter defaults to 'N', which produces the regular report format.
in_tabschema
This input parameter specifies the name of the schema that contains the table to be estimated for data skews. This parameter is case-sensitive and has a data type of VARCHAR(128). This parameter does not support wildcards.
in_tabname
This input parameter specifies the name of the table to be estimated for data skews. This parameter is not case-sensitive and has a data type of VARCHAR(128). This parameter does not support wildcards.
new_partitioning_keys
This input parameter specifies the new partitioning keys to be used in the estimation of data skews.
sampling_percentage
This input parameter specifies the percentage of data to be used in the data skew estimation. Valid values are 1 to 100, where 100 means that the stored procedure will use all records in the table for the estimation. The purpose of this parameter is to improve performance and minimize the space usage when estimating data skew with new partitioning keys. If performance and disk space are not an issue, specify 100 for this value.

ESTIMATE_EXISTING_DATA_SKEW procedure

The ESTIMATE_EXISTING_DATA_SKEW stored procedure estimates the data skew of individual database partitions in one or more tables based on the existing partitioning keys. To improve the performance of this procedure, the estimation can be based on a subset of the data using extremely fast sampling on the page level.

Syntax
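
The original syntax diagram is not reproduced here. Reconstructed from the parameter descriptions below, the call takes approximately the following form:

    CALL ESTIMATE_EXISTING_DATA_SKEW(
       csv_format,           -- optional; 'Y' for CSV output, 'N' (default) for the regular report
       in_tabschema,         -- schema name; % wildcard supported, NULL for all schemas
       in_tabname,           -- table name; % wildcard supported
       sampling_percentage)  -- 1 to 100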

Procedure parameters

csv_format (optional)
This optional input parameter specifies the format in which the data is returned. A value of 'Y' requests CSV output with the headings SCHEMA, TABLE, PARTITION, SAMPLE%, TABLEROWCOUNT, PARTAVG, PARTROWCOUNT, and SKEW. The parameter defaults to 'N', which produces the regular report format.
in_tabschema
This input parameter specifies the name of the schema that contains the table to be estimated for data skews. This parameter is case-sensitive and has a data type of VARCHAR(128). This parameter supports % as a wildcard. If the NULL value is specified, a report will be run for all schemas defined in the database.
in_tabname
This input parameter specifies the name of the table to be estimated for data skews. This parameter is not case-sensitive and has a data type of VARCHAR(128). This parameter supports % as a wildcard.
sampling_percentage
This input parameter specifies the percentage of data to be used in the data skew estimation. Valid values are 1 to 100, where 100 means that the stored procedure will use all records in the table for the estimation.

Downloadable resources

  • Sample SQL script for this article (estimate_data_skew.zip 4KB)

Related topics

  • 'DB2 partitioning features' (developerWorks, August 2006): Get an introduction to some of the DB2 LUW table design features, such as table partitioning, multidimensional clustering (MDC), database partitioned tables, and materialized query tables (MQT).
  • DB2 for Linux, UNIX, and Windows area on developerWorks: Get the resources you need to advance your DB2 skills.
  • DB2 Information Center: Connect to the DB2 Information Center to get more information about partitioned database environments.
    • 'ADMIN_MOVE_TABLE procedure - Move an online table'
    • 'db2advis - DB2 design advisor command'
  • DB2 9.7 for Linux, UNIX, and Windows: Download a free trial version of DB2 9.7 for Linux, UNIX, and Windows.
  • Evaluate IBM products in the way that suits you best: Download a product trial, try a product online, use a product in a cloud environment, or spend a few hours in the SOA Sandbox learning how to implement Service Oriented Architecture efficiently.