Best Practice #1 - Pulling High Volumes of Data
This is the specific scenario of transferring a huge volume of data into destination tables that have indexes. If the requirement is to transfer a huge volume of data (more than 100 million records), it is preferable for the package to drop all the non-clustered and clustered indexes on the destination tables before starting the data transfer. After the transfer completes, the package must recreate the clustered and non-clustered indexes on the destination tables.
Reason: Consider a scenario where the destination table has one clustered index and three non-clustered indexes, and the job of the SSIS package is to push 100 million records into that table. At the beginning, the job will push data at a reasonable rate (for the first few million records), but after some time performance will degrade because SQL Server consumes more and more time maintaining the indexes on each insert.
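As a rough sketch, the pre-load and post-load steps can be implemented in Execute SQL Tasks placed before and after the Data Flow Task; the table, column, and index names below are hypothetical:

-- Pre-load step (Execute SQL Task before the Data Flow Task): drop the
-- non-clustered indexes and then the clustered index on the destination table.
DROP INDEX IX_SalesFact_CustomerKey ON dbo.SalesFact;
DROP INDEX IX_SalesFact_ProductKey ON dbo.SalesFact;
DROP INDEX IX_SalesFact_DateKey ON dbo.SalesFact;
DROP INDEX CIX_SalesFact ON dbo.SalesFact;    -- clustered index

-- ... the Data Flow Task transfers the 100+ million rows here ...

-- Post-load step (Execute SQL Task after the Data Flow Task): recreate the
-- clustered index first, then the non-clustered indexes.
CREATE CLUSTERED INDEX CIX_SalesFact ON dbo.SalesFact (SalesKey);
CREATE NONCLUSTERED INDEX IX_SalesFact_CustomerKey ON dbo.SalesFact (CustomerKey);
CREATE NONCLUSTERED INDEX IX_SalesFact_ProductKey ON dbo.SalesFact (ProductKey);
CREATE NONCLUSTERED INDEX IX_SalesFact_DateKey ON dbo.SalesFact (DateKey);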
Best Practice #2 - Avoid SELECT * in Look Up component of Data Flow Task
To fetch the lookup data, use the result of a SQL query rather than the 'table or view' option on the connection tab of the Lookup component, as shown in Image 1.0.
(Image 1.0)
Reason: Consider a scenario where the package uses a table with 10 columns to fetch lookup values. If the package uses the 'table or view' option, it will fetch the records with all 10 columns, which increases the memory buffer size. If the table holds a large amount of data, this can be a cause of poor performance. So it is preferable to fetch the lookup values with a SQL query that contains only the required columns.
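For example, if the Lookup only needs the surrogate key and the business key from a 10-column dimension table, a query like the hypothetical one below keeps the lookup cache small (table and column names are illustrative):

-- Only the two columns the Lookup actually uses:
SELECT CustomerKey, CustomerBusinessID
FROM dbo.DimCustomer;

-- as opposed to the 'table or view' option, which is effectively:
-- SELECT * FROM dbo.DimCustomer;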
Best Practice #3 – Use a combination of Lookup with OLE DB Destination and OLE DB Command as a replacement for Slowly Changing Dimensions (SCD)
The SCD task performs poorly, and this becomes noticeable as soon as the data flow grows beyond roughly 50k records. It is preferable to use a Lookup component to check for the existence of records in the destination table: based on the key values, a record can be inserted through the OLE DB Destination when no lookup match is found, and its columns can be updated using an OLE DB Command task when the record is present in the destination table (alternatively, the existing record can be updated as a historical record and, by adding one more OLE DB Destination after that, the new version of the record can be inserted).
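As a sketch of the update path, the OLE DB Command can run a parameterized statement like the hypothetical one below; each ? placeholder is mapped to an input column of the data flow in the component's column mappings (all object names are illustrative):

UPDATE dbo.DimCustomer
SET CustomerName = ?,
    City = ?,
    ModifiedDate = GETDATE()
WHERE CustomerBusinessID = ?;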
As an alternative, the package can use the MERGE statement of SQL Server 2008 instead of the SCD task for small chunks of data, but it should not be used if the destination table is huge.
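A minimal MERGE sketch, assuming the incoming rows have first been landed in a staging table (all object names are hypothetical):

MERGE dbo.DimCustomer AS target
USING dbo.StgCustomer AS source
    ON target.CustomerBusinessID = source.CustomerBusinessID
WHEN MATCHED THEN
    UPDATE SET target.CustomerName = source.CustomerName,
               target.City = source.City
WHEN NOT MATCHED BY TARGET THEN
    INSERT (CustomerBusinessID, CustomerName, City)
    VALUES (source.CustomerBusinessID, source.CustomerName, source.City);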
Best Practice #4 – In the data flow task avoid blocking transformations
Partially blocking transformations: Partially blocking transformations are often used to combine datasets. They tend to have multiple data inputs. As a result, their output may have the same number of records as, or more or fewer records than, the total number of input records. Since the number of input records will likely not match the number of output records, these transformations are also called asynchronous transformations. Examples of partially blocking transformation components available in SSIS include Merge, Merge Join, and Union All. With partially blocking transformations, the output of the transformation is copied into a new buffer and a new thread may be introduced into the data flow.
Blocking transformations: Blocking transformations must read and process all input records before creating any output records. Of all of the transformation types, these transformations perform the most work and can have the greatest impact on available resources. Example components in SSIS include Aggregate and Sort. Like partially blocking transformations, blocking transformations are also considered to be asynchronous. Similarly, when a blocking transformation is encountered in the data flow, a new buffer is created for its output and a new thread is introduced into the data flow.
Example: In the design shown in Image 2.0, a Script Component generates 100,000,000 rows that first pass through a lookup. If the lookup fails because the source value is not found, then an error record is sent to the Derived Column transformation where a default value is assigned to the error record. After the error processing is complete, the error rows are combined with the original data set before loading all rows into the destination.
Analysis: Design Alternative 2 uses the same Script Component as Design Alternative 1 to generate 100,000,000 rows that pass through a lookup. In Design Alternative 2, instead of handling lookup failures as error records, all lookup failures are ignored, and a Derived Column transformation is used to assign values to the columns that have NULL values for the looked-up column.
(Image 2.0)
Result: Of the two execution alternatives in this scenario, Design Alternative 1's biggest performance bottleneck is the extra copy of the data in memory created for the partially blocking Union All transformation. Design Alternative 2 performs faster than Design Alternative 1: with one execution tree, the operations are consolidated and the overhead of copying data into a new buffer is avoided.
Best Practice #5 - Avoid SELECT *
The Data Flow Task (DFT) of SSIS uses a buffer (a chunk of memory) oriented architecture for data transfer and transformation. When data travels from the source to the destination, the data first comes into the buffer; required transformations are done in the buffer itself and then written to the destination.
The size of the buffer is dependent on several factors; one of them is the estimated row size. The estimated row size is determined by summing the maximum size of all the columns in the row. So the more columns in a row, the fewer rows fit in a buffer, and the more buffers are required, which degrades performance. Hence it is recommended to select only those columns which are required at the destination. Even if you need all the columns from the source, you should name the columns explicitly in the SELECT statement; otherwise it takes another round trip for the source to gather metadata about the columns when you use SELECT *.
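In practice this means typing the column list in the source's SQL command, as in the hypothetical query below, instead of using SELECT * or the 'table or view' access mode (table and column names are illustrative):

-- Only the columns the data flow actually needs:
SELECT OrderID, OrderDate, CustomerKey, SalesAmount
FROM dbo.SalesOrderDetail;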
Best Practice #6 - Effect of OLEDB Destination Settings
There are a couple of settings on the OLE DB Destination which can impact the performance of the data transfer, as listed below (rough T-SQL analogies of the Keep Identity, Keep Nulls, and Table Lock settings are sketched after the list).
A.) Data Access Mode – This setting provides the 'fast load' option, which internally uses a BULK INSERT statement for uploading data into the destination table instead of a simple INSERT statement (one per row) as is the case for the other options. So unless you have a reason to change it, don't change this default value of fast load. If you select the 'fast load' option, there are also a couple of other settings which you can use, as discussed below.
B.) Keep Identity – By default this setting is unchecked, which means the destination table (if it has an identity column) will create identity values on its own. If you check this setting, the dataflow engine will ensure that the source identity values are preserved and the same values are inserted into the destination table.
C.) Keep Nulls – Again, by default this setting is unchecked, which means that if a NULL value comes from the source for a particular column, the default value will be inserted into the destination table (if a default constraint is defined on the target column). If you check this option, the default constraint on the destination table's column will be ignored and the NULL from the source column will be preserved and inserted into the destination.
D.) Table Lock – By default this setting is checked, and the recommendation is to leave it checked unless the same table is being used by some other process at the same time. It specifies that a table lock will be acquired on the destination table instead of multiple row-level locks, which could otherwise turn into lock escalation problems.
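As a rough analogy in plain T-SQL, the Keep Identity, Keep Nulls, and Table Lock settings behave roughly like the hypothetical statements below (all object names are illustrative; the fast load path achieves this internally through the bulk load API rather than through these exact statements):

-- Keep Identity: comparable to inserting with IDENTITY_INSERT switched on,
-- so the identity values coming from the source are preserved.
SET IDENTITY_INSERT dbo.DimProduct ON;
INSERT INTO dbo.DimProduct (ProductKey, ProductName)   -- ProductKey is IDENTITY
VALUES (1001, 'Mountain Bike');
SET IDENTITY_INSERT dbo.DimProduct OFF;

-- Keep Nulls: the difference shows up on a column with a default constraint.
CREATE TABLE dbo.Demo
(
    ID INT NOT NULL,
    Status VARCHAR(10) NULL CONSTRAINT DF_Demo_Status DEFAULT ('Unknown')
);

-- Keep Nulls unchecked (default): the default constraint fires for a source NULL,
-- as if the column had been omitted from the insert -> Status = 'Unknown'
INSERT INTO dbo.Demo (ID) VALUES (1);

-- Keep Nulls checked: the NULL coming from the source is preserved -> Status = NULL
INSERT INTO dbo.Demo (ID, Status) VALUES (2, NULL);

-- Table Lock: comparable to loading with a TABLOCK hint, so one table-level lock
-- is taken for the whole load instead of many row-level locks.
INSERT INTO dbo.SalesFact WITH (TABLOCK)
SELECT OrderID, OrderDate, CustomerKey, SalesAmount
FROM dbo.StgSalesFact;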
Best Practice #7 - Effect of Rows per Batch and Maximum Insert Commit Size Settings
Maximum insert commit size – The default value for this setting is '2147483647' (the largest value for a 4-byte integer type), which means all incoming rows will be committed once on successful completion. You can specify a positive value for this setting to indicate that a commit will be done after that number of records. You might wonder whether changing the default value for this setting will put overhead on the dataflow engine to commit several times. Yes, that is true, but at the same time it relieves the pressure on the transaction log and tempdb, which would otherwise grow tremendously, especially during high-volume data transfers.
The above setting is very important to understand in order to control the growth of tempdb and the transaction log. For example, if you leave 'Maximum insert commit size' at its default, the transaction log and tempdb will keep growing during the extraction process, and if you are transferring a high volume of data, tempdb will soon run out of space and your extraction will fail. So it is recommended to set this value to an optimum value based on your environment.
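To illustrate why smaller commits help, the hypothetical T-SQL batch loop below moves data from a staging table in chunks and commits every 100,000 rows, so the active portion of the transaction log stays small instead of one huge open transaction holding everything (all object names are illustrative):

DECLARE @BatchSize INT = 100000;
DECLARE @Rows INT = 1;

WHILE @Rows > 0
BEGIN
    BEGIN TRANSACTION;

    -- Move one batch from staging to the destination table
    DELETE TOP (@BatchSize)
    FROM dbo.StgSalesFact
    OUTPUT deleted.OrderID, deleted.OrderDate, deleted.CustomerKey, deleted.SalesAmount
        INTO dbo.SalesFact (OrderID, OrderDate, CustomerKey, SalesAmount);

    SET @Rows = @@ROWCOUNT;   -- capture before COMMIT resets @@ROWCOUNT

    COMMIT TRANSACTION;
END;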