Sunday, January 29, 2012

SSIS Package Control Flow Components

The control flow in an Integration Services package is constructed from different types of control flow elements: containers that provide structure in packages and services to tasks, tasks that provide functionality in packages, and precedence constraints that connect containers and tasks into a control flow. Below are descriptions of the most heavily used control flow components:
1.) Sequence Container: This container is used when the business wants to create a logical group of control flow tasks.
Example: Consider a scenario where, based on some condition, we want to perform a set of operations. We can place the tasks for each conditional outcome in separate containers and route the package's control flow to the appropriate container.

2.) For Loop Container: This component is used when the business wants to execute a set of tasks multiple times based on a condition.
Example: Consider a scenario where we want to transfer data from source to destination for 5 files. With the loop condition set to 5 iterations, we can run the enclosed control flow tasks 5 times.

3.) Foreach Loop Container: This component is used when the business wants to execute a set of tasks once for each member of a collection that is mostly dynamic in nature. Because the collection is enumerated at run time, it can carry more conditional information than the For Loop Container.
Example: Consider a scenario where files are stored in a folder and, for each file, we want to fetch its data and transfer it to a destination database.

4.) Analysis Services Execute DDL Task: This component can be used when the business wants to run DDL statements against SSAS objects such as cubes and dimensions.
Example: Consider a scenario where we want to create a replica of a cube. Using this task we can submit a DDL statement that creates a copy of the SSAS cube.

5.) Analysis Services Processing Task: This component can be used to process SSAS objects such as cubes, dimensions, and partitions in full or incremental mode.
Example: Consider a scenario where an SSIS job populates a data warehouse that contains fact and dimension tables, and we want to process the cube after the data has been pushed into the dimension and fact tables.

6.) Bulk Insert Task: This component loads data into a table by using the BULK INSERT SQL command.
Example: Consider a scenario where large flat files are stored in a folder and we want to transfer their data to SQL Server. If the files are large and we transfer the data using a traditional Data Flow Task, package performance may suffer. Using the Bulk Insert Task we can push the data into the destination with better performance, but we cannot transform the data in flight as we can in a Data Flow Task.
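For illustration, a minimal T-SQL sketch of the kind of statement the Bulk Insert Task issues under the covers (the file path, table name, and delimiters below are hypothetical):

```sql
-- Load a comma-delimited flat file into a staging table.
BULK INSERT dbo.StagingSales
FROM 'C:\ImportFiles\sales_20120129.txt'
WITH (
    FIELDTERMINATOR = ',',   -- column delimiter in the flat file
    ROWTERMINATOR   = '\n',  -- row delimiter
    FIRSTROW        = 2,     -- skip the header row
    TABLOCK                  -- take a table lock for faster loading
);
```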

7.) Data Flow Task: This component should be used when the business wants to perform ETL operations, i.e., extract data from a source, apply transformations to the data, and load the data into a destination database.

8.) Execute Package Task: This component can be used when the business wants to execute one SSIS package from another SSIS package.
Example: Consider a scenario where we have a data warehouse and want to execute the dimension packages and fact packages from a master package that controls the execution hierarchy.

9.) Execute Process Task: This component can be used when the business wants to run an application or batch file as part of a SQL Server Integration Services package workflow.
Example: The business can use the Execute Process task to expand a compressed text file; the package can then use the text file as a data source for its data flow.

10.) Execute SQL Task: This component can be used when the business wants to execute SQL statements, or invoke SQL objects such as functions and stored procedures, against a particular DBMS.
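For example, an Execute SQL Task with an OLE DB connection might run statements like the following (the table and procedure names are hypothetical); the ? placeholder is mapped to a package variable on the task's Parameter Mapping page:

```sql
-- Plain SQL statement executed against the connection manager's database.
TRUNCATE TABLE dbo.StagingSales;

-- Call a stored procedure, passing a package variable (e.g. the load date)
-- through the ? parameter marker of the OLE DB connection.
EXEC dbo.usp_LoadDailySales ?;
```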

11.) File System Task: This component can be used to perform file system operations such as creating directories and files, copying or deleting directories or files, moving files or directories, renaming files, and setting attributes.

12.) FTP Task: This component can be used when the business wants to download and upload data files and manage directories on servers.
Example: A package can download data files from a remote server or an Internet location as part of an Integration Services package workflow. You can use the FTP task for the following purposes:
· Copying directories and data files from one directory to another, before or after moving data, and applying transformations to the data.
· Logging in to a source FTP location and copying files or packages to a destination directory.
· Downloading files from an FTP location and applying transformations to column data before loading the data into a database.

13.) Message Queue Task:  The Message Queue task allows you to use Message Queuing (also known as MSMQ) to send and receive messages between SQL Server Integration Services packages, or to send messages to an application queue that is processed by a custom application. These messages can take the form of simple text, files, or variables and their values.
By using the Message Queue task, you can coordinate operations throughout your enterprise. Messages can be queued and delivered later if the destination is unavailable or busy.
Example: The output from a restaurant cash register can be sent in a data file message to the corporate payroll system, where data about each waiter's tips is extracted.

14.) Script Task: This component can be used when the business requires custom code to perform functions that are not available in the built-in tasks and transformations that SQL Server Integration Services provides. The code can be written in C# or VB.NET.
Example: A script can use Active Directory Service Interfaces (ADSI) to access and extract user names from Active Directory.

15.) Send Mail Task: This component can be used when the business wants to send email notifications. You can configure the subject and body text along with details such as To, From, CC, BCC, and attachments.
Example: The Send Mail task can be used to send notifications indicating the start, completion, or failure of an SSIS package execution.

16.) Transfer Database Task: This component can be used when the business wants to transfer a SQL Server database between two instances of SQL Server. The Transfer Database task supports SQL Server 2000 and later versions; it can transfer a database between two SQL Server 2000 instances, between two instances of a later version, or from a SQL Server 2000 instance to an instance of a later version.

17.) Transfer Error Messages Task: This component can be used when the business wants to transfer one or more SQL Server user-defined error messages between instances of SQL Server. User-defined messages are messages with an identifier equal to or greater than 50000; messages with an identifier less than 50000 are system error messages and cannot be transferred by using the Transfer Error Messages task. The task supports SQL Server 2000 and later versions as the source and destination, and there are no restrictions on which version is used as the source or the destination.
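As a reminder of what a transferable user-defined message looks like, here is a sketch of creating one on the source instance with sp_addmessage (the message number and text are hypothetical; anything below 50000 is reserved for system messages):

```sql
-- Create a user-defined error message (identifier must be 50000 or higher).
EXEC sp_addmessage
    @msgnum   = 50001,
    @severity = 16,
    @msgtext  = N'Nightly load failed for table %s.';
```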

18.) Transfer Jobs Task: This component can be used when the business wants to transfer one or more SQL Server Agent jobs between instances of SQL Server. There are no restrictions on which version is used as the source or the destination.

19.) Transfer SQL Server Objects Task: This component can be used when the business wants to transfer one or more types of objects in a SQL Server database between instances of SQL Server. For example, the task can copy tables and stored procedures. The task supports SQL Server 2000 and later versions as the source and destination, and there are no restrictions on which version is used as the source or the destination.

20.) Web Service Task: This component can be used when the business wants to execute a Web service method and use its result. You can use the Web Service task to assign the value returned by a Web service method to an SSIS package variable, or to write it to a flat or XML file.
Example: The business could obtain the highest temperature of the day from a Web service method, and then use that value to update a variable that is used in an expression that sets a column value.

21.) Maintenance Plan Tasks: SQL Server Integration Services includes a set of tasks that perform database maintenance functions. These tasks are commonly used in database maintenance plans, but they can also be included in SSIS packages. The maintenance tasks can be used with SQL Server 2000 and later versions of SQL Server databases and database objects. The following table lists the maintenance tasks and their usage:


Back Up Database Task: Performs different types of SQL Server database backups.
Check Database Integrity Task: Checks the allocation and structural integrity of database objects and indexes.
Execute SQL Server Agent Job Task: Runs SQL Server Agent jobs.
Execute T-SQL Statement Task: Runs Transact-SQL statements.
History Cleanup Task: Deletes entries in the history tables in the SQL Server msdb database.
Maintenance Cleanup Task: Removes files related to maintenance plans, including reports created by maintenance plans and database backup files.
Notify Operator Task: Sends notification messages to SQL Server Agent operators.
Rebuild Index Task: Rebuilds indexes in SQL Server database tables and views.
Reorganize Index Task: Reorganizes indexes in SQL Server database tables and views.
Shrink Database Task: Reduces the size of SQL Server database data and log files.
Update Statistics Task: Updates information about the distribution of key values for one or more sets of statistics on the specified table or view.


Monday, January 2, 2012

SSIS - Best practices


Best Practice #1 - Pulling High Volumes of Data

This is a specific scenario with huge data volumes and destination tables that have indexes. If the requirement is to transfer a huge amount of data (more than 100 million records), it is preferable to drop all the clustered and nonclustered indexes on the destination tables before the package starts the data transfer. After the transfer completes, the package should recreate the clustered and nonclustered indexes on the destination tables.

Reason: Consider a scenario where the destination table has one clustered index and three nonclustered indexes, and the job of the SSIS package is to push 100 million records into that table. At the beginning the job inserts data quickly (up to a few million rows), but over time performance degrades because SQL Server spends more and more time maintaining the indexes for each insert.
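A minimal T-SQL sketch of the approach, assuming a hypothetical fact table dbo.FactSales with one nonclustered index; the two statements would typically sit in Execute SQL Tasks before and after the data flow:

```sql
-- Before the load: drop the nonclustered index (repeat for each index on the table).
DROP INDEX IX_FactSales_CustomerKey ON dbo.FactSales;

-- ... the Data Flow / Bulk Insert Task pushes the 100 million rows here ...

-- After the load: recreate the index.
CREATE NONCLUSTERED INDEX IX_FactSales_CustomerKey
    ON dbo.FactSales (CustomerKey)
    WITH (SORT_IN_TEMPDB = ON);
```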

Best Practice #2 - Avoid SELECT * in Look Up component of Data Flow Task

To fetch the lookup data, use the result of a SQL query rather than the table or view option on the Connection tab of the Lookup component, as shown in Image 1.0.


(Image 1.0)

Reason: Consider a scenario where the package uses a table with 10 columns to fetch lookup values. If the package uses the table or view option, it will fetch records with all 10 columns, which increases the memory buffer size. If the table holds a lot of data, this can cause poor performance. So it is preferable to use a SQL query containing only the required columns to fetch the lookup values.
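For instance, instead of pointing the Lookup at the whole table, a query such as the following (table and column names are hypothetical) caches only the two columns the lookup actually needs:

```sql
-- Lookup query: return only the join key and the value to retrieve,
-- not all 10 columns of the table.
SELECT CustomerAlternateKey, CustomerKey
FROM dbo.DimCustomer;
```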

Best Practice #3 – Use a combination of Lookup with OLE DB Destination and OLE DB Command as a replacement for the Slowly Changing Dimension (SCD) transformation:

The SCD transformation performs poorly, and this becomes noticeable as soon as the data flow grows beyond roughly 50k records. It is preferable to use a Lookup component to check, based on the key values, whether a record already exists in the destination table: if no lookup match is found, the record is inserted through an OLE DB Destination; if the record is present in the destination table, its columns are updated through an OLE DB Command task (alternatively, the existing record can be marked as historical by the OLE DB Command and the new version inserted through an additional OLE DB Destination).

As an alternative, the package can use the MERGE statement of SQL Server 2008 instead of the SCD transformation for small chunks of data, but it should not be used if the destination table is huge.
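A sketch of the MERGE alternative, assuming a hypothetical staging table stg.Customer and dimension table dbo.DimCustomer (a simple Type 1 overwrite, not a full historical SCD pattern):

```sql
-- Upsert the dimension from the staged rows: update matching keys, insert new ones.
MERGE dbo.DimCustomer AS tgt
USING stg.Customer AS src
    ON tgt.CustomerAlternateKey = src.CustomerAlternateKey
WHEN MATCHED THEN
    UPDATE SET tgt.FirstName = src.FirstName,
               tgt.LastName  = src.LastName
WHEN NOT MATCHED BY TARGET THEN
    INSERT (CustomerAlternateKey, FirstName, LastName)
    VALUES (src.CustomerAlternateKey, src.FirstName, src.LastName);
```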

Best Practice #4 – In the data flow task avoid blocking transformations

Partially blocking transformations: Partially blocking transformations are often used to combine datasets. They tend to have multiple data inputs. As a result, their output may have the same, greater, or fewer records than the total number of input records. Since the number of input records will likely not match the number of output records, these transformations are also called asynchronous transformations. Examples of partially blocking transformation components available in SSIS include Merge, Merge Join, and Union All. With partially blocking transformations, the output of the transformation is copied into a new buffer and a new thread may be introduced into the data flow.

Blocking transformations: Blocking transformations must read and process all input records before creating any output records. Of all of the transformation types, these transformations perform the most work and can have the greatest impact on available resources. Example components in SSIS include Aggregate and Sort. Like partially blocking transformations, blocking transformations are also considered to be asynchronous. Similarly, when a blocking transformation is encountered in the data flow, a new buffer is created for its output and a new thread is introduced into the data flow.

Example (Design Alternative 1): In the design shown in Image 2.0, a Script Component generates 100,000,000 rows that first pass through a Lookup. If the lookup fails because the source value is not found, an error record is sent to a Derived Column transformation, where a default value is assigned to it. After the error processing is complete, the error rows are combined with the original data set through a Union All before all rows are loaded into the destination.

Analysis (Design Alternative 2): Design Alternative 2 uses the same Script Component to generate 100,000,000 rows that pass through a Lookup, but instead of handling lookup failures as error records, all lookup failures are ignored. A Derived Column transformation is then used to assign values to the columns left NULL where the lookup found no match.




(Image 2.0)

Result: Of the two execution alternatives in this scenario, Design Alternative 1's biggest performance bottleneck is the extra copy of the data in memory created for the partially blocking Union All transformation. Design Alternative 2 is faster than Design Alternative 1: with a single execution tree, the operations are consolidated and the overhead of copying data into a new buffer is avoided.

Best Practice #5 - Avoid SELECT *

The Data Flow Task (DFT) in SSIS uses a buffer-oriented architecture (a buffer is a chunk of memory) for data transfer and transformation. When data travels from the source to the destination, it first comes into a buffer; the required transformations are done in the buffer itself, and the data is then written to the destination.

The size of the buffer depends on several factors; one of them is the estimated row size, which is determined by summing the maximum sizes of all the columns in the row. More columns in a row means fewer rows fit in a buffer, and the greater buffer requirements result in performance degradation. Hence it is recommended to select only those columns which are required at the destination. Even if you need all the columns from the source, you should list the column names explicitly in the SELECT statement; otherwise the source needs an extra round trip to gather metadata about the columns when you use SELECT *.

Best Practice #6 - Effect of OLEDB Destination Settings
There are a couple of settings on the OLE DB Destination that can impact the performance of the data transfer, as listed below.

A.) Data Access Mode – This setting provides the 'fast load' option, which internally uses a BULK INSERT statement to upload data into the destination table instead of a simple INSERT statement for each single row, as is the case with the other options. So unless you have a reason for changing it, don't change this default value of fast load. If you select the 'fast load' option, there are also a couple of other settings which you can use, as discussed below.

B.) Keep Identity – By default this setting is unchecked, which means the destination table (if it has an identity column) will create identity values on its own. If you check this setting, the dataflow engine will ensure that the source identity values are preserved and the same values are inserted into the destination table.

C.) Keep Nulls – Again, by default this setting is unchecked, which means that when a NULL value comes from the source for a particular column, the default value is inserted into the destination table (if a default constraint is defined on the target column). If you check this option, the default constraint on the destination table's column is ignored and the NULL from the source column is preserved and inserted into the destination.

D.) Table Lock – By default this setting is checked, and the recommendation is to leave it checked unless the same table is being used by some other process at the same time. It specifies that a table lock will be acquired on the destination table instead of multiple row-level locks, which could otherwise turn into lock-escalation problems.

Best Practice #7 - Effect of Rows per Batch and Maximum Insert Commit Size Settings

Maximum insert commit size – The default value for this setting is 2147483647 (the largest value for a 4-byte integer type), which specifies that all incoming rows will be committed once on successful completion. You can specify a smaller positive value for this setting to indicate that a commit will be done after that number of records. You might be wondering whether changing the default value puts overhead on the dataflow engine by committing several times. Yes, that is true, but at the same time it releases the pressure on the transaction log and tempdb, which would otherwise grow tremendously, specifically during high-volume data transfers.

The above setting is very important to understand in order to manage tempdb and the transaction log. For example, if you leave 'Maximum insert commit size' at its default, the transaction log and tempdb will keep on growing during the load, and if you are transferring a high volume of data, tempdb will soon run out of space and, as a result, your load will fail. So it is recommended to set these values to an optimum level based on your environment.
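For reference, these 'fast load' settings map roughly onto options of the T-SQL BULK INSERT command. The sketch below (file path and table name are hypothetical) shows the approximate equivalents of Keep Identity, Keep Nulls, Table Lock, and a per-batch commit in the spirit of Maximum insert commit size; it is an analogy, not the literal statement the OLE DB Destination issues:

```sql
-- Rough T-SQL analogue of the fast-load settings from Best Practices #6 and #7.
BULK INSERT dbo.FactSales
FROM 'C:\ImportFiles\fact_sales.txt'
WITH (
    KEEPIDENTITY,            -- preserve source identity values (Keep Identity)
    KEEPNULLS,               -- keep NULLs instead of applying column defaults (Keep Nulls)
    TABLOCK,                 -- one table lock instead of many row locks (Table Lock)
    BATCHSIZE = 100000       -- commit every 100,000 rows so the log does not grow through the whole load
);
```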