David Cross
What is scripting? Why should I incorporate it into DataStage?
DataStage (referred to as DS for the remainder of this article) is a wonderful ETL (Export, Transformation, Load) tool. When we add shell scripting, new functionality is available to us.
We are going to present you with some scenarios that you might use scripts for. I will list some of the ways they are typically used. In addition, we will use a simple example to walk through the process from beginning to end. This will help you better understand the components involved.
For scripting, you can use languages like Python or PERL, or just UNIX command line. The scripts or commands are read within the Execute Command stage. We will explain how to use them. What is not included is how to get your data and script file loaded onto the server. That is outside the scope of this article.
Table of Contents:
- Script Examples
- Shell Scripting Example in Data Stage: Unzip Script
- Unzip Script
- Parallel Job Configuration
- Sequence Job Configuration
- Job Execution
Script Usage Examples
- Custom formatting of Email messages including attachments
- Check for the existence of a file
- Start or stop jobs based on an outcome
- Create a dynamic SQL statement
- File manipulation: zip/unzip, move, archive, and delete files
- Check date before executing jobs
- Creation of custom log files
- Record Counting
We are going to describe some of the above examples, for better insight.
Custom Email Formatting
DS has both a built-in Notification stage and a Routine Stage that is useful for sending out emails. In our project we had a requirement to customize the body, subject and add an attachment. This was performed easily with Python and ‘sendmail’.
Custom Log Files
Some DS projects have many stages. We wanted a way to track the steps that were performed and the duration of each step. A PERL script was created that wrote a log entry whenever each step was performed. We also added the start and end time. This was attached to an email when the job was completed.
File Manipulation
Shell/UNIX scripting excels at file manipulation, such as moving, deleting, archiving, compressing and uncompressing of files. We have used this extensively for input files.
Shell Scripting Example in Data Stage: Unzip Script
Typically, a zip file is moved to a folder using FTP. The file is unzipped, then extracted by DS using the Sequential Stage. Using UNIX commands, the extracted file can be zipped again and moved to an archive folder. In addition, we can check file timestamps for dates older than 1 month old, for example. These can be deleted.
As described above, unzipping of an input file before ingesting the data is a common task. This is a perfect job for a script. Let us create a DS job (Sequencer and Parallel job) with this practical purpose.
The job will execute the following:
- The zipped file will land in the folder that is accessible to DS.
- Our Execute Command stage will read our script to unzip it.
- A Sequential stage will ingest the file.
- A DB2 Connector stage will load the contents into the target table.
Unzip Script
This is our unzip_example.pl PERL script. It will perform the unzipping of the input file. Save this file on the DS server where it can be accessed.
- #!/usr/bin/perl -w
- # -- UNZIP FILE EXAMPLE
- # -- The file we want to unzip
- my $file = 'testdata.zip';
- # -- The directory of the file
- my $specifiedDir = '/ds-shared/data/';
- # -- The command to unzip the file
- my $rc = system("/usr/bin/unzip -jp $file -d $specifiedDir") or warn "$!\n";
Parallel Job Configuration
1. Open DS Designer and create a new Parallel job (Menu > File > New Parallel Job). Click the OK button.
2. Go to the menu again and navigate to View > Palette, this will make it so that we can drag two stages onto the canvas.
3. First choose the DB2 Connector Stage, or whatever Database type connector you are using in your project (Neteeza, Oracle, JDBC, ODBC, Informix, etc). Move this to the right side of the canvas.
Sequential File Stages
4. Second, choose the Sequential File Stage and drag that to the left side of the canvas.
5. Drag the mouse from the Sequencer to the DB2 connector to join them. If you desire, you can add an Annotation as well, for notes. Right click on each of the stages and add a meaningful name for them. Your job should look like the image below.
Save and Properties
6. Save your job now using File > Save As. We will call this job LOAD_UNZIPPED_FILE.
7. Let us configure the Sequencer stage ‘EXTRACT_UNZIPPED_CSV_FILE’. Double click to open it.
8. On the Properties tab, click on ‘File’ and then on the right side, click the black triangle and choose “Browse for File”. Navigate to the folder where your zipped file is. Repeating what we said earlier, this assumes you have access to the DS folder and were able to previously add the file.
Access path and name
9. Add the path and the name of the ‘unzipped’ This is important. The script is going to be unzipping the file, but the Sequencer is ingesting the csv file. Click the ok button to save.
10. Double click the DB2 connector to configure this stage. Add your database name, user id, and password, and your target table. Save this stage and then save your job.
Sequence Job Configuration
1. Create a new Sequence Job and drag the following onto the canvas, arranging them from left to right.
- Execute Command stage
- Job Activity stage
- Notification Activity
2. Connect the three stages and apply meaningful names. Save the job. In our case it is called SEQ_UNZIP_SCRIPT. The job should look like this:
ExecCommand
3. To configure the Execute Command, go to the ‘Command’ section, add your file path and the name of the script.
Activity Stage
4. In the Activity Stage, browse for the name of your Parallel job and populate ‘Job name’. In our case it is LOAD_UNZIPPED_FILE.
Notifications Stage
5. For the Notification stage, enter your email details (server, recipient, sender, subject title) and click OK to save.
6. Save the changes to the Sequencer job SEQ_UNZIP_SCRIPT.
Job Execution
We are now ready to start the job:
- SEQ_UNZIP_SCRIPT will execute the commands in the script pl.
- The data file will be unzipped.
- The Parallel job LOAD_UNZIPPED_FILEwill run, extracting the data from the file and importing it into a database table.
- Finally, a notification will be sent out, informing us that it has completed.
Conclusion
A note about our example. The unzip script is simple. In the example, we have deliberately left out the checks and balances that would typically be added to a job. This is for simplicity. At the very least, Variable and Nested Condition stages should be added, for branching the results of each action (ex: file not found, or data corruption). Additionally, we can use scripts to create custom logs and notifications.
We have also given you just a few examples of script operations. Perhaps your project requirements will entice you to develop new ways to incorporate scripts. Scripting and the ETL functions of DS, together, are a great combination.