Advanced Concepts of dfPower Studio: DataFlux Tips and Tricks
Chris Martin, Client Services Manager
Gary Townsend, Solutions Consultant
Consuming a directory of input files with a single text file input node
How can I consume 500 text files sitting in a single folder, all having the same field structure? Using 500 different input nodes would be tedious and time-consuming.
Creating one job with a macro variable would be more efficient, but it would still require that the job be run 500 times.
Using the DataFlux delimited and fixed-width input nodes, we can point a single input node at an entire directory of files instead of a single file. This causes DataFlux to append the contents of each file together (a data union), leaving the end user with a usable data set consisting of all records across all input files. See the example below for a description of how to use this feature.
Consuming a directory of input files with a single text file input node
Output of directory read:
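DataFlux performs this union natively once the input node is pointed at a folder. For readers who want to see the mechanics, here is an illustrative plain-Python sketch of the same behavior (the `*.txt` pattern and comma delimiter are assumptions, not DataFlux settings):

```python
import glob
import os

def read_directory(path, pattern="*.txt", delimiter=","):
    """Union the rows of every matching delimited file in a directory,
    mimicking what a text file input node does when pointed at a folder
    instead of a single file."""
    rows = []
    for fname in sorted(glob.glob(os.path.join(path, pattern))):
        with open(fname) as f:
            for line in f:
                rows.append(line.rstrip("\n").split(delimiter))
    return rows
```

All files are assumed to share the same field structure, exactly as the DataFlux feature requires.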
Using macro variables to create dynamic file names
Always wanted to dynamically create files names or pass table values between pages in an Architect job? The example below shows how to create and make use of macro variables and their values between pages of an Architect job.
Defining Data: PAGE 1 of JOB
EXPRESSION NODE:
Pre-Expression:
    Integer Execution_Number
    String DateToday
    Execution_Number = 2239
    DateToday = today()
    DateToday = FormatDate(DateToday, "MMDDYYYY")
    Pushrow()

Expression:
    Seteof()

Post Expression:
    setvar("EX_NUM", Execution_Number)
    setvar("TodaysDT", DateToday)
Using macro variables to create dynamic file names
Retrieving Data: PAGE 2 of JOB
EXPRESSION NODE:

Pre-Expression:
    String Execution_Number
    String TodayDate
    Execution_Number = getvar("EX_NUM")
    TodayDate = getvar("TodaysDT")
    pushrow()

Expression:
    Seteof()
Text File Output:
FileName Property Value: C:\%%Execution_Number%%_%%TodayDate%%.txt
Expected Results:
One row of data will be written to a file named 2239_<today's date in MMDDYYYY format>.txt.
Inside it you will see the value of Execution_Number as well as the value stored in the TodayDate macro variable. The possibilities for using macros in this manner are endless. Experiment with other uses.
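The token substitution performed on the FileName property can be sketched in plain Python (an illustrative stand-in, not DataFlux internals; the date value "01152011" below is a made-up example):

```python
import re

def resolve_macros(template, macros):
    """Replace %%NAME%% tokens with macro values, the way the Text File
    Output node resolves macro variables in its FileName property."""
    return re.sub(r"%%(\w+)%%",
                  lambda m: str(macros.get(m.group(1), m.group(0))),
                  template)
```

For example, `resolve_macros(r"C:\%%Execution_Number%%_%%TodayDate%%.txt", {"Execution_Number": 2239, "TodayDate": "01152011"})` yields `C:\2239_01152011.txt`. Unknown tokens are left untouched rather than blanked out.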
Passing Macro Variable Values as a Command Line Argument
We all know DataFlux's ability to utilize macro variables within Architect & Profile jobs, but what if you do not want to declare them as static values within the architect.cfg file?
By passing them as part of the command line when invoking DataFlux, we can declare macro variable values dynamically at execution time. Below are some examples of the syntax required on both UNIX and Windows platforms:
UNIX/LINUX DIS PLATFORMS:
INPUT_FILE=/dataflux/input/audit1.txt OUTPUT_FILE=/dataflux/output/audit1_out.txt /dfpower/bin/dfexec -log /dataflux/DISjoblob/joblogname.log ../var/dis_arch_job/jobname.dmc
WINDOWS DIS PLATFORM:
set INPUT_FILE=C:\dataflux\input\audit1.txt & set OUTPUT_FILE=C:\dataflux\output\audit1_out.txt & "C:\Program Files\DataFlux\DIS\8.2\bin\dfexec.bat" -log c:\dataflux\DISjoblob\joblogname.log c:\dataflux\jobs\jobname.dmc
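Both examples set the macro values through the process environment before dfexec starts. The lookup-with-fallback pattern they rely on can be sketched in Python (illustrative only; `macro_value` is a hypothetical helper, not a DataFlux API):

```python
import os

def macro_value(name, default):
    """Fetch a macro override from the environment, falling back to a
    default the way a static architect.cfg entry would."""
    return os.environ.get(name, default)
```

With `INPUT_FILE` exported as in the UNIX example, `macro_value("INPUT_FILE", "/tmp/default.txt")` returns the exported path; without it, the static default wins.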
Architect Node - Advanced Properties
Using the advanced properties of nodes within dfPower Architect can drastically reduce the time and effort associated with managing a large number of fields. The following examples contain two very practical uses of the advanced properties.
Copy & paste fields from external data provider into job specific data node
Architect Node ‐ Advanced Properties
Architect Node ‐ Advanced Properties Standardizing all fields into same field name
Alternate Date/Time Extraction Methods
Counting Records in Text File
We all know how simple it is to extract a record count from a database table, but what if you want to determine how many records exist in a text file so you can increment a counter accordingly?
• Option 1: Open the file in Notepad and count records one by one
• Option 2: Just take a wild guess and hope you are close
• Option 3: Let DataFlux do the counting!
Counting Records in Text File
Let DataFlux count your records….
We will stick to…
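The counting itself is nothing exotic; here is a minimal Python sketch of what "count the records in a text file" amounts to (the optional header handling is an assumption about your file layout, not a DataFlux feature):

```python
def count_records(path, has_header=False):
    """Count data records in a text file, optionally skipping a header row."""
    with open(path) as f:
        total = sum(1 for _ in f)
    return total - 1 if has_header and total else total
```

Inside an Architect job, the same effect comes from letting the job stream the file and increment a counter per row.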
Remove control characters within your data:
Before: After:
ASCII Control Characters

Char  Oct  Dec  Hex  Control-Key  Control Action
NUL    0    0    0   ^@           NULl character
SOH    1    1    1   ^A           Start Of Heading
STX    2    2    2   ^B           Start of TeXt
ETX    3    3    3   ^C           End of TeXt
EOT    4    4    4   ^D           End Of Transmission
ENQ    5    5    5   ^E           ENQuiry
ACK    6    6    6   ^F           ACKnowledge
BEL    7    7    7   ^G           BELl, rings terminal bell
BS    10    8    8   ^H           BackSpace (non-destructive)
HT    11    9    9   ^I           Horizontal Tab (move to next tab position)
LF    12   10    a   ^J           Line Feed
VT    13   11    b   ^K           Vertical Tab
FF    14   12    c   ^L           Form Feed
CR    15   13    d   ^M           Carriage Return
SO    16   14    e   ^N           Shift Out
SI    17   15    f   ^O           Shift In
DLE   20   16   10   ^P           Data Link Escape
DC1   21   17   11   ^Q           Device Control 1, normally XON
DC2   22   18   12   ^R           Device Control 2
DC3   23   19   13   ^S           Device Control 3, normally XOFF
DC4   24   20   14   ^T           Device Control 4
NAK   25   21   15   ^U           Negative AcKnowledge
SYN   26   22   16   ^V           SYNchronous idle
ETB   27   23   17   ^W           End Transmission Block
CAN   30   24   18   ^X           CANcel line
EM    31   25   19   ^Y           End of Medium
SUB   32   26   1a   ^Z           SUBstitute
ESC   33   27   1b   ^[           ESCape
FS    34   28   1c   ^\           File Separator
GS    35   29   1d   ^]           Group Separator
RS    36   30   1e   ^^           Record Separator
US    37   31   1f   ^_           Unit Separator
Remove non-ASCII Latin-1 characters within your data:
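Both cleanups (control characters and non-ASCII Latin-1 characters) reduce to removing character ranges from each field. A minimal Python sketch of the idea, keeping tab, LF, and CR so record structure survives (an assumption; your job may strip those too):

```python
import re

# ASCII control characters except tab (\x09), LF (\x0a), CR (\x0d)
_CONTROL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")
# anything outside the 7-bit ASCII range (covers the non-ASCII Latin-1 block)
_NON_ASCII = re.compile(r"[^\x00-\x7f]")

def clean_field(value):
    """Strip ASCII control characters and non-ASCII characters from a field."""
    return _NON_ASCII.sub("", _CONTROL.sub("", value))
```

In an Architect job the equivalent work is done with expression-node string functions; this just shows what the before/after screenshots are doing to the bytes.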
Edit Distance for matching
• Edit_Distance Function: DataFlux, through its EEL, exposes a function called Edit_Distance, which compares two strings and returns the number of characters that would need to be changed, added, deleted, or rearranged to make the two strings equal. Edit distance is often described as a measure of how many "edits" are required to turn one text string into another, and is commonly used to suggest spelling corrections.
• Input Data:
Edit Distance for matching
Name 1         Name 2        DOB 1        DOB 2        SSN 1      SSN 2
Isabell Smith  Isabel Smith  02/22/1924   02/24/1923   123456789  234422352

Name1 & Name2 Diff: 1    DOB1 & DOB2 Diff: 2    SSN1 & SSN2 Diff: 7
Why use Edit_Distance()?
• Introduce an additional layer of "fuzzy" matching
• Determine the difference between two strings / words
• Set "likeness" thresholds for matching
Edit Distance for matching
• How to use Edit_Distance()
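To make the numbers concrete, here is a classic Levenshtein implementation in Python. Note this counts only insertions, deletions, and substitutions; DataFlux's Edit_Distance also mentions rearrangements, so treat this as an approximation of the idea rather than the exact EEL function:

```python
def edit_distance(a, b):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, or substitutions that turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]
```

Run against the sample record: "Isabell Smith" vs. "Isabel Smith" gives 1 (one deleted 'l'), and "02/22/1924" vs. "02/24/1923" gives 2, matching the name and DOB diffs shown in the input data.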
Need to match on portions of a field?
Ever need to match on subcomponents of a field? Say you want to find all customers at a specific location code, but positions 8 and 13 of the code mean nothing, while the rest of the value must match exactly.
MatchCodes won't help here, and edit distance may give you the proper results only in some cases. Instead, let's use the Expression node to create a new field built using the left/right/mid functions.
Expression node
Expression node
Clustering node
Without Location_Substring With Location_Substring
Data - Examples
Original Data
Cluster Results (Location Code or Address/City/State/Postal)
Cluster Results (Location Code or Address/City/State/Postal or Location_Substring)
Sort_Words for Matching
• sort_words Function:
- DataFlux, through its EEL, exposes a function called sort_words that performs an ascending or descending sort of the words within a field.
- The function can also eliminate a word if it is duplicated in the field.
- This function becomes valuable when a business requirement calls for matching on a free-form field (for example, material or parts descriptions).
Sort_Words (Expression Node)
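A plain-Python stand-in for the behavior described above (not the EEL function itself; the parts-description sample is invented):

```python
def sort_words(text, descending=False, dedupe=False):
    """Sort the whitespace-separated words of a field, optionally removing
    duplicate words, so free-form descriptions normalize to one ordering."""
    words = text.split()
    if dedupe:
        words = list(dict.fromkeys(words))  # drop repeats, keep first occurrence
    return " ".join(sorted(words, reverse=descending))
```

For example, "steel bolt m8 bolt" and "bolt m8 steel" both normalize to "bolt m8 steel" with dedupe enabled, which is exactly why the function helps match free-form fields.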
Move_File after job completion
• move_file Function:
- DataFlux, through its EEL, exposes a function called move_file, which moves a file from one directory to another.
- This functionality is important when input files are processed and should be moved to a secondary location so that they are not processed again.
- It is also viable as the last page of a job that runs continuously, 'listening' for a file to arrive at an input location.
move_file function
Before After
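The before/after effect can be sketched with Python's standard library (`move_processed` is an illustrative helper name, not an EEL or DataFlux API):

```python
import os
import shutil

def move_processed(src, dest_dir):
    """Move a processed input file into an archive directory so a
    'listening' job does not pick it up again."""
    os.makedirs(dest_dir, exist_ok=True)
    dest = os.path.join(dest_dir, os.path.basename(src))
    shutil.move(src, dest)
    return dest
```

After the move, the input directory no longer contains the file, which is precisely what prevents double processing.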
Cluster Flagging
At times it may be necessary to identify whether a record was part of a multi-row cluster or was just a single, non-matched record.
Cluster Flagging
Data after this node
Cluster Flagging
We sort on the cluster ID and sequence fields. The sequence field is sorted in descending order, as it is imperative to know whether a cluster has a sequence number higher than 1 when flagging it as a multi-row or single-row cluster.
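The flagging pass can be sketched in Python. Because rows arrive sorted by cluster ID with the sequence descending, the first row seen for each cluster carries its highest sequence number, which alone decides the flag (the MULTI/SINGLE labels are illustrative names):

```python
def flag_clusters(rows):
    """Flag each (cluster_id, sequence) row as MULTI or SINGLE.
    Assumes rows are sorted by cluster id with sequence descending,
    so the first row per cluster has that cluster's highest sequence."""
    flags = {}
    for cluster_id, seq in rows:
        if cluster_id not in flags:
            flags[cluster_id] = "MULTI" if seq > 1 else "SINGLE"
    return [(cid, seq, flags[cid]) for cid, seq in rows]
```

A cluster whose leading sequence is 1 can only contain one row, so it is flagged SINGLE; anything higher means a multi-row cluster.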
Cluster Flagging
Cluster Flagging
Questions
Any of the presented topics and/or workflows can be provided on request. Please see the instructors after the session to obtain this information.