ab-initio Parallel processing

ab-initio Parallel
processing
Component Parallelism
Program components execute
simultaneously on different
branches of the graph
Eg: Sort customer and transaction on
separate branches of the same graph
then join it using key
Pipeline Parallelism
Occurs when a connected sequence of program components
on the same branch of a graph execute simultaneously.
Not for all the ab-initio component
Pipeline parallelism is broken
when Sort component used
(It needs to read all the input rec to sort before writing out)
Data Parallelism
Partitions
Partition components
Partition By Key
It distributes data records to its output flow partitions according to key values
Runtime behaviour
Reads records in arbitrary order from the in port
Distributes them to the flows connected to the out port,
according to the key parameter, writing records with the
same key value to the same output flow
Parameter
key
Refer Partition by key and sort component also
Partition by Expression
It distributes data records to its output flow
partitions according to a specified DML expression.
Runtime behaviour
Reads records in arbitrary order from the flows connected to the in port
Distributes the records to the flows connected to the out port,
according to the expression in the function parameter
Parameter
Function
The expression must evaluate to a number between 0 and the number of flows connected to the out port minus 1.
Partition by Expression routes the record to the flow number returned by this expression.
Flow numbers start at 0.
Ex: DML expr: zipcode/10000 so 30338/10000 goes to 3rd output port
Partition by Range
Tightly coupled with
Find Splitters
Component
sorts data records according to a key specifier,
and then finds the ranges of key values that divide
the total number of input data records approximately
evenly into a specified number of partitions.
Parameters
Key
Name(s) of the key field(s) and the sequence specifier(s) you want
Find Splitters to use when it orders data records and sets splitter points.
num_partitions
Number of partitions into which you want to divide the total number of data records evenly.
Runtime behaviour
Reads records from the in port
Sorts the records according to the key specifier in the key parameter
Writes a set of splitter points to the out port in a format suitable for the split port of Partition by Range
Typically, you route the output from the out port of Find Splitters to the split port of PARTITION BY RANGE.
It has 2 input ports
IN port
split port
Parameter
Key
Note: The field(s) specified must exist in the record formats for both the
in and split ports, and must be of the same type in both record formats.
Runtime behaviour
Reads splitter records from the split port, and assumes that these records are sorted according to the key parameter.
Determines whether the number of flows connected to the
out port is equal to n (where n-1 represents the number of splitter records).
If not, Partition by Range writes an error message and stops the execution of the graph.
Reads data records from the flows connected to the in port in arbitrary order.
Distributes the data records to the flows connected to
the out port according to the values of the key field(s), as follows:
Assigns records with key values less than or equal
to the first splitter record to the first output flow.
Assigns records with key values greater than the first splitter record,
but less than or equal to the second splitter record to the second output flow, and so on.
Important Consideration with this component and find splitters
Use the same key specifier for both components.
Make the number of partitions on the flow connected
to the out port of Partition by Range the same as the value
in the num_partitions parameter of Find Splitters.
Partition with Load Balance
distributes data records to its output flow partitions
by writing more records to the flow partitions that consume records faster
No Parameter
Run time behaviour
Reads records in arbitrary order from the flows connected to its in port
Distributes those records among the flows connected to its out port
by sending more records to the flows that consume records faster
Partition with Load Balance writes data
records until each flow's output buffer fills up.
Important Note
Although Partition with Load Balance balances the workload
between CPUs, the resulting number of data records in each partition can be unbalanced.
You can use PARTITION BY ROUND-ROBIN to balance the number of data records among partitions.
Partition by round robin
distributes blocks of data records evenly to each output flow in round-robin fashion
Parameter
Block size
Number of records distributed to one flow before distributing the same number to the next flow.
Default is 1.
Runtime behaviour
Reads records from the in port.
Distributes them in block_size chunks to its output flows according to the order in which the flows are connected
Partition by percentage
distributes a specified percentage of the total number of input data records to each output flow
Parameter
Percentage
List of percentages between 1 to 100, seperated by comma
You can assign a different percentage to each output flow
Runtime behaviour
Reads records from the in port
Writes a specified percentage of the
input records to each flow on the out port
Peculiar Input port
It contains IN port and pct port
By connecting the output of any component that produces
a list of percentages to the pct port of Partition by Percentage.
Use decimal('\n') as the record format for the pct port of Partition by Percentage.
Thus two ways to specify %
1 by percentage parameter and another by inputing 'pct' input port
Broadcast
Broadcast arbitrarily combines all the data records it receives
into a single flow and writes a copy of that flow to each of its
output flow partitions.
Runtime behaviour
Reads records from all flows on the in port
Combines the records arbitrarily into a single flow
Copies all the records to all the flow partitions connected to the out port
Use
Use Broadcast to increase data parallelism when you have connected
a single fan-out flow to the out port or
to increase component parallelism when you have connected
multiple straight flows to the out port.
Using Parallel files
Multifile
Structure
Control file
Data files which are located on different disk
Ad-hoc multifile
Referencing multifile
Eg; mfile:/ab1/initi/test.dat
Multifile co-op commands
m_mkfs - Create multifile system
m_rmfs - remove multifile system
m_mkdir, m_rmdir-Create and remove multidir
m_cp - Copy multifile
m_mv - move multifile
m_chmod - Change mode
m_touch - Create empty multifile
m_ls - List multifie
Tells dataskew %
m_du - Printing disk usage
Size of diskusage in KB
-partitions - size for its all partitions
m_df - Printing information about multifile system
m_expand - prints to stdout various information
about the multifile, multidirectory, file, or directory
Flow
Straight
Connect the components with same depth of parallelism.
Parallel -> Parallel, serial -> serial
Fan-in
connects a component with a greater depth of
parallelism to one with a lesser depth.
Kind of many to one relationships (Not always, check next branch)
Eg; 4 way to serial, 4 way to 2 way
You can only use a fan-in flow when the result of dividing
the greater number of partitions by the lesser number of
partitions is an integer.
If this is not the case, you must use an all-to-all flow.
Fan-out
connects a component with a lesser number of partitions
to one with a greater number of partitions
Kind of One to Many relationships (not always, check next branch)
Eg: serial to 4 way parallel
2 way parallel to 4 way parallel
All-to-All
Happens in 2 conditions
Connect components with different numbers of partitions,
when the result of dividing the greater number of partitions
by the lesser number is not an integer
Repartition data, using components with the
same or different numbers of partitions
Not all the components have this facility
Should go with partition-repartition
or departition components to acheive this
Repartitioning
Changing one or both of the following
The degree of parallelism of partitioned data
The grouping of records within the partitions of partitioned data
Why repartitiong
Read partitioned data file by greater no of partitioning program component to increase the processing speed
Connecting 2 processing stages having two different degree of parallelism
Load balance on different CPU
To perform global sort
Example
Sorting Multifile
Departitioning
Concatenate
appends multiple flow partitions of data records one after another
No Parameter
Runtime behaviour
Reads all the data records from the first flow connected to the in port
(counting from top to bottom on the graph) and copies them to the out port.
Then reads all the data records from the second flow connected
to the in port and appends them to those of the first flow, and so on.
Automatic flow buffering should be on to avoid dead lock
No default record assignment. I/P record format should be identical to O/P
Merge
combines data records from multiple flow partitions that
have been sorted according to the same key specifier,
and maintains the sort order
Parameter
Key
Caution
combines data records from multiple flow partitions that have
been sorted according to the same key specifier, and maintains the sort order
Normally used after sort compoents
Gather
combines data records from multiple flow partitions arbitrarily
Runtime behaviour
Reads data records from the flows connected to the in port.
Combines the records arbitrarily.
Writes the combined records to the out port.
Usage
Reduce data parallelism, by connecting a single fan-in flow to the in port
Reduce component parallelism, by connecting multiple straight flows to the in port
No parameters
No gather for sort component
You do not need to use a Gather component when connecting
a fan-in or all-to-all flow to the in port of a Sort ,
because Sort can gather internally on its in port.
Interleave
combines blocks of data records from multiple flow partitions in round-robin fashion
Runtime Behaviour
Reads the number of data records specified in the blocksize parameter from the first flow connected to the in port
Reads the number of data records specified in the blocksize parameter from the next flow, and so on
Writes the records to the out port
Parameter
Blocksize
Just like partition by round robin
It is related to partition by round robine
Can cause deadlock
LAYOUT
What is?
The location of files
The number and locations of the partitions of multifiles
The number of, and the locations in which, the partitions of program components execute
Critical Concerns
The Co>Operating System must be installed on the computers specified by the layout.
The run host must be able to connect to the computers specified by the layout.
The layout must allow enough space for the files the graph needs to write there.
The permissions in the directories of the layout must allow the graph to write files there.
Who uses layout?
Intermediate file component
Phase, checkpoint, watcher
Buffered flow
Many programming components - like sort
Depth Of parallelism
Dataskew
m_du,m_df,m_ls can be used to identify this
12