
Data Chunking: Filtering the Output of a Publishing Job

Chunking affects the output of GO Publisher Workflow.

By default, GO Publisher Workflow outputs all of the processed data into a single file. This file can be split into 'chunks' by specifying a 'chunking scheme' (geographic or attribute) and/or defining a target file size. Chunking is essential when exchanging large data volumes: without it, large exports would be unmanageable and prone to data corruption during file transfer.

GO Publisher Workflow currently supports the following chunking scenarios:

  • Geographic Chunking – where the Selection is written into files whose features share a geographic area.
  • Attribute Chunking – where the Selection is written into files whose features share the same attribute value.
  • File Size Chunking – where the Selection is written into files of approximately the same target size.

A chunking scheme and a target file size can be specified in the same job. Each type of chunking is explained below, along with how to configure a job to use it.

Path Patterns

Chunking scheme jobs require a "{chunk}" FILE path pattern (and METADATA path pattern, if applicable) to be specified either in the product.xml or directly in the publishing job.

File size chunking jobs require a "{sequence}" path pattern.

Jobs that use both require both "{sequence}" and "{chunk}" path patterns.

 

For information on configuring path patterns in the product.xml and publishing job, see Path Patterns.
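
As an illustration, the sketch below shows how these tokens might expand at publish time. The pattern strings and the .gml extension are hypothetical; the exact syntax supported by the product.xml and publishing job is described in Path Patterns.

Illustrative token expansion (hypothetical patterns)
output/{chunk}.gml          ->  output/NE.gml, output/NW.gml, ...           (one file per chunk)
output/part_{sequence}.gml  ->  output/part_1.gml, output/part_2.gml, ...   (one file per target-size chunk)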

Geographic Chunking

Geographic chunking allows you to divide the output data of your job into segments based upon a geographic area.

Geographic chunking requires two tables, GP_CHUNK_SCHEME and GP_CHUNKS, to exist in the source database (see Publish Job Configuration Tables).

Example

This example defines a chunking scheme named 'large', registered in the GP_CHUNK_SCHEME table below. The scheme contains four boundaries, which are added to the GP_CHUNKS table. To split the published data into separate files based on these boundaries, specify the 'large' chunking scheme in the publishing job. This example is based on the Treasure Island data and training we offer.

You can create multiple chunking schemes in the GP_CHUNK_SCHEME table, with corresponding boundaries in the GP_CHUNKS table. For example, you could add another scheme called 'small' to the GP_CHUNK_SCHEME table and add its boundaries to the GP_CHUNKS table.

The GP_CHUNK_SCHEME table in the publishing database:

| NAME  |
| large |

The corresponding GP_CHUNKS table:

| SCHEME | CHUNK_ID | PATH | EXTENT |
| large  | NE       | N    | MDSYS.SDO_GEOMETRY(2003,null,null,MDSYS.SDO_ELEM_INFO_ARRAY(1,1003,1),MDSYS.SDO_ORDINATE_ARRAY(35000,24600,37000,24600,37000,26600,35000,26600,35000,24600)) |
| large  | NW       | N    | MDSYS.SDO_GEOMETRY(2003,null,null,MDSYS.SDO_ELEM_INFO_ARRAY(1,1003,1),MDSYS.SDO_ORDINATE_ARRAY(37000,24600,39000,24600,39000,26600,37000,26600,37000,24600)) |
| large  | SW       | S    | MDSYS.SDO_GEOMETRY(2003,null,null,MDSYS.SDO_ELEM_INFO_ARRAY(1,1003,1),MDSYS.SDO_ORDINATE_ARRAY(37000,22600,39000,22600,39000,24600,37000,24600,37000,22600)) |
| large  | SE       | S    | MDSYS.SDO_GEOMETRY(2003,null,null,MDSYS.SDO_ELEM_INFO_ARRAY(1,1003,1),MDSYS.SDO_ORDINATE_ARRAY(35000,22600,37000,22600,37000,24600,35000,24600,35000,22600)) |

Column definitions:

SCHEME = The name of the chunking scheme, matching the NAME in the GP_CHUNK_SCHEME table.

CHUNK_ID = The ID of the specific boundary/geographic chunk. This ID becomes the name of the published file.

PATH = The name of the folder that will contain the published file.

EXTENT = The SDO_GEOMETRY boundary defining the geographic area of the chunk.

 

In this example, the scheme name is 'large'. When you download the finished job, the published batch of files is split into two folders, N and S: the N folder contains the NE and NW files, and the S folder contains the SE and SW files.
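
Assuming a FILE path pattern that includes the "{chunk}" token and GML output (the .gml extension is illustrative, not prescribed), the downloaded batch would be laid out like this:

N/
    NE.gml
    NW.gml
S/
    SE.gml
    SW.gml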

You can now specify the scheme name in the Job request, as shown below:

Example job to publish features chunked using the 'large' chunking scheme
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<gpa:publishingJob xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:gpa="http://www.snowflakesoftware.com/agent/go-publisher"
    xmlns:sfa="http://www.snowflakesoftware.com/agent" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
 
    <!-- The user will enter the job name -->
    <sfa:jobName>go-publisher-workflow-job</sfa:jobName>
 
    <!-- Priorities are 1-high, 2-medium, 3-low -->
    <sfa:priority>2</sfa:priority>
  
    <!-- Product name -->
    <gpa:product ref="sfProduct" />
         
    <!-- Additional selection scheme goes here -->
         
    <!-- Chunking scheme -->
    <gpa:geographicChunkingScheme ref="large" />
 
    <!-- Additional metadata parameters -->
 
</gpa:publishingJob>

Attribute Chunking

Attribute chunking works in a similar way to geographic chunking. However, the output is chunked based on a column in the input data, so there is no need to define the chunking scheme in separate tables.

Example

If there is an attribute (column) named area with the possible values a, b, and c, you could chunk your published data on the values in this column. This would produce three output files: one for a, one for b, and one for c.

Example job to publish features chunked using the area attribute
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<gpa:publishingJob xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:gpa="http://www.snowflakesoftware.com/agent/go-publisher"
    xmlns:sfa="http://www.snowflakesoftware.com/agent" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
 
    <!-- The user will enter the job name -->
    <sfa:jobName>go-publisher-workflow-job</sfa:jobName>
 
    <!-- Priorities are 1-high, 2-medium, 3-low -->
    <sfa:priority>2</sfa:priority>
  
    <!-- Product name -->
    <gpa:product ref="sfProduct" />
         
    <!-- Additional selection scheme goes here -->
         
    <!-- Chunking scheme -->
    <gpa:attributeChunkingScheme ref="area" />
 
    <!-- Additional metadata parameters -->
 
</gpa:publishingJob>

File Size Chunking

File-size chunking enables you to specify a target size for your output files. When you define a target chunk size in the job, GO Publisher Workflow will output files that each contain enough features to reach the target size.

The target value is specified in megabytes and refers to the uncompressed size. The maximum value is one petabyte.

The file size is a "target" rather than a maximum.

On average, files will be larger than the target file size: once the target is reached, GO Publisher stops adding new features but finishes writing any feature it has already started.

If the specified file size is too small for a file to contain even a single feature, single-feature chunks larger than the target will be produced.

Example

The following job limits the size of the output data files to a target of 1MB:

Example job to publish files with a target size of 1MB
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<gpa:publishingJob xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:gpa="http://www.snowflakesoftware.com/agent/go-publisher"
    xmlns:sfa="http://www.snowflakesoftware.com/agent" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
 
    <!-- The user will enter the job name -->
    <sfa:jobName>go-publisher-workflow-job</sfa:jobName>
 
    <!-- Priorities are 1-high, 2-medium, 3-low -->
    <sfa:priority>2</sfa:priority>
  
    <!-- Product name -->
    <gpa:product ref="sfProduct" />
         
    <!-- Additional selection scheme goes here -->
 
    <!-- Additional chunking scheme goes here -->

    <gpa:targetFileSize>1</gpa:targetFileSize>
 
    <!-- Additional metadata parameters -->
 
</gpa:publishingJob>
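
A chunking scheme and a target file size can also be combined in a single job (as noted under Path Patterns, such jobs require both "{chunk}" and "{sequence}" path patterns). The following sketch simply reuses the elements from the examples above; the element order shown is an assumption, so check it against your schema:

Example job combining the 'large' chunking scheme with a 1MB target file size
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<gpa:publishingJob xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:gpa="http://www.snowflakesoftware.com/agent/go-publisher"
    xmlns:sfa="http://www.snowflakesoftware.com/agent" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
 
    <!-- The user will enter the job name -->
    <sfa:jobName>go-publisher-workflow-job</sfa:jobName>
 
    <!-- Priorities are 1-high, 2-medium, 3-low -->
    <sfa:priority>2</sfa:priority>
 
    <!-- Product name -->
    <gpa:product ref="sfProduct" />
 
    <!-- Geographic chunking scheme: splits output by the boundaries in GP_CHUNKS -->
    <gpa:geographicChunkingScheme ref="large" />
 
    <!-- Target file size in megabytes: splits each chunk into ~1MB files -->
    <gpa:targetFileSize>1</gpa:targetFileSize>
 
</gpa:publishingJob>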

 

Further Reading

Interested in filtering the data retrieved from your source database? See Data Selection for more information.
