Treehouse Dataflow Toolkit (TDT): Data Transfer to Analytics/ML/AI-friendly Targets

TDT enables rapid data transfer from Kafka pipelines to Amazon Redshift, Snowflake, Amazon Athena/S3, Amazon S3 Express One Zone, and Amazon Aurora PostgreSQL.

Treehouse Dataflow Toolkit (TDT) is a set of proprietary microservices that provide highly available, auto-scalable, event-driven data transfers to your data science teams’ favorite analytics frameworks.

Customers either already have, or are in the process of acquiring, software tools that replicate their data into Kafka pipelines (e.g., Amazon MSK, Confluent). Our new offering, TDT, provides a turn-key solution for moving this data from Kafka into advanced Analytics/AI/ML-friendly targets such as Amazon Redshift, Snowflake, Amazon Athena/S3, Amazon S3 Express One Zone buckets, and Amazon Aurora PostgreSQL, all while adhering to AWS’s and Snowflake’s recommended best practices for massive data loading, thus assuring the shortest and surest loads.

How Does TDT Work?

When a mainframe data replication tool (provided by one of Treehouse’s partners) publishes both bulk-load and CDC data in JSON format to a reliable and scalable framework like Kafka, it sets the stage for TDT to feed legacy data from Kafka to any number of JSON-friendly ETL tools, target datastores, and data analytics packages (some of which may not even have been invented yet!).
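For illustration only, the snippet below parses one such JSON change record as it might appear on a Kafka topic. The envelope fields ("op", "table", "after", and so on) are assumptions made for this sketch; the actual message layout is defined by the partner replication tool.

    import json

    # Hypothetical CDC record as it might be published to the Kafka topic.
    # Field names ("op", "table", "after", ...) are illustrative assumptions,
    # not the partner tool's actual wire format.
    sample_message = """
    {
      "op": "UPDATE",
      "table": "CUSTOMER",
      "change_ts": "2024-06-01T12:34:56Z",
      "after": {"CUST_ID": 1001, "NAME": "ACME CORP", "BALANCE": 2500.75}
    }
    """

    record = json.loads(sample_message)
    print(record["table"], record["op"], record["after"]["CUST_ID"])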


  1. We start at the source – the mainframe – where an agent (with a very small footprint) extracts data (in the context of either bulk-load or CDC processing).
  2. The raw data is securely passed from the mainframe by one of our partners’ data replication tools, which transforms the data into Unicode/JSON and publishes the results to a Kafka topic (in our example above, a topic in an Amazon MSK cluster).
  3. TDT microservices consume the data from MSK/Kafka and land it in S3 buckets, where TDT’s proprietary crawler technology automatically prepares landing tables, views, and additional infrastructure for the various analytics-friendly targets. The mainframe data is then loaded into Redshift, Snowflake, S3, or PostgreSQL, adhering throughout to AWS’s and Snowflake’s recommended best practices for massive data loading, thus assuring the shortest and surest loads (a sketch of this step follows this list). The inherent reliability and scalability of the entire pipeline infrastructure assures near-real-time synchronization between mainframe sources and the target tables, even with huge bulk-loads or transaction-heavy CDC processing.
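The following is a minimal sketch of step 3, assuming a Python microservice, the kafka-python and boto3 libraries, and hypothetical topic, bucket, cluster, and table names; TDT’s actual microservices are proprietary and may work quite differently. What it illustrates is the best-practice load path: batch records to S3, then bulk-load with a single COPY rather than row-by-row inserts.

    import json
    import boto3
    from kafka import KafkaConsumer

    # All names below (topic, bucket, cluster, role, tables) are hypothetical.
    TOPIC = "mainframe.customer.cdc"
    BUCKET = "example-tdt-landing"
    IAM_ROLE = "arn:aws:iam::123456789012:role/example-redshift-copy-role"

    # Consume JSON records from the Kafka/MSK topic.
    consumer = KafkaConsumer(
        TOPIC,
        bootstrap_servers=["broker1:9092"],
        value_deserializer=lambda m: json.loads(m.decode("utf-8")),
    )

    s3 = boto3.client("s3")
    redshift = boto3.client("redshift-data")

    batch = []
    for message in consumer:
        batch.append(message.value)
        if len(batch) >= 10_000:  # flush in large batches
            key = f"landing/customer/{message.offset}.json"
            # Land the batch in S3 as newline-delimited JSON.
            body = "\n".join(json.dumps(r) for r in batch)
            s3.put_object(Bucket=BUCKET, Key=key, Body=body.encode("utf-8"))
            # Bulk-load the landed file with COPY (the recommended path for
            # massive loads), rather than issuing row-by-row INSERTs.
            redshift.execute_statement(
                ClusterIdentifier="example-cluster",
                Database="analytics",
                DbUser="loader",
                Sql=f"COPY landing.customer_delta FROM 's3://{BUCKET}/{key}' "
                    f"IAM_ROLE '{IAM_ROLE}' FORMAT AS JSON 'auto';",
            )
            batch.clear()

Loading from staged files in S3 with COPY (or Snowflake’s equivalent COPY INTO) is the bulk-load pattern both AWS and Snowflake recommend for large volumes, which is the “best practice” referred to above.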

History is enterprise GOLD…

TDT not only keeps targets up to date faster than a conventional ODBC-based solution, but the “delta tables” into which it loads data also inherently retain the entire history of source data since mainframe-to-target synchronization began. So, for example, after TDT has been syncing a target table for 5 years, a data scientist has 5 years’ worth of historical data to work with for trend analysis, predictive analytics, prescriptive analytics, ML, etc.
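As a hedged illustration of what that retained history makes possible, the query below computes a five-year monthly trend from a hypothetical delta table via the Amazon Redshift Data API; the table and column names are assumptions, not TDT’s generated schema.

    import boto3

    # Hypothetical delta-table layout: one row per change event, keyed by a
    # business key plus a change timestamp. Names are assumptions only.
    redshift = boto3.client("redshift-data")
    redshift.execute_statement(
        ClusterIdentifier="example-cluster",
        Database="analytics",
        DbUser="analyst",
        Sql="""
            SELECT cust_id,
                   DATE_TRUNC('month', change_ts) AS month,
                   AVG(balance)                   AS avg_balance
            FROM   landing.customer_delta
            WHERE  change_ts >= DATEADD(year, -5, current_date)
            GROUP  BY cust_id, DATE_TRUNC('month', change_ts)
            ORDER  BY cust_id, month;
        """,
    )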

…but you also need the very latest data in near-real-time.

While TDT’s unique “delta-tables” approach offers comprehensive history for advanced analytics, it also fully addresses the traditional need for up-to-the-second, current snapshots of mainframe datastores. Once again adhering to target vendors’ best practices, TDT provides self-materializing views for working with current data, not only in the JSON format in which it is stored, but also in fully structured views that offer the more traditional look and feel of a SQL database.
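A minimal sketch of the idea behind such a current-snapshot view, assuming a Redshift target and hypothetical object names: the snapshot is simply the most recent change per business key in the delta table. TDT’s actual generated views (including its JSON-format views) may be structured differently.

    import boto3

    # Hedged illustration: a "current snapshot" view that keeps only the most
    # recent change per business key. All object names are hypothetical.
    redshift = boto3.client("redshift-data")
    redshift.execute_statement(
        ClusterIdentifier="example-cluster",
        Database="analytics",
        DbUser="loader",
        Sql="""
            CREATE OR REPLACE VIEW analytics.customer_current AS
            SELECT cust_id, name, balance, change_ts
            FROM (
                SELECT d.*,
                       ROW_NUMBER() OVER (PARTITION BY cust_id
                                          ORDER BY change_ts DESC) AS rn
                FROM landing.customer_delta d
            ) AS latest
            WHERE rn = 1;
        """,
    )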

Treehouse Dataflow Toolkit (TDT) is Copyright © 2024 Treehouse Software, Inc. All rights reserved.

