Companies increasingly rely on quality data. Numerous articles have highlighted the importance of building modern, reliable, and transparent data platforms to empower businesses and enhance AI capabilities. At the same time, there’s a growing buzz about the “death of ETL,” suggesting a shift toward alternative approaches. At Datuum, we’ve been at the forefront of automating data pipeline creation, and we’re excited to share our insights on this evolving landscape.
The Evolution of ETL in Data Integration
Contrary to popular belief, ETL, the mainstay of data integration, is far from dead. Most organizations integrate with multiple systems, which requires extracting, transforming, cleaning, and mapping data. The evolution of ETL from a niche skill to a major technology market segment is undeniable. We’ve seen the emergence of SQL, high-level ETL platforms, Spark, no-code data pipeline tools, dbt, and AI, each bringing a new paradigm for handling growing data volumes more efficiently and at a lower total cost of ownership (TCO).
The market now demands that ETL processes be more accessible and less reliant on specialized, costly resources. The goal is to build quality data products without constant dependence on data engineering teams. This shift reflects a broader industry trend toward democratizing data processes. I believe we now have the capability to reach the pivotal point where technology fully meets these new demands.
Datuum’s Innovative Approach to Data Pipelines
Since 2021, Datuum has been pioneering a blend of no-code, AI, and automated code generation. Our aim is to revolutionize the experience of building data pipelines for both data engineers and analysts.
The Datuum Philosophy:
- AI-Driven Semantic Understanding: Use AI to semantically understand the data and automatically map data sources to destinations, ensuring compatibility (a simplified sketch of this idea follows the list).
- Automated Code Generation: Generate code that efficiently transforms source data to seamlessly integrate into the destination, optimizing the data flow process.
- Dynamic Data Pipeline Generation: Create data pipelines that effectively move, transform, and load data, optimizing for both performance and accuracy.
- User-Friendly, No-Code Interface: Provide an intuitive, no-code interface that simplifies data pipeline creation, making it accessible to users without engineering skills.
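To make the first point more concrete, here is a deliberately simplified sketch of automatic column mapping. Datuum’s actual matching models are proprietary and considerably more sophisticated; the `embed()` function below is only a toy stand-in based on character trigrams, where in practice a real text-embedding model over column names, sample values, and descriptions would do the semantic work.

```python
# Illustrative sketch only: Datuum's actual matching models are proprietary.
# embed() is a toy stand-in (character-trigram counts); a real text-embedding
# model would normally provide the semantic signal.
from collections import Counter
from math import sqrt


def embed(text: str) -> Counter:
    """Toy 'embedding': character-trigram counts of a normalized string."""
    t = text.lower().replace("_", " ")
    return Counter(t[i:i + 3] for i in range(len(t) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def map_columns(source_cols, dest_cols, threshold=0.35):
    """Propose a source -> destination column mapping above a similarity threshold."""
    dest_vecs = {d: embed(d) for d in dest_cols}
    mapping = {}
    for s in source_cols:
        s_vec = embed(s)
        best, score = max(((d, cosine(s_vec, v)) for d, v in dest_vecs.items()),
                          key=lambda x: x[1])
        if score >= threshold:
            mapping[s] = best
    return mapping


print(map_columns(["cust_name", "sign_up_dt"], ["customer_name", "signup_date"]))
# {'cust_name': 'customer_name', 'sign_up_dt': 'signup_date'}
```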
Our approach at Datuum can be likened to building a self-driving car for data. We focus on developing both the AI ‘brain’, which directs and optimizes the process, and the ‘vehicle’ – the data pipelines that execute these operations seamlessly.
Navigating Challenges in Building Smart Data Pipelines
Challenge 1. The Decision: Build or Buy?
Our primary goal was to develop the ‘brain’ for data pipelines, eliminating the need for manual code writing. Hence, our choice was to integrate with an established data pipeline platform rather than build from scratch.
Challenge 2. Choosing the Right Platform.
Our criteria were straightforward: open-source, a substantial number of connectors, and a large community. After thorough research, we adopted Airbyte as our data pipeline platform, with Datuum as the driving intelligence.
In our journey to integrate Airbyte with Datuum, we encountered several technical challenges that tested our ingenuity and commitment to delivering a superior product.
Challenge 3. Data Type Preservation in Airbyte’s Universal Approach
Airbyte’s robust platform, while advantageous for its versatility, presented a unique challenge in its universal approach to data conversion. This process typically involves converting data to JSON format and then back to table format. Such conversions, unfortunately, risked losing crucial data type information.
For Datuum, maintaining data integrity and type accuracy was non-negotiable, as we deliver data to a predefined destination with all transformations automatically generated. To counter this, we introduced a specific metadata layer. This layer plays a pivotal role in restoring data types accurately when converting from JSON to table formats, ensuring the fidelity of the data throughout the process.
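The snippet below is a minimal illustration of that idea, not Airbyte’s internal format or Datuum’s actual implementation: capture the source column types as metadata before extraction, then reapply them when JSON records are materialized back into typed rows. The column names and type labels are hypothetical.

```python
# Simplified illustration of a type-preserving metadata layer. This is not
# Airbyte's internal format; it only shows the principle of capturing source
# types up front and reapplying them after the JSON round trip.
import json
from datetime import date
from decimal import Decimal

# Hypothetical metadata captured from the source schema before extraction.
COLUMN_TYPES = {
    "order_id": "integer",
    "amount": "decimal",
    "ordered_on": "date",
}

CASTERS = {
    "integer": int,
    "decimal": Decimal,
    "date": date.fromisoformat,
    "string": str,
}


def restore_row(raw_json: str, column_types: dict) -> dict:
    """Cast a JSON record's values back to their declared source types."""
    record = json.loads(raw_json)
    return {
        col: CASTERS[column_types.get(col, "string")](value) if value is not None else None
        for col, value in record.items()
    }


row = restore_row('{"order_id": "42", "amount": "19.90", "ordered_on": "2024-01-31"}',
                  COLUMN_TYPES)
print(row)
# {'order_id': 42, 'amount': Decimal('19.90'), 'ordered_on': datetime.date(2024, 1, 31)}
```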
Challenge 4. Navigating Community Connector Variabilities
One of Airbyte’s strengths is its community-driven development, especially in terms of connector variety. However, this diversity also led to several inconsistencies:
- Language and Data Handling Variations: We observed that connectors developed in different programming languages, particularly Java and Python, exhibited disparities in data handling. Python-based connectors using Pandas treated certain values, such as NaN versus null and dates, differently than Java-based connectors.
- Database-Specific Connector Behaviors: Connector behavior varied significantly across different databases. For instance, in PostgreSQL, table names are automatically converted to lowercase with a 63-character limit – a rather unconventional approach. In contrast, Snowflake converts all table names to uppercase.
- Diverse Naming Conventions and Limitations: Across various databases, we encountered connectors replacing periods (‘.’) in names with underscores (‘_’), even when the RDBMS supported periods. Furthermore, naming conventions in connectors were inconsistent – some encased names in quotation marks while others did not.
- File Connector Inconsistencies: File connectors also behaved differently. For example, while the GCP connector allowed connection to an entire folder (provided all files had a .csv extension), the File connector limited connections to a single file at a time.
These variabilities posed significant challenges in creating a unified, reliable approach to data integration using community connectors. To address these, we developed custom logic for each connector, ensuring consistency and reliability in our data processing.
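As a rough illustration of that custom logic, the sketch below normalizes identifiers per destination using only the rules mentioned in this post; the real per-connector rules are considerably more extensive, and the destination names here are just examples.

```python
# A sketch of the kind of per-connector normalization described above.
# Only the rules mentioned in this post are encoded; real connectors need more.
def normalize_identifier(name: str, destination: str) -> str:
    """Normalize a table/column name the way a given destination connector expects."""
    # Some connectors replace periods even when the RDBMS would accept them.
    name = name.replace(".", "_")
    if destination == "postgres":
        # PostgreSQL folds unquoted identifiers to lowercase and truncates at 63 characters.
        return name.lower()[:63]
    if destination == "snowflake":
        # Snowflake folds unquoted identifiers to uppercase.
        return name.upper()
    return name


print(normalize_identifier("Sales.Orders_2023", "postgres"))   # sales_orders_2023
print(normalize_identifier("Sales.Orders_2023", "snowflake"))  # SALES_ORDERS_2023
```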
Challenge 5. Transition to dbt for Code Generation
Prior to our integration with Airbyte, Datuum primarily generated SQL code. Airbyte, however, advocates the use of dbt (data build tool) for data transformations. Transitioning to dbt required strategic adjustments but ultimately proved to be a worthwhile endeavor. The dbt approach aligns well with our philosophy of automated, efficient data processing, enhancing our capability to automate data pipeline creation more effectively.
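As a heavily simplified sketch of what code generation toward dbt can look like (not Datuum’s actual generator, and with hypothetical source, table, and column names): render a dbt model that renames and casts columns according to a proposed mapping, then write it into the project’s models directory.

```python
# Simplified sketch of generating a dbt model from a column mapping.
# The mapping, source, and file names are hypothetical; the rendered SQL assumes
# the source is declared in the dbt project's sources YAML.
from pathlib import Path

# mapping: source column -> (destination column, destination SQL type)
MAPPING = {
    "cust_name": ("customer_name", "varchar"),
    "sign_up_dt": ("signup_date", "date"),
}


def render_dbt_model(mapping: dict, source_name: str, table_name: str) -> str:
    """Render a dbt model that selects, renames, and casts mapped source columns."""
    select_list = ",\n    ".join(
        f"cast({src} as {dtype}) as {dest}" for src, (dest, dtype) in mapping.items()
    )
    return (
        f"select\n    {select_list}\n"
        f"from {{{{ source('{source_name}', '{table_name}') }}}}\n"
    )


out = Path("models/stg_customers.sql")
out.parent.mkdir(parents=True, exist_ok=True)
out.write_text(render_dbt_model(MAPPING, "crm", "customers"))
print(out.read_text())
```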
Simplifying the Data Onboarding Journey
As a result of overcoming these challenges, Datuum now successfully generates dbt code based on AI-assisted mapping between data sources and destinations. This process culminates in a comprehensive Airbyte data pipeline, adept at extracting, loading, and transforming data with precision and efficiency. Our journey with Airbyte has been a testament to our commitment to navigating and solving complex technical challenges in the realm of data integration.
At Datuum, we’re proud to offer a tool that simplifies the creation of Airbyte pipelines, making data integration more accessible and efficient. If you’re looking to streamline your data processes, we invite you to explore Datuum’s innovative solutions.