
Standard connection — standard problems

When it comes to onboarding new data sets into your data ecosystem, there’s no shortage of tools promising an easy solution. These tools often boast libraries with connectors to a wide range of major systems, from Salesforce and Facebook to QuickBooks. Out-of-the-box connectors make it seem like you’re all set, right?

Well, here’s the catch: these tools primarily focus on replicating the data, and they excel at it. Names like Airbyte and Fivetran have captured a significant share of the data integration and replication market by simplifying connections to numerous sources and efficiently pulling in data. They offer user-friendly, configurable interfaces.

But here’s the crux of the matter: data replication isn’t the same as data onboarding. It’s a crucial initial step, but it’s only the tip of the iceberg.

The real work begins once you’ve replicated the data into your environment. To make the data truly usable, you need to integrate it into your data ecosystem, which could include your Data Lake, Data Warehouse, and so on. The challenge is that data replication tools have no insight into the structure of your database or APIs. Furthermore, data sources and destinations rarely align perfectly. This means that someone has to step in to transform, normalize, and ingest the new data into your schema, allowing your users and applications to derive real value from the new data source.

Image Source: https://twitter.com/nakedpastor

Data transformation is the heart of the problem

Data replication tools generally follow the ELT approach, which stands for Extract, Load, and Transform. The transformation stage is where things tend to get complex, requiring a lot of manual and time-consuming effort. For instance, receiving 200 fields from a Salesforce API doesn’t automatically make that data usable. Someone needs to make sense of it. Interestingly, about 90% of companies resort to using Excel as an intermediary mapping tool in this process. Let’s dig into this a bit more.
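To make that concrete, here is a minimal Python sketch of the “T” in ELT. All field names below are hypothetical, not the actual Salesforce API response shape; the point is that someone has to encode the rename, cast, and cleanup rules that usually live in that Excel file.

```python
# A minimal sketch of the "T" in ELT: raw replicated records rarely
# match the destination schema, so someone must rename, cast, and
# normalize fields. All field names here are hypothetical.

RAW_TO_WAREHOUSE = {  # the mapping that typically lives in an Excel file
    "AccountName__c": "account_name",
    "AnnualRevenue": "annual_revenue_usd",
    "CreatedDate": "created_at",
}

def transform(record: dict) -> dict:
    """Keep only mapped fields, rename them, and normalize values."""
    row = {}
    for src, dst in RAW_TO_WAREHOUSE.items():
        value = record.get(src)
        if dst == "annual_revenue_usd" and value is not None:
            value = float(value)   # cast text amounts to numbers
        if dst == "account_name" and value is not None:
            value = value.strip()  # trim stray whitespace
        row[dst] = value
    return row

raw = {
    "AccountName__c": " Acme Corp ",
    "AnnualRevenue": "1200000",
    "CreatedDate": "2023-05-01",
    "UnusedField": "ignored",    # one of the ~200 fields nobody mapped yet
}
print(transform(raw))
```

Even this toy version shows where the effort goes: every one of those 200 fields needs a decision — keep, rename, cast, or drop — before the data fits your model.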

When we consider the whole endeavor of importing a new data set or creating a new connector for your data ecosystem, replicating data from a fresh source to your environment is crucial but only accounts for about 20% of the time and effort. The remaining 80% is typically devoted to the vital task of making this data truly usable, essentially integrating it into your existing model.

It’s essential to point out that many data replication tools offer a feature where developers can create custom transformation scripts. However, this often boils down to scripting based on those Excel files we mentioned earlier.

Imagine this scenario: you’ve got an analytics database for your Google ad campaigns, and now your marketing team wants to dive into Facebook ad campaigns as well. Many tools provide a ready-made connector for the Facebook Ads API. But here’s the catch: these tools have no clue about the structure and purpose of your marketing department’s analytics data. So, once again, you need someone to map, transform, and integrate it into your analytical framework. And guess what? Most teams end up using Excel for this task. But let’s face it: Excel isn’t the most reliable tool for building data pipelines, and the work often lands in the data engineering team’s backlog.
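The mapping step the connector can’t do for you looks roughly like this sketch, which folds hypothetical Facebook Ads records into an analytics schema originally built around Google Ads. The field names are illustrative, not the real Facebook Ads API response shape:

```python
# A sketch of the mapping the connector cannot do for you: folding
# Facebook Ads records into an analytics schema originally shaped
# around Google Ads. Field names are illustrative, not the real
# Facebook Ads API response.

FACEBOOK_TO_ANALYTICS = {
    "campaign_name": "campaign",
    "spend": "cost_usd",
    "impressions": "impressions",
    "clicks": "clicks",
}

def normalize_facebook(record: dict) -> dict:
    """Rename fields and tag the row so both ad channels coexist."""
    row = {dst: record.get(src) for src, dst in FACEBOOK_TO_ANALYTICS.items()}
    row["source"] = "facebook_ads"       # distinguish from google_ads rows
    row["cost_usd"] = float(row["cost_usd"])  # spend arrives as a string
    return row

fb = {"campaign_name": "Spring Sale", "spend": "350.25",
      "impressions": 12000, "clicks": 340}
print(normalize_facebook(fb))
```

A handful of lines here, but multiply it across every field, every edge case, and every new source, and it becomes the backlog item the marketing team ends up waiting on.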

When teams pick a data integration tool for data onboarding, it’s often sold to the business side with the promise that bringing in data from any source will be a breeze: easy, fast, and cost-effective. However, as I mentioned, getting the data in is just the beginning of the data journey.

Whenever you need to integrate one standard system with another standard system, there is almost always an Excel file in between.

Regrettably, many business stakeholders tend to believe that a tool with numerous standard connections is a one-size-fits-all solution capable of dramatically expediting development and reducing the number of iterations required to make the data ready.

So, What’s the Solution?

Taking all this into account, it’s important to realize that while data replication tools certainly ease the burden of creating connectors, they don’t provide a complete fix for onboarding challenges. Efficiency, cost-cutting, and time-saving are undoubtedly benefits, but they’re not the whole solution. The most intricate part still lies ahead, so it’s crucial to manage your expectations accordingly.

However, there are tools designed to simplify data mapping and transformation, reducing manual efforts. Enter Datuum, a solution we’ve developed to tackle this very issue. It streamlines the process, saving your company valuable time and money by automating the often tedious tasks of manual data mapping, code generation, and building data pipelines. Datuum leverages AI to understand your data destination schema and takes care of the entire journey, from data mapping and transformation to code generation and data pipeline creation. In a nutshell, it offers an end-to-end solution for data onboarding.

Book A Demo

Get in touch to find out more about how Datuum can help
