How to hack your way to efficient data extraction

Extracting data from existing databases is a complex task that typically falls to the Data Engineering team. Legacy systems add further challenges, such as inconsistent data modeling and limited documentation. Despite these obstacles, quick results are expected. This article shares insights and tips to help you navigate these challenges and save time.

Extracting data from an existing database is one of the most challenging and tedious jobs that typically fall on the plate of the Data Engineering team. And when it comes to dealing with true legacy systems, well, let’s just say it adds an extra layer of intrigue to the mix. Unraveling the secrets left behind by generations of engineers is like solving a complex puzzle. Inconsistent data modeling, insufficient documentation, limited knowledge within the current team, and unclear semantics are just a glimpse of what lies ahead. Yet, amidst these daunting obstacles, the team is expected to deliver results with lightning speed.

In this article, we aim to impart insights gained through hard-earned experience, offering a lifeline that may save you precious time.

So, let us explore these pivotal non-technical starting points:

  • Identify a knowledge holder: Even if a system/database is very old, there is usually at least one person who has some understanding of how it works and where data is coming from. This individual can be an invaluable resource to you, but keep in mind that they may be busy, have knowledge gaps, or may not like your project at all. Nevertheless, building a positive relationship with this person can be critical to your success. They can help you save significant time and effort. Do not criticize the current state of things, even if there are obvious flaws and improvement opportunities. Focus on extracting the data as efficiently as possible with minimum dependencies on the source system.
  • Manage project stakeholders’ expectations: It’s important to educate stakeholders about the nature of data extraction projects. Many people mistakenly assume it’s a one-time task, often due to a lack of experience in this area. However, unless you’re working with a very small database, it’s likely to require multiple iterations to ensure the data is complete and accurate. Therefore, it’s essential to communicate realistic timeframes and the need for an iterative process to stakeholders. Estimating that effort is very challenging, so try to identify the most critical piece of data to focus on initially.
  • Establish a shared understanding of semantics with both the users of the source system and the end users of the data: In any domain, people tend to use different terms for the same things. Involving those who use the source system can help you explain the data’s behavior and identify any anomalies that arise. For instance, when we had to extract data from a widely used EHR system, it was a surprise that patient information was stored in a table named ‘users.’ By collaborating with the end users and the source system users, we were able to address such semantic inconsistencies and prevent confusion down the line.
  • Establish early success criteria: There are always some standardized reports generated from the legacy system, and you can typically use those reports as a set of reconciliation metrics. Don’t be surprised if you discover inconsistencies and/or errors during the reconciliation process. But it’s always helpful to establish a baseline against which to evaluate the completeness and quality of the data; a minimal reconciliation sketch follows this list.
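To make the reconciliation baseline concrete, here is a minimal sketch in Python with pandas. Everything specific in it is an assumption: the legacy report export (legacy_monthly_report.csv), the staged extract (staging/visits.parquet), and the metric columns are hypothetical placeholders to adapt to whatever standardized reports your system actually produces.

```python
# Minimal reconciliation sketch: compare aggregates from a legacy report
# against the same metrics recomputed from the extracted data.
# All file and column names below are hypothetical placeholders.
import pandas as pd

# Baseline figures exported from the legacy system's standard report.
report = pd.read_csv("legacy_monthly_report.csv")  # e.g. columns: month, patient_count, total_visits

# The same metrics recomputed from the extracted raw data.
extracted = pd.read_parquet("staging/visits.parquet")
recomputed = (
    extracted
    .assign(month=pd.to_datetime(extracted["visit_date"]).dt.to_period("M").astype(str))
    .groupby("month")
    .agg(patient_count=("patient_id", "nunique"), total_visits=("visit_id", "count"))
    .reset_index()
)

# Join the two and flag months where the numbers drift apart.
check = report.merge(recomputed, on="month", suffixes=("_report", "_extracted"))
check["visits_diff"] = check["total_visits_extracted"] - check["total_visits_report"]
print(check[check["visits_diff"] != 0])  # any non-zero rows need investigation
```

Even a handful of such checks gives you and your stakeholders an objective way to talk about completeness instead of arguing from impressions.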

Ok, and what about the technical side of the project? There are many approaches and tools you can use to do the job, from plain SQL all the way to no-code tools. It’s up to you and your arsenal of skills, but here are some high-level suggestions that typically help:

  • Change Data Capture (CDC): It’s essential to understand how the data changes over time, including retrospective changes, in order to establish reliable Change Data Capture for the tables you’re interested in. Many legacy databases don’t have an out-of-the-box CDC feature, so you may have to fall back on a full data refresh or a watermark-based incremental load (sketched after this list). An alternative is to use a data replication tool with built-in CDC capabilities; tools like Airbyte and Fivetran may help save time there.
  • Analyze the workload to identify the most and least used objects/tables: Table names alone can be misleading, so analyzing the database workload helps you understand which tables are actually read and written, and therefore which objects in the data model are critical for your extraction (a sample workload query is sketched after this list).
  • Extract and store data in its raw form: It’s always good practice to extract the raw data and store it as is; don’t perform any transformations at that step. Use your Data Lake or staging area to store it until the iteration/project is complete. This minimizes the performance impact on the source database. It also lets you compare the extract from a previous run with the current version, helping to identify changes you may have missed during implementation, and allows you to rerun transformations without touching the source database again (see the staging sketch after this list).
  • Structure transformations in multiple stages: When transforming the extracted data, it’s best to break the work into multiple steps rather than cramming everything into a single, complex set of instructions. This keeps the transformation code maintainable and makes it easier for you and your team to understand what’s going on later. Overcomplicated code is a hindrance, especially when revisiting it after several iterations; good documentation and structured code save you and your team time (a staged-pipeline sketch appears after this list).
  • Consider using a no-code tool: Maintaining both the code and the documentation can be a lot of work, so a no-code tool that handles data lineage, transformations, and documentation can be helpful. When selecting a tool, always consider who will maintain the pipeline, how technical that person is, and how to make the process as efficient as possible; some tools require more technical skills than others.
  • Create a Semantic Model: A semantic model is extremely useful when integrating data from multiple sources into a single data model. Normalizing the data and making the transformations transparent and demonstrable are highly beneficial. This approach ensures that your data is well-structured and easy to understand, making it simpler for data users to access and analyze.
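To illustrate the CDC fallback, here is a minimal watermark-based incremental extraction sketch in Python with SQLAlchemy and pandas. The connection string, the orders table, and the updated_at column are hypothetical assumptions; the sketch also assumes the source reliably maintains that timestamp, which is exactly the kind of behavior worth confirming with your knowledge holder.

```python
# Watermark-based incremental extraction for sources without built-in CDC.
# Table, column, and connection details are hypothetical placeholders.
import json
from datetime import datetime, timezone
from pathlib import Path

import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:password@legacy-host/legacy_db")  # placeholder DSN
state_file = Path("state/orders_watermark.json")

# Load the watermark from the previous run (or start from the beginning).
last_seen = json.loads(state_file.read_text())["updated_at"] if state_file.exists() else "1970-01-01"

query = text("SELECT * FROM orders WHERE updated_at > :last_seen ORDER BY updated_at")
with engine.connect() as conn:
    changed = pd.read_sql(query, conn, params={"last_seen": last_seen})

if not changed.empty:
    # Land the raw increment as-is; transformations happen in later stages.
    Path("staging/orders").mkdir(parents=True, exist_ok=True)
    run_id = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S")
    changed.to_parquet(f"staging/orders/increment_{run_id}.parquet", index=False)

    # Persist the new watermark for the next run.
    state_file.parent.mkdir(parents=True, exist_ok=True)
    state_file.write_text(json.dumps({"updated_at": str(changed["updated_at"].max())}))

# Caveat: rows deleted in the source never appear here, and retrospective edits
# are only caught if the source actually bumps updated_at.
```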
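For the workload analysis, the statistics views differ per engine; the sketch below assumes PostgreSQL and its built-in pg_stat_user_tables view (other engines expose similar data under different names, e.g. SQL Server’s sys.dm_db_index_usage_stats). The connection string is a placeholder, and keep in mind these counters are cumulative since the last statistics reset.

```python
# Workload analysis sketch for PostgreSQL: rank tables by how much they are
# read and written. The connection string is a placeholder.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@legacy-host/legacy_db")  # placeholder DSN

usage = pd.read_sql(
    """
    SELECT relname                                       AS table_name,
           COALESCE(seq_scan, 0) + COALESCE(idx_scan, 0) AS total_reads,
           n_tup_ins + n_tup_upd + n_tup_del             AS total_writes,
           n_live_tup                                    AS approx_rows
    FROM pg_stat_user_tables
    ORDER BY total_reads DESC
    """,
    engine,
)

# Heavily used tables are usually the critical ones; tables with zero reads
# and zero writes are candidates to deprioritize or skip entirely.
print(usage.head(20))
print(usage[(usage["total_reads"] == 0) & (usage["total_writes"] == 0)])
```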
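Here is a minimal “land it raw” sketch for the staging step: each table is dumped as-is into a date-partitioned staging area, which makes run-to-run comparison trivial. The paths, the table list, and the connection string are hypothetical, and a plain SELECT * may be too heavy for very large tables, in which case combine it with the incremental approach above.

```python
# Land raw snapshots in a date-partitioned staging area, no transformations.
# Paths, the table list, and the connection string are placeholders.
from datetime import date
from pathlib import Path

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@legacy-host/legacy_db")  # placeholder DSN
run_dir = Path("data_lake/raw") / date.today().isoformat()
run_dir.mkdir(parents=True, exist_ok=True)

for table in ["users", "visits", "orders"]:  # tables identified during workload analysis
    df = pd.read_sql(f"SELECT * FROM {table}", engine)  # raw, as-is
    df.to_parquet(run_dir / f"{table}.parquet", index=False)

# Comparing today's snapshot with an earlier run highlights changes you may
# have missed: new columns, unexpected volumes, silently deleted rows.
previous = pd.read_parquet("data_lake/raw/2024-01-01/users.parquet")  # hypothetical earlier run
current = pd.read_parquet(run_dir / "users.parquet")
print(set(current.columns) - set(previous.columns), "new columns")
print(len(current) - len(previous), "row count difference")
```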
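Finally, a sketch of breaking transformations into small named stages rather than one monolithic script. The stage boundaries, column names, and the ‘users’ to ‘patients’ rename are illustrative assumptions; the rename simply echoes the semantic mismatch mentioned earlier.

```python
# Multi-stage transformation sketch: each stage does one kind of work and can
# be rerun or inspected on its own. Column names and mappings are illustrative.
import pandas as pd


def stage_clean(raw: pd.DataFrame) -> pd.DataFrame:
    """Stage 1: fix types and obvious data quality issues, nothing else."""
    out = raw.copy()
    out["date_of_birth"] = pd.to_datetime(out["date_of_birth"], errors="coerce")
    out["email"] = out["email"].str.strip().str.lower()
    return out


def stage_conform(clean: pd.DataFrame) -> pd.DataFrame:
    """Stage 2: rename source columns to the agreed business vocabulary."""
    return clean.rename(columns={"user_id": "patient_id", "fname": "first_name", "lname": "last_name"})


def stage_publish(conformed: pd.DataFrame) -> pd.DataFrame:
    """Stage 3: expose only the columns the data consumers agreed on."""
    return conformed[["patient_id", "first_name", "last_name", "date_of_birth", "email"]]


raw_users = pd.read_parquet("data_lake/raw/2024-01-02/users.parquet")  # hypothetical raw snapshot
patients = stage_publish(stage_conform(stage_clean(raw_users)))
patients.to_parquet("data_lake/curated/patients.parquet", index=False)
```

Each stage is small enough to document and test on its own, which is what keeps the pipeline maintainable after a few iterations.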

Data extraction can be challenging, but following these key takeaways can help you avoid issues, especially if you’re doing it for the first time. We intentionally skipped all parts related to security and compliance and assumed you have access to the database.

It’s essential to build strong relationships with knowledge holders, manage project stakeholders’ expectations, establish a common definition of semantics with data users, set clear success criteria, and structure transformations in multiple stages. No-code automation tools like Datuum and data replication tools like Airbyte or Fivetran can simplify the process.

Documenting transformations, maintaining data lineage, and creating a semantic model are crucial to establishing a sustainable and maintainable data pipeline. These steps will benefit the entire organization in the long run.