DynamoDB Data Transformation Safety: from Manual Toil to Automated and Open Source
Nov 23, 2022 15 min read
When designing a product to be a self-serve developer tool, there are often constraints, and one of the most common is scale. Ensuring that our product, Jit, a security-as-code SaaS platform, was built for scale was not something we could bolt on as an afterthought; it needed to be designed and handled from the very first line of code.
We wanted to focus on developing our application and its user experience, without scale issues becoming a constant struggle for our engineers. After researching the infrastructure that would enable this for our team, we decided to build on AWS with a serverless-based architecture.
AWS Lambda is an increasingly popular choice for fast-growing SaaS systems, as it provides many scale and performance benefits out of the box through its suite of tools, most notably the database that typically backs these systems: AWS's DynamoDB.
One of DynamoDB's key benefits is that it is already part of the AWS ecosystem: it abstracts away many operational tasks of management and maintenance, such as maintaining connections with the database, and it requires minimal setup to get started in AWS environments.
As a fast-growing SaaS operation, we need to evolve quickly based on user and customer feedback and embed this within our product. Many of these changes in application design have a direct impact on data structures and schemas.
With rapid and oftentimes significant changes in application design and architecture, we found ourselves needing to make data transformations in DynamoDB very often, and with existing users, it was a priority that this be achieved with zero downtime. (In the context of this article, data transformation refers to modifying data from state A to state B.)
In the spirit of Brendon Moreno from the UFC:
Maybe not today, maybe not tomorrow, and maybe not next month, but only one thing is true, you will need to make data transformations one day, I promise.
Yet, while data transformation is a known constant in engineering and data engineering, it remains a pain point and challenge to do seamlessly. Currently, in DynamoDB, there is no easy way to do it programmatically in a managed way, surprisingly enough.
There are many forms of data transformation, from replacing an existing item's primary key to adding or removing attributes to updating existing indexes, and the list goes on. Yet there remains no simple way to perform any of these in a managed and reproducible manner without resorting to breakable, one-off scripting.
Below, we are going to dive into a real-world example of a data transformation process with production data.
Let’s take the example of splitting a “full name” field into its components, “first name” and “last name”. As you can see in the example below, the data aggregation currently writes names to the table with a “full name” attribute. But let’s say we want to split this field into separate first name and last name fields.
Before:

Id  | FullName
123 | Guy Br

After:

Id  | FirstName | LastName
123 | Guy       | Br
Looks easy, right? Not quite. Even this seemingly simple change requires a sequence of careful steps on the business logic side to transform the data successfully.
But let's first discuss some of the issues you would need to take into account before you even get started. How do you run and manage these transformations in different application environments, particularly when direct access to each environment is not considered a security best practice? You also need to think about service dependencies: what should you do when another service depends on this specific data format? Your service needs to remain backward compatible and still provide the same interface to the external services relying on it.
When you have production clients, possibly the most critical question you need to ask yourself before you modify a single line of code is how you will ensure zero downtime.
Much of what you need to plan for to avoid downtime revolves around testing and verification. How do you even test your data transformation script? What are good practices for running a reliable dry run of a data transformation on production data?
There are so many things to consider before transforming data.
Now consider that, for the most part, this is usually done manually. What an error-prone, tedious process! It looks like we need a fine-grained process that will prevent mistakes and help us manage all of these steps.
To avoid this, we understood we’d need to define a process that would help us tackle the challenges above.
Figure 1: Rewrite Process Flow Chart
We started by adjusting the backend code to write the new data format to the database while still keeping the old one, writing FullName alongside FirstName and LastName to give us some reassurance of backward compatibility. This would let us revert to the previous format if something went terribly wrong.
Link to GitHub
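The snippet itself lives on GitHub and is not reproduced here. As a rough illustration only (the table name, key shape, and the splitFullName helper are assumptions, not the authors' code), a dual-format write with the AWS SDK for JavaScript v3 might look like this:

```javascript
// Illustrative sketch only: the table name, key shape, and splitFullName helper
// are assumptions for this article, not the authors' production code.
const { DynamoDBClient } = require('@aws-sdk/client-dynamodb');
const { DynamoDBDocumentClient, PutCommand } = require('@aws-sdk/lib-dynamodb');

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));
const USERS_TABLE = 'Users'; // hypothetical table name

// Split "Guy Br" into { firstName: 'Guy', lastName: 'Br' }.
const splitFullName = (fullName) => {
  const [firstName, ...rest] = fullName.trim().split(/\s+/);
  return { firstName, lastName: rest.join(' ') };
};

// Write both the old format (FullName) and the new one (FirstName/LastName),
// so existing readers keep working while the transformation rolls out.
const saveUser = async (id, fullName) => {
  const { firstName, lastName } = splitFullName(fullName);
  await ddb.send(new PutCommand({
    TableName: USERS_TABLE,
    Item: { Id: id, FullName: fullName, FirstName: firstName, LastName: lastName },
  }));
};

module.exports = { saveUser };
```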
Next, we wrote a data transformation script that scans the old records and appends the FirstName and LastName attributes to each of them; see the example below:
Link to GitHub
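Again, the linked script is not reproduced here; the following is a hedged sketch of what a paginated scan-and-update transformation could look like, assuming a Users table keyed by Id:

```javascript
// Illustrative sketch only: a paginated scan that back-fills FirstName/LastName
// on records still carrying only FullName. Table name and key are assumptions.
const { DynamoDBClient } = require('@aws-sdk/client-dynamodb');
const { DynamoDBDocumentClient, ScanCommand, UpdateCommand } = require('@aws-sdk/lib-dynamodb');

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));
const USERS_TABLE = 'Users'; // hypothetical table name

const transformUsers = async () => {
  let lastEvaluatedKey;
  do {
    // Scan one page of records at a time to stay within DynamoDB's 1 MB page limit.
    const { Items = [], LastEvaluatedKey } = await ddb.send(new ScanCommand({
      TableName: USERS_TABLE,
      ExclusiveStartKey: lastEvaluatedKey,
    }));

    for (const item of Items) {
      if (!item.FullName || item.FirstName) continue; // nothing to do, or already transformed
      const [firstName, ...rest] = item.FullName.trim().split(/\s+/);
      await ddb.send(new UpdateCommand({
        TableName: USERS_TABLE,
        Key: { Id: item.Id },
        UpdateExpression: 'SET FirstName = :first, LastName = :last',
        ExpressionAttributeValues: { ':first': firstName, ':last': rest.join(' ') },
      }));
    }
    lastEvaluatedKey = LastEvaluatedKey;
  } while (lastEvaluatedKey);
};

transformUsers().catch(console.error);
```

Note that the script skips records that already have a FirstName attribute, so it stays safe to re-run if it is interrupted partway through.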
After writing the actual script (which is the easy part), we needed to verify that it does what it is supposed to. The next step was therefore to run it in a test environment and make sure it works as expected. Only after the script's behavior was confirmed could it be run on the application environments.
The last phase is the cleanup: taking the plunge and deleting the FullName attribute entirely from our records. This purges the old data format, which is no longer used, reducing clutter and preventing any future misuse of it.
Link to GitHub
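As an illustration of this cleanup step (not the authors' script; the table name and key are again assumptions), removing the old attribute could look like this:

```javascript
// Illustrative cleanup sketch only: removes the now-unused FullName attribute.
const { DynamoDBClient } = require('@aws-sdk/client-dynamodb');
const { DynamoDBDocumentClient, ScanCommand, UpdateCommand } = require('@aws-sdk/lib-dynamodb');

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));
const USERS_TABLE = 'Users'; // hypothetical table name

const removeFullName = async () => {
  let lastEvaluatedKey;
  do {
    const { Items = [], LastEvaluatedKey } = await ddb.send(new ScanCommand({
      TableName: USERS_TABLE,
      ExclusiveStartKey: lastEvaluatedKey,
    }));
    for (const item of Items.filter((i) => i.FullName !== undefined)) {
      await ddb.send(new UpdateCommand({
        TableName: USERS_TABLE,
        Key: { Id: item.Id },
        UpdateExpression: 'REMOVE FullName', // drop only the legacy attribute
      }));
    }
    lastEvaluatedKey = LastEvaluatedKey;
  } while (lastEvaluatedKey);
};

removeFullName().catch(console.error);
```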
Let's quickly recap what we have done in the process:

1. Adjusted the backend code to write both the old format (FullName) and the new one (FirstName/LastName).
2. Wrote a data transformation script that appends FirstName and LastName to the existing records.
3. Verified the script in a test environment before running it on the application environments.
4. Cleaned up by removing the now-unused FullName attribute.
This well-defined process helped us build much-needed safety and guardrails into our data transformations. As mentioned before, it allowed us to avoid downtime by keeping the old record format until it was no longer needed, and it provided a good basis and framework for more complex data transformations.
Now that we have a process, let's be honest: real-world data transformations are hardly ever so simple. Let's assume a more likely scenario, in which the data is ingested from an external resource, such as the GitHub API, and our more advanced data transformation requires us to combine data from multiple sources.
Let’s take a look at the example below for how this could work.
In the following table, the GSI partition key is GithubUserId.
For the sake of this data transformation example, we want to add a “GithubUsername” column to our existing table.
Before:

Id  | GithubUserId
123 | 7654321

After:

Id  | GithubUserId | GithubUsername
123 | 7654321      | Guy7B
This data transformation seems as straightforward as the full name example, but there is a little twist.
How can we get the GitHub username if we don't have this information? We have to use an external resource; in this case, the GitHub API.
GitHub has a simple API for extracting this data (you can read the documentation here). We pass the GithubUserId and get back information about the user, which contains the username field that we want:
https://api.github.com/user/:id
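As a small illustrative sketch (not part of the original article), looking up a username for a single user ID with Node 18's built-in fetch might look like this; note that the GitHub API exposes the username in the `login` field:

```javascript
// Illustrative sketch only: look up a GitHub username by numeric user ID.
// Requires Node 18+ for the built-in fetch API.
const getGithubUsername = async (githubUserId) => {
  const response = await fetch(`https://api.github.com/user/${githubUserId}`);
  if (!response.ok) {
    throw new Error(`GitHub API returned ${response.status} for user ${githubUserId}`);
  }
  const user = await response.json();
  return user.login; // the username is exposed in the "login" field
};

module.exports = { getGithubUsername };
```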
The naive flow is similar to the full name example above: scan the user records, call the GitHub API for each record, and update the record with the returned username.
However, in contrast to our previous flow, this naive flow is not safe enough. What happens if something goes wrong while the data transformation is calling the external resource? Perhaps the external resource crashes, blocks your IP, or is simply unavailable for some other reason. In that case, you might end up with production errors, a partial transformation, or other issues with your production data.
What can we do on our end to make this process safer?
You can always resume the script if an error occurs, or try to handle errors in the script itself. However, it is important to be able to perform a dry run with the prepared data from the external resource before running the script on production. A good way to provide greater safety is to prepare the data in advance.
Below is the design of the safer flow:
First, in a separate preparation step, we scan the user records, get the GithubUsername for each of them using the GitHub API, append it to a JSON object of the form `{ [GithubUserId]: GithubUsername }`, and then write that JSON to a file. Only after we have this prepared data do we run the transformation itself.
This is what such a flow would look like:
Link to GitHub
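The linked preparation script is not shown here; the sketch below illustrates the idea under the same assumptions as before (a Users table with Id and GithubUserId attributes, and a local JSON file as the prepared output):

```javascript
// Illustrative preparation sketch only: collect usernames ahead of time so the
// actual transformation never depends on the external API being available.
const fs = require('fs');
const { DynamoDBClient } = require('@aws-sdk/client-dynamodb');
const { DynamoDBDocumentClient, ScanCommand } = require('@aws-sdk/lib-dynamodb');

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));
const USERS_TABLE = 'Users'; // hypothetical table name

const prepareGithubUsernames = async () => {
  const usernames = {}; // { [GithubUserId]: GithubUsername }
  let lastEvaluatedKey;
  do {
    const { Items = [], LastEvaluatedKey } = await ddb.send(new ScanCommand({
      TableName: USERS_TABLE,
      ExclusiveStartKey: lastEvaluatedKey,
    }));
    for (const { GithubUserId } of Items) {
      const res = await fetch(`https://api.github.com/user/${GithubUserId}`);
      if (!res.ok) throw new Error(`GitHub API failed for ${GithubUserId}: ${res.status}`);
      usernames[GithubUserId] = (await res.json()).login;
    }
    lastEvaluatedKey = LastEvaluatedKey;
  } while (lastEvaluatedKey);

  // Persist the prepared data so it can be reviewed and dry-run before any write.
  fs.writeFileSync('github-usernames.json', JSON.stringify(usernames, null, 2));
};

prepareGithubUsernames().catch(console.error);
```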
Next, we scan the user records, look up the GithubUsername for each record in the prepared data (by GithubUserId), and move ahead to updating the record.
Link to GitHub
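And here is a hedged sketch of the transformation step that consumes the prepared file instead of calling GitHub directly (again, the table name and key shape are assumptions):

```javascript
// Illustrative sketch only: apply the prepared usernames with no external calls.
const fs = require('fs');
const { DynamoDBClient } = require('@aws-sdk/client-dynamodb');
const { DynamoDBDocumentClient, ScanCommand, UpdateCommand } = require('@aws-sdk/lib-dynamodb');

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));
const USERS_TABLE = 'Users'; // hypothetical table name
const usernames = JSON.parse(fs.readFileSync('github-usernames.json', 'utf8'));

const addGithubUsernames = async () => {
  let lastEvaluatedKey;
  do {
    const { Items = [], LastEvaluatedKey } = await ddb.send(new ScanCommand({
      TableName: USERS_TABLE,
      ExclusiveStartKey: lastEvaluatedKey,
    }));
    for (const item of Items) {
      const username = usernames[item.GithubUserId];
      if (!username || item.GithubUsername) continue; // no prepared data, or already done
      await ddb.send(new UpdateCommand({
        TableName: USERS_TABLE,
        Key: { Id: item.Id },
        UpdateExpression: 'SET GithubUsername = :username',
        ExpressionAttributeValues: { ':username': username },
      }));
    }
    lastEvaluatedKey = LastEvaluatedKey;
  } while (lastEvaluatedKey);
};

addGithubUsernames().catch(console.error);
```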
And finally, like the previous process, we wrap up by running the script on the testing environment, and then the application environments.
Once we built a robust process that we could trust for data transformation, we understood that to do away with human toil and ultimately error, the best bet would be to automate it.
We realized that even if this works for us today at our smaller scale, manual processes will not grow with us. This isn’t a practical long-term solution and would eventually break as our organization scales. That is why we decided to build a tool that would help us automate and simplify this process so that data transformation would no longer be a scary and painful process in the growth and evolution of our product.
Every data transformation is just a piece of code that performs a specific change in our database, and these scripts should ultimately live in your codebase.
Keeping them there enables a few important operations: the scripts can be reviewed like any other piece of code, versioned alongside the application, and run automatically in every environment.
By automating data transformation processes, you essentially make it possible for every developer to be a data transformer. You likely should not give production access to every developer in your organization, but applying the change is only the last mile. When only a handful of people have access to production, automation leaves them with just validating the scripts and running them, rather than also doing all of the heavy lifting of writing them, which would consume more time than necessary and would not be safe.
When the scripts are in your codebase and their execution is automated via CI/CD pipelines, other developers can review them, and essentially anyone can perform data transformations in any environment, alleviating bottlenecks.
Now that we understand the importance of having the scripts managed in our codebase, we want to create the best experience for every data-transforming developer.
Every developer prefers to focus on their business logic, with as few context switches as possible. This tool helps keep them focused on that logic, so they do not have to start from scratch every time they need to perform a data transformation to support their current task.
For example, dynamo-data-transform provides benefits such as versioned and reproducible transformation scripts, dry runs, and a managed history of operations, as detailed below.
Quick Installation for serverless:
The package can also be used as a standalone npm package (see here).
To get started with DynamoDT, first install the package through npm:
npm install dynamo-data-transform --save-dev
Next, add the tool to your serverless.yml by running:
npx sls plugin install -n dynamo-data-transform
You also have the option of adding it manually to your serverless.yml:
plugins:
  - dynamo-data-transform
To see all of the capabilities that DynamoDT supports, you can run:
sls dynamodt --help
Let's run an example with DynamoDT. We'll start by selecting one of the code samples in the repo; for the sake of this walkthrough we're going to use `v3_insert_users.js`, but you are welcome to test it out with any of the examples you'll find here.
We'll initialize the data transformation folder with the relevant tables by running:
npx sls dynamodt init --stage local
For serverless, the command generates the folders using the resources section in your serverless.yml. The data-transformations folder is generated with a template script, which can be found here.
We will start by replacing the code in the template file v1_script-name.js with:
Link to GitHub
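The replacement contents are not reproduced in this article. As a rough sketch only, based on the package's documented script pattern (the exact export names and util signatures may differ, so treat the generated template and the repository examples as the source of truth), an insert-users script might look something like this:

```javascript
// Hedged sketch of a dynamo-data-transform script. The export names
// (transformUp, transformationNumber) and the utils.insertItems signature are
// assumptions based on the package's examples; the generated template and the
// repository samples are the source of truth.
const { utils } = require('dynamo-data-transform');

const TABLE_NAME = 'UsersExample'; // hypothetical table name

const USERS = [
  { Id: '123', FullName: 'Guy Br' },
  // ...additional seed records
];

const transformUp = async ({ ddb, isDryRun }) => {
  // In a dry run the util should only print the would-be items instead of writing them.
  return utils.insertItems(ddb, TABLE_NAME, USERS, isDryRun);
};

module.exports = {
  transformUp,
  transformationNumber: 1,
};
```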
For most regular data transformations, you can use the util functions from the dynamo-data-transform package. This means you don't need to manage the versions of the data transformation scripts; the package does this work for you. Once you've customized the data you want to transform, you can test the script using the dry run option by running:
npx sls dynamodt up --stage local --dry
The dry run option prints the records in your console so you can immediately see the results of the script, and ensure there is no data breakage or any other issues.
Once you're happy with the test results, you can remove the --dry flag and run it again. This time it will run the script on your production data, so make sure to validate the results and outcome.
Once you have created your data transformation files, the next logical step is to add this to your CI/CD. To do so, add the command to your CI workflow file for production environments.
The command will run immediately after the `sls deploy` command, which is useful for serverless applications.
Finally, all of this is saved, as noted above, so if you want to see the history of the data transformations, you can run:
`npx sls dynamodt history --table UserExample --stage local`
The tool also provides an interactive CLI for those who prefer to work that way, and all of the commands above are supported there as well.
With Dynamo Data Transform, you get the added benefits of being able to version and order your data transformation operations and manage them in a single place. You also have the history of your data transformation operations if you would like to roll back an operation. And last but not least, you can reuse and review your previous data transformations.
We have open-sourced Dynamo Data Transform, the tool we built for internal use to perform data transformations on DynamoDB in serverless-based environments and to manage these formerly manual processes in a safe way.
The tool can be used as a Serverless Plugin and as a standalone NPM package.
Feel free to provide feedback and contribute to the project if you find it useful.
Figure 2: Data Transformation Flow Chart