
Building for Success: Konfir's MLOps Data Transfer Strategy

Introduction

Over the last few months, Konfir has shifted focus from returning data to helping clients make a decision on an employment verification given the data available. That’s why we released the Employment Timeline, and it’s why we’re now investing heavily in our Machine Learning capabilities.

In pursuit of streamlined Machine Learning Operations (MLOps), and to ensure a secure and organised approach to data management, we at Konfir made the strategic decision to migrate from a single AWS account to a three-account infrastructure. Previously, all data-related work was consolidated within a single account. The new setup includes separate AWS accounts for Production, Data Science, and Data Lake, each fulfilling a distinct role in our MLOps ecosystem. It became imperative to adopt the AWS Well-Architected Framework and to consider the factors essential to maintaining data integrity, adhering to stringent security standards, optimising costs, and setting the stage for scalable and efficient data management.

By implementing this multi-account architecture, Konfir extends significant benefits to our clients and internal teams. Separating our infrastructure into three accounts reinforces best practices around data management: fine-grained access controls, so specific permissions can be granted to specific teams; better cost management, as each account is monitored and optimised separately; separation of concerns, with each account dedicated to different use cases; and isolated environments, reducing the risk of interference between workloads.

While a single AWS account is easy to manage, with straightforward access to all resources within it, transitioning to a multi-account architecture necessitates additional configuration. In this blog post, we will delve into the process of setting up cross-account S3 permissions and discuss the considerations and best practices involved in the migration to this architecture.

Set Up S3 Cross-Account Permissions

AWS provides an excellent article on the topic. 

The following is a summary of the steps we took to transfer the data previously held in the Production account to our Data Lake account:

  1. Create a role in the Production account that will be used for the data transfer.
  2. Give the Production role permission to read from the Production bucket.
  3. Create a bucket in the Data Lake account.
  4. If you wish to enable encryption on your Data Lake bucket, you will need to use AWS Key Management Service (KMS) and encrypt the bucket with a Customer Managed Key (CMK). The default S3-managed server-side encryption (SSE-S3) is not compatible with cross-account permissions.
  5. Edit the Data Lake bucket policy to allow the role in the Production account to write objects.
  6. Edit the Data Lake KMS key policy to allow the Production role to encrypt objects using the key.
  7. Edit the role in the Production account to allow it to write objects to the Data Lake bucket.
  8. Edit the role in the Production account to allow it to use the Data Lake key to encrypt objects.
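
To make steps 3 to 6 concrete, below is a minimal Terraform sketch of the Data Lake side. The account IDs, bucket name and role name are hypothetical placeholders, not our actual configuration.

    # Data Lake account (placeholder ID 222222222222). The transfer role in the
    # Production account (placeholder ID 111111111111) is granted access.

    # Customer Managed Key for the Data Lake bucket (step 4).
    resource "aws_kms_key" "data_lake" {
      description = "CMK for the Data Lake bucket"

      # Key policy (step 6): keep the Data Lake account as key administrator and
      # allow the Production role to encrypt objects with this key.
      policy = jsonencode({
        Version = "2012-10-17"
        Statement = [
          {
            Sid       = "DataLakeAccountAdmin"
            Effect    = "Allow"
            Principal = { AWS = "arn:aws:iam::222222222222:root" }
            Action    = "kms:*"
            Resource  = "*"
          },
          {
            Sid       = "AllowProductionRoleToUseKey"
            Effect    = "Allow"
            Principal = { AWS = "arn:aws:iam::111111111111:role/prod-data-transfer" }
            Action    = ["kms:Encrypt", "kms:GenerateDataKey*", "kms:DescribeKey"]
            Resource  = "*"
          }
        ]
      })
    }

    # Data Lake bucket (step 3), encrypted with the CMK by default (step 4).
    resource "aws_s3_bucket" "data_lake" {
      bucket = "konfir-data-lake-example"
    }

    resource "aws_s3_bucket_server_side_encryption_configuration" "data_lake" {
      bucket = aws_s3_bucket.data_lake.id

      rule {
        apply_server_side_encryption_by_default {
          sse_algorithm     = "aws:kms"
          kms_master_key_id = aws_kms_key.data_lake.arn
        }
      }
    }

    # Bucket policy (step 5): allow the Production role to write objects.
    resource "aws_s3_bucket_policy" "allow_prod_writes" {
      bucket = aws_s3_bucket.data_lake.id

      policy = jsonencode({
        Version = "2012-10-17"
        Statement = [
          {
            Sid       = "AllowProductionRoleToWrite"
            Effect    = "Allow"
            Principal = { AWS = "arn:aws:iam::111111111111:role/prod-data-transfer" }
            Action    = "s3:PutObject"
            Resource  = "${aws_s3_bucket.data_lake.arn}/*"
          }
        ]
      })
    }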

Importantly, AWS requires both accounts to allow the role in the Production account to write to the Data Lake bucket; granting the permission on one side only is not enough.

Using Terraform simplifies this process, as resources in the other account can be referenced directly by their ARN.
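
For example, the policy attached to the Production transfer role (steps 2, 7 and 8) can point at the Data Lake bucket and key simply by their ARNs. Again, this is a sketch with placeholder names, region and account IDs rather than our exact configuration:

    # Production account (placeholder ID 111111111111). The Data Lake bucket and
    # key live in another account, so they are referenced by their ARNs.
    resource "aws_iam_role_policy" "data_transfer" {
      name = "cross-account-data-transfer"
      role = "prod-data-transfer" # the role created in step 1

      policy = jsonencode({
        Version = "2012-10-17"
        Statement = [
          {
            Sid    = "ReadFromProductionBucket" # step 2
            Effect = "Allow"
            Action = ["s3:GetObject", "s3:ListBucket"]
            Resource = [
              "arn:aws:s3:::konfir-prod-data-example",
              "arn:aws:s3:::konfir-prod-data-example/*"
            ]
          },
          {
            Sid      = "WriteToDataLakeBucket" # step 7
            Effect   = "Allow"
            Action   = "s3:PutObject"
            Resource = "arn:aws:s3:::konfir-data-lake-example/*"
          },
          {
            Sid      = "UseDataLakeKey" # step 8
            Effect   = "Allow"
            Action   = ["kms:Encrypt", "kms:GenerateDataKey*", "kms:DescribeKey"]
            Resource = "arn:aws:kms:eu-west-2:222222222222:key/11111111-2222-3333-4444-555555555555"
          }
        ]
      })
    }

The same applies in the other direction: the Data Lake bucket policy and key policy name the Production role by its ARN.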

Copy S3 objects

AWS offers two primary methods for copying data between S3 buckets using the CLI: aws s3 cp and aws s3 sync. The main difference is that cp copies all the objects regardless of whether they already exist in the destination (overwriting them), while sync scans the destination and copies objects from the source only if they are new or have been updated. We needed to move everything, so we used cp.

Considerations
  • Evaluating Data Transfer Costs: Copying data across AWS accounts incurs data transfer costs, which can impact the overall budget. Consider the pricing model for PUT and GET requests and data transfer out of the S3 bucket, optimising costs based on data volume and transfer frequency. KMS costs should also be considered.
  • Performance (1): Our bucket held around 40GB of data, consisting of many small files (~100KB each). Run locally, the transfer achieved roughly 400KB/s. Unfortunately, the time needed to copy everything was greater than the maximum duration allowed for assuming the role (12 hours), so the CLI stopped and we had to find another way to copy the data.
  • Performance (2): Learning from the previous point, we set up an EC2 instance in Production with an associated instance role that had access to the Data Lake bucket and key (see the sketch after this list). Throughput went up to ~780KB/s. Although still slow, the data transfer could be completed, as in this case the EC2 instance did not have to assume a time-limited role.
  • Performance (3): We also had to copy one sizeable file of 2.5GB, and for that file the speed reached 215MB/s. Transferring many small files is always slower than transferring one large file of the same total size, as each object introduces overhead (read/write operations, TCP connections, sessions, virus scanning, …).
  • Leveraging Glue Jobs for Data Copy: As data volumes grow, manual data copying with the command-line approach is likely to become too cumbersome. In the future, we’ll be using AWS Glue to automate and orchestrate data copying. Glue Jobs enable seamless data transformation and migration workflows, streamlining the process and improving scalability.
  • Leveraging AWS Lake Formation: Lake Formation simplifies data lake creation, governance and sharing, enabling data engineers and data scientists to focus on analysis rather than data preparation.
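
To illustrate the approach described in Performance (2), the sketch below shows the kind of instance role and instance profile involved. As before, the names, ARNs and account IDs are placeholders rather than our actual setup.

    # Production account: role for the EC2 instance that runs the copy, so the
    # AWS CLI on the instance gets credentials from the instance profile instead
    # of an assumed-role session capped at 12 hours.
    resource "aws_iam_role" "transfer_instance" {
      name = "data-transfer-instance-role"

      # Trust policy: only the EC2 service can use this role.
      assume_role_policy = jsonencode({
        Version = "2012-10-17"
        Statement = [{
          Effect    = "Allow"
          Principal = { Service = "ec2.amazonaws.com" }
          Action    = "sts:AssumeRole"
        }]
      })
    }

    # Access to the Data Lake bucket and key (read access to the source bucket
    # in the same Production account is granted in the same way).
    resource "aws_iam_role_policy" "transfer_instance_access" {
      name = "data-lake-access"
      role = aws_iam_role.transfer_instance.id

      policy = jsonencode({
        Version = "2012-10-17"
        Statement = [
          {
            Effect   = "Allow"
            Action   = "s3:PutObject"
            Resource = "arn:aws:s3:::konfir-data-lake-example/*"
          },
          {
            Effect   = "Allow"
            Action   = ["kms:Encrypt", "kms:GenerateDataKey*", "kms:DescribeKey"]
            Resource = "arn:aws:kms:eu-west-2:222222222222:key/11111111-2222-3333-4444-555555555555"
          }
        ]
      })
    }

    # The instance profile attaches the role to the EC2 instance.
    resource "aws_iam_instance_profile" "transfer" {
      name = "data-transfer-instance-profile"
      role = aws_iam_role.transfer_instance.name
    }

As with the transfer role, the Data Lake bucket policy and KMS key policy must also list this instance role as an allowed principal.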

Conclusion

Copying S3 data between AWS accounts in an MLOps environment requires careful consideration and adherence to best practices. By setting up IAM permissions correctly, using encryption, and optimising data transfer costs, organisations can ensure a secure, efficient, and cost-effective data transfer process. Furthermore, exploring future automation with Glue Jobs offers a scalable way to handle data copying as data volumes increase, paving the way for successful machine learning deployments and operations.