
Over the last few months, Konfir has shifted focus from returning data to helping clients make a decision on an employment verification given the data available. That’s why we released the Employment Timeline, and it’s why we’re now investing heavily in our Machine Learning capabilities.
To streamline our Machine Learning Operations (MLOps) and ensure a secure and organised approach to data management, at Konfir we made the strategic decision to migrate from a single AWS account to a three-account infrastructure. Previously, all data-related work was consolidated within a single account. The new setup includes separate AWS accounts for Production, Data Science, and Data Lake, each fulfilling a distinct role in our MLOps ecosystem. It became imperative to adopt the AWS Well-Architected Framework and consider essential factors to maintain data integrity, adhere to stringent security standards, optimise costs, and set the stage for scalable and efficient data management.
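As a rough illustration of the shape of this setup (not our actual configuration), member accounts like these can be created under an existing AWS Organization with Terraform. The account names and email addresses below are placeholders.

```hcl
# Illustrative sketch only: three member accounts under an existing AWS Organization.
# Account names and email addresses are placeholders, not real values.
resource "aws_organizations_account" "production" {
  name  = "production"
  email = "aws+production@example.com"
}

resource "aws_organizations_account" "data_science" {
  name  = "data-science"
  email = "aws+data-science@example.com"
}

resource "aws_organizations_account" "data_lake" {
  name  = "data-lake"
  email = "aws+data-lake@example.com"
}
```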
By implementing this multi-account architecture, Konfir extends significant benefits to our clients and internal teams. The separation into three accounts builds on best practices around data management: fine-grained access controls, so specific permissions can be granted to specific teams; better cost management, as each account is monitored and optimised separately; separation of concerns, with each account dedicated to different use cases; and isolated environments, reducing the risk of interference between workloads.
While having one AWS account is easy to manage—allowing straightforward access to resources within the account—transitioning to a multi-account architecture necessitates additional configuration. In this blog post, we will delve into the process of setting up cross-account S3 permissions, discussing the considerations and best practices involved in the migration to this architecture.
AWS provides an excellent article on the topic.
The following is a summary of the steps we took to transfer the data that was previously in the Production account to our Data Lake account:
Importantly, AWS requires both accounts to grant the access: the role in the Production account needs an IAM policy allowing it to write to the Data Lake bucket, and the Data Lake bucket needs a bucket policy allowing that role. Configuring only one side is not enough.
Using Terraform simplifies this process, as resources in the other account can be referenced directly by their ARN, as sketched below.
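The following is a minimal sketch of both sides of the configuration, not our exact code. It assumes two provider aliases (one per account), plus an `aws_s3_bucket.data_lake` bucket and an `aws_iam_role.migration` role defined elsewhere in the same configuration; all of these names are hypothetical.

```hcl
# Sketch only: provider aliases, bucket and role names are hypothetical.

# Side 1 (Production account): allow the migration role to write to the Data Lake bucket.
resource "aws_iam_role_policy" "write_to_data_lake" {
  provider = aws.production
  name     = "write-to-data-lake"
  role     = aws_iam_role.migration.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect   = "Allow"
        Action   = ["s3:ListBucket"]
        Resource = aws_s3_bucket.data_lake.arn
      },
      {
        Effect   = "Allow"
        Action   = ["s3:GetObject", "s3:PutObject"]
        Resource = "${aws_s3_bucket.data_lake.arn}/*"
      },
    ]
  })
}

# Side 2 (Data Lake account): the bucket policy trusts that same role.
resource "aws_s3_bucket_policy" "allow_production_migration_role" {
  provider = aws.data_lake
  bucket   = aws_s3_bucket.data_lake.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect    = "Allow"
        Principal = { AWS = aws_iam_role.migration.arn }
        Action    = ["s3:ListBucket"]
        Resource  = aws_s3_bucket.data_lake.arn
      },
      {
        Effect    = "Allow"
        Principal = { AWS = aws_iam_role.migration.arn }
        Action    = ["s3:GetObject", "s3:PutObject"]
        Resource  = "${aws_s3_bucket.data_lake.arn}/*"
      },
    ]
  })
}
```

Because both resources live in the same Terraform configuration, each side can reference the other account’s resources by ARN, which is the simplification mentioned above.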
AWS offers two primary methods for copying data across S3 buckets using the CLI: aws s3 cp and aws s3 sync. The main difference is that cp copies every object regardless of whether it already exists in the destination (overwriting it), while sync scans the destination and copies only the objects from the source that are new or have changed. We needed to move everything, so we used cp.
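For reference, the commands below sketch the transfer; the bucket names and CLI profile are placeholders, and the profile is assumed to resolve to the Production role that the bucket policy trusts. Depending on the destination bucket’s Object Ownership settings, an additional --acl bucket-owner-full-control flag may be needed so the Data Lake account owns the copied objects.

```bash
# Placeholder bucket names and profile.
# cp with --recursive copies every object, overwriting anything already in the destination.
aws s3 cp s3://production-data-bucket/ s3://data-lake-bucket/ \
  --recursive \
  --profile production

# sync only copies objects that are new or have changed since the last run.
aws s3 sync s3://production-data-bucket/ s3://data-lake-bucket/ --profile production
```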
Copying S3 buckets across AWS accounts in an MLOps environment requires careful consideration and adherence to best practices. By setting up IAM permissions, using encryption, and optimising data transfer costs, organisations can ensure a secure, efficient, and cost-effective data transfer process. Furthermore, exploring future automation with Glue Jobs offers a scalable solution to handle data copying as data volumes increase, paving the way for successful machine learning deployments and operations.