State transitions in Amazon Web Services’ (AWS) standard Step Function workflows were limiting Amenity’s model development cycle. Migrating to AWS Step Function Express workflows enabled Amenity to run its NLP pipelines with significant infrastructure savings and much higher throughput.
This project is a great example of how to achieve high scalability by migrating from standard Step Functions to Step Functions Express within Amazon Web Services (AWS). Here are the results:
Not only did this migration significantly increase our throughput, it also removed the need for users to coordinate workflow processes to create a build.
Amenity develops enterprise NLP platforms for the finance, insurance, and media industries that extract critical insights from mountains of documents. With our software solutions, we provide the fastest, most accurate, and most scalable way for businesses to get a human-level understanding of information from text.
Amenity’s models are developed with a Test-Driven Development (TDD) and Behavior-Driven Development (BDD) approach in order to verify model accuracy, precision, and recall throughout the model lifecycle, from creation to production and maintenance.
AWS Step Functions is a low-code visual workflow service used to orchestrate AWS services, automate business processes, and build serverless applications. Workflows manage failures, retries, parallelization, service integrations, and observability so developers can focus on higher-value business logic.
One of the actions in the Amenity model development cycle is backtesting, which is mainly part of our continuous integration (CI) process. The CI process is responsible for running Amenity’s tests and verifying that each model performs as expected; part of it is running the reviews for all models.
The backtesting process runs hundreds of thousands of annotated examples in each code build. To handle a process of this size, we originally used the Step Functions standard workflow.
We found that the Step Functions standard workflow has a bucket of 5,000 state transitions with a refill rate of 1,500 per second. Each annotated example requires around 10 state transitions, which adds up to millions of state transitions per code build. Since state transitions are throttled and the quota couldn’t be raised to an amount that satisfied our needs, we often faced delays and timeouts. Developers had to coordinate their work with each other, slowing down the entire development cycle.
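A back-of-envelope calculation shows why the token bucket becomes the bottleneck. The sketch below assumes 200,000 annotated examples per build (the post only says “hundreds of thousands”) and ignores everything except the state-transition throttle:

```python
# Lower bound on wall-clock time imposed by the standard-workflow
# state-transition token bucket (bucket size 5,000, refill 1,500/s).
# The 200,000-example figure is an assumption for illustration.
BUCKET_SIZE = 5_000
REFILL_PER_SEC = 1_500
TRANSITIONS_PER_EXAMPLE = 10

def min_runtime_seconds(num_examples: int) -> float:
    """Minimum runtime when state transitions alone are the bottleneck."""
    total = num_examples * TRANSITIONS_PER_EXAMPLE
    burst = min(total, BUCKET_SIZE)          # the first burst drains the bucket
    return (total - burst) / REFILL_PER_SEC  # the rest waits on the refill rate

print(round(min_runtime_seconds(200_000) / 60))  # roughly 22 minutes of throttling
```

Even with infinitely fast compute, 2 million transitions through a 1,500/s refill means over 20 minutes of pure throttling per build, before any real work.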
In addition, we needed to change the way each step in the pipeline is triggered: from an asynchronous call to a synchronous API call.
When a model developer merges their new changes, the CI process starts backtesting for all existing models.
For each model, the backtesting process checks whether the model’s review items were already uploaded and saved in the Amazon Simple Storage Service (S3) cache. The check uses a unique key representing the list of items. Once a model has been reviewed, its review items rarely change, so we want to avoid re-uploading them on every build.
If the review items haven’t been uploaded yet, the backtesting process uploads them and triggers an unarchive process so they can be used in the execution phase.
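The cache check described above can be sketched as follows. The key scheme (a hash of the canonical JSON of the item list) and the S3 layout are assumptions for illustration; the post only says the check uses a unique key representing the list of items:

```python
import hashlib
import json

def review_items_key(items: list[dict]) -> str:
    """Unique, order-independent key for a list of review items.
    (Hashing canonical JSON is an assumed scheme, not Amenity's actual one.)"""
    canonical = json.dumps(sorted(items, key=lambda i: i["id"]), sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

def ensure_uploaded(s3, bucket: str, items: list[dict]) -> str:
    """Upload the items to the S3 cache only if the key is not already there."""
    key = f"review-items/{review_items_key(items)}.json"
    try:
        s3.head_object(Bucket=bucket, Key=key)  # cache hit: nothing to upload
    except s3.exceptions.ClientError:
        s3.put_object(Bucket=bucket, Key=key, Body=json.dumps(items))
    return key
```

Because the key is derived from the items themselves, an unchanged review set always hits the cache, and any edit to the items naturally produces a new key.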
When the items are uploaded, an execute request containing the review items key is sent through Amazon API Gateway.
The request is forwarded to an AWS Lambda function, which validates the request and inserts a job message into an Amazon Simple Queue Service (SQS) queue.
The SQS messages are consumed by a limited number of concurrent Lambda functions, each of which synchronously invokes a Step Function. The number of Lambdas is limited to ensure the Lambda concurrency limit of the production environment is never reached.
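A minimal sketch of that consumer Lambda, assuming a simple job-message shape and a placeholder state machine ARN (neither is given in the post). The key API here is `start_sync_execution`, which is only available for Express workflows and blocks until the execution finishes:

```python
import json

# Placeholder ARN; the real one is not given in the post.
STATE_MACHINE_ARN = "arn:aws:states:us-east-1:123456789012:stateMachine:backtest-express"

def handler(event, context, sfn=None):
    """Consume a batch of SQS job messages and run one synchronous
    Express execution per message. Reserved concurrency on this function
    caps how many executions run at once, so the account-wide Lambda
    concurrency limit is never exhausted."""
    if sfn is None:  # allow injecting a stub client in tests
        import boto3
        sfn = boto3.client("stepfunctions")
    statuses = []
    for record in event["Records"]:
        job = json.loads(record["body"])
        # start_sync_execution blocks until the Express execution completes
        # and returns its status and output in the response.
        resp = sfn.start_sync_execution(
            stateMachineArn=STATE_MACHINE_ARN,
            input=json.dumps(job),
        )
        statuses.append(resp["status"])
    return statuses
```

Making the Lambda wait on the synchronous call is what replaces the old async hand-off: the caller learns the result of the whole pipeline run in the same invocation.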
When an item finishes in the Step Function, it produces an SQS message containing a notification. These messages are inserted into a queue and consumed in batches by a Lambda function, which aggregates them by end user and sends each user a single AWS IoT message containing all of their relevant notifications.
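The batching step can be sketched as below. The `user` field in the message body and the IoT topic scheme are assumptions for illustration:

```python
import json
from collections import defaultdict

def group_by_user(records: list[dict]) -> dict[str, list[dict]]:
    """Aggregate a batch of SQS notification records by end user.
    (The 'user' field name is an assumed message-body field.)"""
    grouped = defaultdict(list)
    for record in records:
        msg = json.loads(record["body"])
        grouped[msg["user"]].append(msg)
    return dict(grouped)

def publish_notifications(iot, records: list[dict]) -> None:
    """Send each user one AWS IoT message bundling all their notifications."""
    for user, messages in group_by_user(records).items():
        iot.publish(
            topic=f"notifications/{user}",  # topic scheme is assumed
            qos=1,
            payload=json.dumps({"messages": messages}),
        )
```

Batching before publishing means one IoT message per user per batch rather than one per pipeline item, which keeps client-side notification traffic proportional to the number of users.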
In order to change from async to sync processing, we had to replace SNS + SQS with Amazon API Gateway.
The following diagram shows the processing of a single document in a Step Functions Express workflow:
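In Amazon States Language, a per-document Express workflow of this shape might look like the fragment below. The state names and Lambda ARNs are placeholders, not Amenity’s actual definition:

```json
{
  "Comment": "Sketch of a per-document Express workflow (state names assumed)",
  "StartAt": "Preprocess",
  "States": {
    "Preprocess": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:preprocess",
      "Next": "Analyze"
    },
    "Analyze": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:nlp-analyze",
      "Next": "Notify"
    },
    "Notify": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:notify",
      "End": true
    }
  }
}
```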
For our base NLP analysis, we use spaCy. The following diagram shows how we use it within the Step Functions Express workflow:
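A minimal spaCy sketch of such a base analysis step. The post does not describe the actual pipeline components, so this uses a blank English pipeline (tokenizer only, no model download) purely for illustration:

```python
import spacy

# spacy.blank("en") builds a tokenizer-only English pipeline; a real
# deployment would load a trained pipeline with spacy.load(...) instead.
nlp = spacy.blank("en")

def analyze(text: str) -> list[str]:
    """Tokenize a document and return its tokens."""
    doc = nlp(text)
    return [token.text for token in doc]

print(analyze("Amenity extracts insights from text."))
```

In a Lambda-based step like the ones above, the pipeline would be loaded once at module scope so that warm invocations skip the (comparatively expensive) load.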
By migrating from standard Step Functions to Step Functions Express, Amenity made its process 15 times faster: a complete pipeline that took around 45 minutes with standard Step Functions now completes in about 3 minutes with Step Functions Express.
The previous limitations forced users to coordinate with each other and run just a single execution at a time; otherwise their builds would have failed. The migration removes this limit, and users can now execute processes whenever they like.
As we mentioned before, our CI process contains two parts (which run in parallel):
We had two migration phases: the backtesting phase and the unit test phase. The backtesting part was reduced from more than one hour to approximately six minutes (P95); this migration was deployed on August 10. The unit test part was reduced from around 25 minutes to around 30 seconds; this migration was deployed on September 14.
We can see the effect in the following diagram:
After the first migration, the CI was limited by the unit tests, which took about 25 minutes. When the second migration was deployed, the total time dropped to about 6 minutes (P95).
This communication does not represent investment advice. Transcript text provided by S&P Global Market Intelligence.
Copyright ©2021 Amenity Analytics.