Python Flask: Asynchronous Upload to AWS S3

Agus Richard · Published in Level Up Coding · Jun 12, 2022 · 6 min read


Achieving sub-second response when uploading files to AWS S3 with threading and Celery

Photo by Bluewater Sweden on Unsplash

There may come a time when your client asks you to find a way to upload a relatively big file, and to do it relatively fast, even though you can't immediately know the result of the upload.

In another scenario, you might need to preprocess data stored in CSV or Excel files, with several steps that have to happen before the data arrives at your data lake or database.

Uploading (and possibly processing) data usually takes time, and it's a poor experience to make a user wait until the whole process is completed. In cases like these, an asynchronous upload gives you a lot of benefits.

In this article, we’ll take a look at several ways to upload files synchronously and asynchronously. Let’s go right into it!

Preparations

Before going further and writing some code, let's prepare everything we need to start our project.

First, ensure that you have an AWS access key and secret access key, since these are essential for uploading files to an S3 bucket.

After that, we need to create an S3 bucket. Go to your AWS console and search for S3, then click "Create Bucket". Fill in the bucket name (it must be globally unique) and the region (keep it close to you and your users). For the sake of this tutorial, we'll grant public read access to our S3 bucket.


Next, go to your newly created bucket and open the Permissions tab. Then, add or edit the bucket policy with this JSON:
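A typical public-read bucket policy looks like the following (the Sid value is arbitrary):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "PublicReadGetObject",
      "Effect": "Allow",
      "Principal": "*",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::BUCKET_NAME/*"
    }
  ]
}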

Remember to change BUCKET_NAME to your own bucket name.

To store sensitive data, we'll use a .env file. Create .env inside your project's root directory.

Put your AWS access key and secret access key there, along with the S3 bucket name and the S3 bucket base URL. Typically, your S3 bucket base URL will be in this format: https://<BUCKET_NAME>.s3.<REGION>.amazonaws.com, for example https://mybucket.s3.ap-southeast-1.amazonaws.com.
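As a sketch, the .env file might look like this (the variable names are placeholders that I'll reuse below; pick whatever names you prefer, as long as the configuration code reads the same keys):

AWS_ACCESS_KEY_ID=your-access-key-id
AWS_SECRET_ACCESS_KEY=your-secret-access-key
S3_BUCKET_NAME=your-bucket-name
S3_BUCKET_BASE_URL=https://your-bucket-name.s3.ap-southeast-1.amazonaws.com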

Finally, to make sure that we can run Celery on our local machine, we need Docker and docker-compose. If you don't have them installed, you can follow the documentation at https://docs.docker.com/engine/install/.

Coding Time

In the previous section, we created our S3 bucket and granted public read access. We also made sure that the Docker engine and docker-compose were installed, and that we had an AWS access key and secret access key.

In this section, we’ll start building our project by first creating a virtual environment. Run this to create a virtual environment:

python -m venv venv

Then activate the virtual environment by running:

# Mac OS / Linux
source venv/bin/activate
# Windows
venv\Scripts\activate

Now, install the required dependencies (celery[redis] pulls in the Redis client we'll need for the broker, and python-dotenv lets us read the .env file):

pip install flask Flask-SQLAlchemy boto3 "celery[redis]" python-dotenv

For the configuration, let’s create a file named config.py and put this code there:
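A minimal sketch of what config.py could contain (assuming python-dotenv reads the .env file; the environment variable names match the .env sketch above and are illustrative):

# config.py
import os

from celery import Celery
from dotenv import load_dotenv

load_dotenv()  # read credentials and settings from .env


class Config:
    SQLALCHEMY_DATABASE_URI = os.getenv("DATABASE_URL", "sqlite:///app.db")
    SQLALCHEMY_TRACK_MODIFICATIONS = False
    AWS_ACCESS_KEY_ID = os.getenv("AWS_ACCESS_KEY_ID")
    AWS_SECRET_ACCESS_KEY = os.getenv("AWS_SECRET_ACCESS_KEY")
    S3_BUCKET_NAME = os.getenv("S3_BUCKET_NAME")
    S3_BUCKET_BASE_URL = os.getenv("S3_BUCKET_BASE_URL")
    CELERY_BROKER_URL = os.getenv("CELERY_BROKER_URL", "redis://localhost:6379/0")
    CELERY_RESULT_BACKEND = os.getenv("CELERY_RESULT_BACKEND", "redis://localhost:6379/0")


# Celery instance that tasks will be registered against later on
celery = Celery(
    __name__,
    broker=Config.CELERY_BROKER_URL,
    backend=Config.CELERY_RESULT_BACKEND,
)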

Here, we create a Config class to store all the required information (including sensitive data) and a Celery instance that we'll use to define Celery tasks later on.

Since we're uploading our files to an S3 bucket and need to know where those files end up, we need a database to store the target file URL. To do that, create a file named models.py with this content:
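A minimal sketch of models.py (column lengths and the enum class name are illustrative):

# models.py
import enum

from flask_sqlalchemy import SQLAlchemy

db = SQLAlchemy()


class UploadStatus(enum.Enum):
    PENDING = "PENDING"
    PROCESSING = "PROCESSING"
    COMPLETE = "COMPLETE"
    ERROR = "ERROR"


class File(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    name = db.Column(db.String(255), nullable=False)
    url = db.Column(db.String(1024), nullable=True)
    upload_status = db.Column(
        db.Enum(UploadStatus), default=UploadStatus.PENDING, nullable=False
    )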

Here, we define a model entity called File with the fields id, name, url, and upload_status. The upload status is an Enum, so it can only be one of four values: PENDING, PROCESSING, COMPLETE, and ERROR. This status is important because we upload files asynchronously: a user doesn't need to wait until the upload is done and can check the upload status at any time.

Now for the upload functions. Create file.py and put this there:
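A sketch of what file.py might look like (the use of secure_filename and the base64 encoding for the to_utf8 path are illustrative choices):

# file.py
import base64
import io
import time

import boto3
from werkzeug.utils import secure_filename

from config import Config

s3_client = boto3.client(
    "s3",
    aws_access_key_id=Config.AWS_ACCESS_KEY_ID,
    aws_secret_access_key=Config.AWS_SECRET_ACCESS_KEY,
)


def rename_file(filename):
    """Combine a sanitized version of the original filename with the current timestamp."""
    return f"{int(time.time())}_{secure_filename(filename)}"


def process_file_to_stream(file, to_utf8=False):
    """Turn a FileStorage object into a plain dict that can be handed to a
    thread or, when to_utf8=True, serialized and sent to a Celery task."""
    data = file.read()
    if to_utf8:
        # Celery task arguments must be JSON serializable, so encode the bytes as text
        data = base64.b64encode(data).decode("utf-8")
    return {
        "stream": data,
        "name": rename_file(file.filename),
        "content_type": file.content_type,
    }


def upload_file(file):
    """Plain synchronous upload of a FileStorage object to the S3 bucket."""
    key = rename_file(file.filename)
    s3_client.upload_fileobj(
        file,
        Config.S3_BUCKET_NAME,
        key,
        ExtraArgs={"ContentType": file.content_type},
    )
    return f"{Config.S3_BUCKET_BASE_URL}/{key}"


def upload_file_from_stream(file_dict, from_utf8=False):
    """Rebuild a file-like object from the stream dict and upload it."""
    data = file_dict["stream"]
    if from_utf8:
        data = base64.b64decode(data)
    s3_client.upload_fileobj(
        io.BytesIO(data),
        Config.S3_BUCKET_NAME,
        file_dict["name"],
        ExtraArgs={"ContentType": file_dict["content_type"]},
    )
    return f"{Config.S3_BUCKET_BASE_URL}/{file_dict['name']}"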

Explanations for the above code snippet:

  • We have four functions in total, all related to uploading files. The first one is rename_file. Just like the name suggests, it renames the file we want to upload by combining the original filename with the current timestamp, and it also makes sure the user provided a sensible filename in the first place.
  • process_file_to_stream converts a file object into a byte stream, because we can't pass a file object directly to a callable running inside a thread or a Celery task. (Another way to handle this would be to store the file temporarily and upload it to the S3 bucket from disk.) Also notice the to_utf8 argument: it's used when we upload files with Celery, because we can't send raw bytes to a Celery task (everything passed to a Celery task must be JSON serializable).
  • upload_file is the normal, synchronous way to upload files to an S3 bucket. We don't need to convert anything to a byte stream; we can upload the file directly with boto3's upload_fileobj function.
  • Inside upload_file_from_stream, we convert the bytes (or UTF-8 string) back into a file-like object and then upload it as usual with upload_fileobj.

Before we start testing our project, we need to expose the functions we defined through endpoints. Let's create a Python file named routes.py:
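A sketch of routes.py (the blueprint name, the response payloads, and the task name upload_to_s3 are illustrative):

# routes.py
from threading import Thread

from flask import Blueprint, current_app, jsonify, request

from file import process_file_to_stream, upload_file, upload_file_from_stream
from models import File, UploadStatus, db
from tasks import upload_to_s3

routes = Blueprint("routes", __name__)


@routes.route("/")
def index():
    return jsonify({"message": "Hello there!"})


@routes.route("/normal_upload", methods=["POST"])
def normal_upload():
    file = request.files["file"]
    url = upload_file(file)  # blocks until the upload finishes
    record = File(name=file.filename, url=url, upload_status=UploadStatus.COMPLETE)
    db.session.add(record)
    db.session.commit()
    return jsonify({"id": record.id, "url": url})


def __async_upload(app, file_dict, file_id):
    # The background thread needs an application context to use the database
    with app.app_context():
        record = db.session.get(File, file_id)
        try:
            record.url = upload_file_from_stream(file_dict)
            record.upload_status = UploadStatus.COMPLETE
        except Exception:
            record.upload_status = UploadStatus.ERROR
        db.session.commit()


@routes.route("/async_upload", methods=["POST"])
def async_upload():
    file = request.files["file"]
    file_dict = process_file_to_stream(file)
    record = File(name=file_dict["name"], upload_status=UploadStatus.PENDING)
    db.session.add(record)
    db.session.commit()

    # Hand the real app object (not the proxy) to the background thread
    thread = Thread(
        target=__async_upload,
        args=(current_app._get_current_object(), file_dict, record.id),
    )
    thread.start()
    return jsonify({"id": record.id, "status": record.upload_status.value})


@routes.route("/celery_upload", methods=["POST"])
def celery_upload():
    file = request.files["file"]
    file_dict = process_file_to_stream(file, to_utf8=True)
    record = File(name=file_dict["name"], upload_status=UploadStatus.PENDING)
    db.session.add(record)
    db.session.commit()

    upload_to_s3.delay(file_dict, record.id)
    return jsonify({"id": record.id, "status": record.upload_status.value})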

Explanations for the above code snippet:

  • index : This is pretty self-explanatory, right? 😊
  • normal_upload : This endpoint uses the synchronous upload function, so users have to wait until the upload is completed.
  • async_upload : This endpoint uses threading to handle the upload, so it returns immediately without waiting for the process to finish. Notice that we create a Thread instance and provide it with a callable function plus the arguments needed to run it. Inside the __async_upload definition, we use the application context so we can interact with the application and the database.
  • celery_upload : This endpoint requires a Celery task to be defined. We'll do that next.

Let's define the Celery task that is used inside celery_upload:
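As a sketch, the task could live in a tasks.py file like this (the task name and the create_app import are illustrative; the worker builds an application context so it can reach the database):

# tasks.py
from config import celery
from file import upload_file_from_stream
from models import File, UploadStatus, db


@celery.task(name="tasks.upload_to_s3")
def upload_to_s3(file_dict, file_id):
    # Imported here to avoid a circular import at module load time
    from app import create_app

    app = create_app()
    with app.app_context():
        record = db.session.get(File, file_id)
        try:
            # from_utf8=True because the stream was sent as a UTF-8 string
            record.url = upload_file_from_stream(file_dict, from_utf8=True)
            record.upload_status = UploadStatus.COMPLETE
        except Exception:
            record.upload_status = UploadStatus.ERROR
        db.session.commit()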

Pretty simple, right? First, the task fetches a File entity using file_id and uploads the file (in the form of a dictionary containing the byte stream, filename, content type, and so on) using upload_file_from_stream. Then it commits the completed status if everything works correctly, and the error status if an exception happens.

Finally, combine all this code in app.py. This is the entry point of our project.
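A minimal sketch of app.py (the application factory pattern used here is an illustrative choice):

# app.py
from flask import Flask

from config import Config
from models import db
from routes import routes


def create_app():
    app = Flask(__name__)
    app.config.from_object(Config)

    db.init_app(app)
    app.register_blueprint(routes)

    with app.app_context():
        db.create_all()  # create the file table if it doesn't exist yet

    return app


app = create_app()

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000, debug=True)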

With everything set in place, we can now test our project. But first, we need a Dockerfile and a docker-compose file if we want to test the Celery upload endpoint.

Create a Dockerfile and a docker-compose.yaml:
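As a sketch, the two files might look like this (the Python image tag, the service names, and the shared bind mount that lets both containers see the same SQLite file are illustrative assumptions):

# Dockerfile
FROM python:3.9-slim
WORKDIR /app
RUN pip install --no-cache-dir flask Flask-SQLAlchemy boto3 "celery[redis]" python-dotenv
COPY . .
CMD ["python", "app.py"]

# docker-compose.yaml
version: "3.8"

services:
  app:
    build: .
    ports:
      - "5000:5000"
    env_file: .env
    environment:
      - CELERY_BROKER_URL=redis://redis:6379/0
      - CELERY_RESULT_BACKEND=redis://redis:6379/0
    volumes:
      - .:/app
    depends_on:
      - redis

  worker:
    build: .
    command: celery -A tasks worker --loglevel=info
    env_file: .env
    environment:
      - CELERY_BROKER_URL=redis://redis:6379/0
      - CELERY_RESULT_BACKEND=redis://redis:6379/0
    volumes:
      - .:/app
    depends_on:
      - redis

  redis:
    image: redis:6-alpine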

Sanity Check

Let's start our project by running:

docker-compose up

This command will create three containers: the application, a Redis instance, and a Celery worker.

Send a POST request to http://localhost:5000/normal_upload to test the normal upload. Provide form data with the key file and the file you want to upload as the value.

You can do the same with /async_upload and /celery_upload.
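For example, with curl (the file path here is just an example):

curl -F "file=@./sample.pdf" http://localhost:5000/normal_upload
curl -F "file=@./sample.pdf" http://localhost:5000/async_upload
curl -F "file=@./sample.pdf" http://localhost:5000/celery_upload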

How long does it take to complete the upload request? In my case, without a very fast internet connection, it takes about 5 seconds to upload a 10 MB PDF file using the normal upload, less than 100 ms using the async upload, and 100–200 ms using the Celery upload. What about you?

Disclaimer: the numbers above are certainly not concrete benchmarking evidence. To make such a claim, it would be better to do proper load testing. If you're curious about load testing, I have good resources for you:

Conclusion

We've seen several ways to upload files to an AWS S3 bucket: a normal upload, an upload using threading, and an upload using Celery. In general, we can leverage threading and a job queue for our needs, in this case achieving a sub-second response when a user uploads a file.

You can also use this approach when you need to preprocess the uploaded files, for example creating multiple copies of an image at different sizes, or cleaning up data points for your data science project.

Here is the coding material for this article:

Thank you for reading and happy coding!

