Creating an image manipulation pipeline with Amazon S3 and Lambda functions

Uploading, resizing and backing up thousands of images stored in a basement


I’m working on a project involving thousands of massive TIFF images, which are sitting in a basement. Those images need to be backed up to Amazon’s Glacier service and made available in smaller JPEG form elsewhere (a destination that, for all intents and purposes, is an Amazon S3 bucket).

Accessing the images

I wanted to be able to upload the images without my broadband being throttled, so I figured the best way to do that would be to set up an Ubuntu VPS which would connect to the basement file storage system via VPN. I won’t document that process as it’s very specific, but it involved installing the Ubuntu GUI and some other packages to get it working with Microsoft Remote Desktop on my Mac.

Once I had access to the files, the next step was to install s3cmd to traverse the file structure and upload any TIFFs to S3.

The Lambda function

AWS Lambda is Amazon’s serverless compute platform, which lets you run arbitrary code when a specific thing happens: in this case, when a new object is uploaded to a particular S3 bucket.

This setup comes with some gotchas.

The Lambda Python environment isn’t containerised

You can’t just install libraries using pip install like you can with literally every other system that runs Python on demand. Instead, you have to package up your entire Python stack as a Zip file and upload it… which is silly because you’re almost always going to need to use an external library if you’re doing anything useful.

In order to get this to work, you need to create a Docker image containing your Python code and its dependencies, zip everything up inside the container, and then copy the zip file out to your machine so you can upload it to AWS. I got a lot of help from Mark van Holsteijn’s article on the subject.

My Dockerfile looks like this:

FROM python:3.7
RUN apt-get update && apt-get install -y zip
WORKDIR /lambda

ADD requirements.txt /tmp
RUN pip install -t /lambda -r /tmp/requirements.txt && find /lambda -type d | xargs chmod ugo+rx && find /lambda -type f | xargs chmod ugo+r

ADD *.py /lambda/
RUN find /lambda -type d | xargs chmod ugo+rx && find /lambda -type f | xargs chmod ugo+r

RUN python -m compileall -q /lambda
RUN zip --quiet -9r /lambda.zip .

FROM scratch
COPY --from=0 /lambda.zip /

You then need to build your Docker image and extract the Zip file. I tagged my Docker image ingesto as that’s the name of the little script I’ve written, so the process looks like this:

docker build -t ingesto .
ID=$(docker create ingesto /bin/true)
docker cp $ID:/lambda.zip ~/Desktop

That puts a file on my desktop, which I can then upload to AWS.

Lambda messes with S3 key names

When you set up your Lambda function, you need to add an S3 trigger that runs on every object-create event. AWS then passes information about the uploaded object to your Lambda function (specifically, to the lambda_handler function in your file).
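As a sketch, the handler pulls the bucket name and object key out of a nested event dictionary. The field names below follow the S3 event notification format; the sample values are made up:

```python
def lambda_handler(event, context):
    # S3 event notifications arrive as a list of records; each record
    # carries the bucket name and the object key of the uploaded file.
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]  # arrives URL-encoded
    return bucket, key
```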

However, it gets confused by spaces in filenames: when it passes the object key (basically the filename) to the Lambda function, it URL-encodes the key, replacing spaces with + symbols. That would be fine, except Amazon’s own boto3 library (their Python SDK) throws errors when you pass that exact key name into the S3 client to download the image.

For the avoidance of doubt, there’s no reason the Lambda trigger should futz with the key name because the whole data dump is JSON-encoded anyway.

A simple workaround is to use urllib.parse.unquote_plus to fix the key name before you pass it to boto3 to download the contents.
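The fix is a one-liner from the standard library. For example, with a made-up key:

```python
from urllib.parse import unquote_plus

# The key arrives with spaces as '+' and other characters percent-encoded.
key = "scans/Basement+Box+01/IMG%20001.tif"
real_key = unquote_plus(key)
print(real_key)  # scans/Basement Box 01/IMG 001.tif
```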

The timeout setting is wrong

Lambda functions have a default three-second timeout. That would be fine, except there’s no way this function can download and resize a massive TIFF within that timeframe, so you’ll need to increase the value in the function’s settings (in the AWS dashboard, not the code). Even a very large image shouldn’t take longer than 30 seconds.

Resizing and uploading

The rest is really simple: I use Pillow to resize the image to a maximum of 3000 pixels wide or tall, and upload the result to my destination.
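Pillow’s Image.thumbnail((3000, 3000)) does that constrained resize in place; the maths it applies boils down to something like this (a sketch, with a hypothetical helper name):

```python
def constrained_size(width, height, max_side=3000):
    """Scale (width, height) down so the longer side is at most max_side,
    preserving aspect ratio. Images already small enough are left alone."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    return round(width * scale), round(height * scale)

print(constrained_size(6000, 4000))  # (3000, 2000)
```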

Again I use boto3 for this. The bucket I’m pushing to isn’t actually in AWS, but it uses an S3-compatible API, so all I have to do is configure the client to use the destination bucket credentials… which looks something like this:

import boto3
session = boto3.Session()
s3 = session.client(
    "s3",
    endpoint_url="https://storage.example.com",  # placeholder for the S3-compatible endpoint
    aws_access_key_id=DESTINATION_KEY_ID,
    aws_secret_access_key=DESTINATION_SECRET,
)

I do a bit of MIME type sniffing for the file I want to upload — which will always be a JPEG, but I like to write isolated functions that don’t make assumptions — and then upload the file to its new location, under a more unified naming scheme (removing spaces for a start).
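One minimal way to do that sniffing with just the standard library is to check the file’s leading magic bytes rather than trusting the filename (the function name here is mine, not from the script):

```python
def sniff_mime(data: bytes) -> str:
    """Guess an image MIME type from the file's leading magic bytes."""
    if data.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    if data[:4] in (b"II*\x00", b"MM\x00*"):  # little- and big-endian TIFF
        return "image/tiff"
    return "application/octet-stream"

print(sniff_mime(b"\xff\xd8\xff\xe0" + b"\x00" * 16))  # image/jpeg
```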

Other considerations

When I first set this up, I completely messed up the access control. It was important that files uploaded from the basement weren’t publicly readable, so I set the ACL on those files to authenticated-users upon upload. I don’t know why, but the first time I tried this function, I was getting 403 errors, as boto3 thought it didn’t have access to the images. I tried creating new user roles and messing with bucket policies to no avail. In the end, I scrapped everything and started again, creating a role from an Amazon-provided template that just gives itself read-only access to bucket contents. I think technically it would have access to any other bucket’s data within my account, but that’s not an issue.
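For reference, Amazon’s read-only S3 template boils down to a policy along these lines (the wildcard Resource is what grants access across every bucket in the account):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:Get*", "s3:List*"],
      "Resource": "*"
    }
  ]
}
```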

The bucket policy seems to be pretty lax, but I’ve checked, and the files being uploaded aren’t accessible to the outside world, so I’m happy.
