Breaking up a monolith

Some ideas on dividing a large codebase into smaller projects that all communicate with a central API.

Project Anvil is the name I gave to the biggest and most significant project refactor I’ve ever undertaken. I don’t want to call it a rewrite - although it really feels like one - because the actual functionality and language aren’t changing; mostly I’m just moving bits of code around and changing how the disparate parts connect and are deployed. Also, a rewrite sounds scary, and a refactor sounds grown-up.

Some background

Podiant is the project in question, and it can be considered a monolith, as it’s a single large codebase run on a single framework. The databases are cloud-hosted and there are some microservices for things like download tracking and the beginnings of an Alexa API, but the marketing site and the dashboard are thoroughly traditional Django apps with URL routes, views and templates rendered on the server.

Prior to this, the biggest refactor I did was to convert the project to Python 3, and change the way the app was hosted and deployed, moving from pets to cattle (a paradigm shift that means you spin up new servers to replace old ones, instead of deploying new code and configuration to existing servers which you manually provision and maintain).

But the future is containerisation via Docker, which adds a layer of abstraction to your setup, meaning you can have multiple machines each running multiple instances of your codebase, and have intelligent load-balancing between servers, so you can scale up and down (in theory) without actually having to provision new servers. (That’s not strictly a Docker thing but a subsequent add-on.)

To begin with, Project Anvil was just about adding an API to Podiant, and refactoring the dashboard into a JavaScript app that would leverage the API, instead of the server rendering all the user interface elements. I only recently went down the Docker route after getting frustrated at dealing with constantly failing VirtualBox images. It then occurred to me that if I were to containerise the project, I might as well do it right.

Doing it wrong

The biggest mistake I’ve made so far is wasting time developing my own JavaScript framework, built on jQuery, to run the dashboard. It works pretty well, but it’s not a good move as it’s entirely unproven. I’m a pretty solid developer, but I want my codebase to be built on trusted, proven technologies and as little custom code as possible.

The brains at Ember.js - the framework I eventually settled on - have spent years figuring out the best way to handle data from external sources and cache it locally, support different browsers, manage the DOM and events so the browser doesn’t run out of memory… I’m just far better off standing on their shoulders.

Doing it right

A few days ago I decided to go HAM, and completely blitz everything in my codebase and start again. I’m still ironing out the details, but the current plan is to go with a Docker app for the REST API, which will be the heart of the codebase, an Ember app for the marketing site, another for the dashboard and something yet-to-be-determined for the podcast pages (I’ll explain why that’s complicated in a bit).

The final piece concerns background tasks, and it’s something that only occurred to me last night. Having two images based off the same codebase feels like bad practice, but there are lots of important jobs that are done in the background and must report their progress and update the database with newly-discovered info. Here’s a use-case: When a user uploads a piece of audio, Podiant runs a number of jobs as part of what I call a workflow. For example:

  • The “convert to 96kbps and remove all metadata” job always needs to be run
  • The “add artwork to MP3 file” job doesn’t need to be run if the podcast has no artwork
  • The “add chapters to MP3” job needs to be run if the user wants to add chapters
  • The “create waveform image” job always needs to be run
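As a rough sketch of how that kind of conditional job selection might look in Python: the Upload class, its has_artwork and wants_chapters flags and the job identifiers are all names I've made up for illustration, not Podiant's actual code.

```python
from dataclasses import dataclass


@dataclass
class Upload:
    """Hypothetical description of an uploaded piece of audio."""
    has_artwork: bool
    wants_chapters: bool


def jobs_for(upload):
    """Return the ordered list of job names a workflow should run."""
    # Conversion always runs.
    jobs = ["convert_audio"]

    # Artwork can only be embedded if the podcast actually has some.
    if upload.has_artwork:
        jobs.append("add_artwork")

    # Chapters are opt-in.
    if upload.wants_chapters:
        jobs.append("add_chapters")

    # The waveform image always runs.
    jobs.append("create_waveform")
    return jobs
```

Calling jobs_for(Upload(has_artwork=False, wants_chapters=True)) would skip the artwork job but keep the other three.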

When each job in a workflow is finished, it reports its status in realtime (via Pusher), and when all the jobs are completed, the database is updated with the episode MP3 URL, the duration, the filesize and a graphical representation of the waveform. If a workflow like this were to run in isolation in a distributed system, it would need Django and a copy of the episode and podcast models, so it could make those database changes. But that’s icky, so I’m looking to make workflows real-world models (instead of temporary objects stored in a cache) and allow them to store arbitrary data in JSON format, which would represent the updates that need to be made to the database.

When the workflow is completed, it POSTs to the workflow’s API endpoint (something like ), with the JSON data representing the changes to be applied. The API then naively applies those changes to the database, and can handle the realtime reporting.
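To make the "naively applies" part a bit more concrete, here's a minimal sketch in plain Python, with no Django involved. The Episode class and its field names are stand-ins I've invented for the real model, and the check against known field names is my own addition rather than something already in the plan.

```python
import json
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class Episode:
    """Hypothetical stand-in for the real episode model."""
    mp3_url: Optional[str] = None
    duration: Optional[int] = None
    filesize: Optional[int] = None
    waveform: Optional[list] = None


# Only fields that exist on the model may be updated (an assumption, see above).
ALLOWED_FIELDS = {f.name for f in fields(Episode)}


def apply_changes(episode, payload):
    """Apply a JSON object of field/value pairs to an episode."""
    changes = json.loads(payload)
    for name, value in changes.items():
        if name not in ALLOWED_FIELDS:
            raise ValueError("Unknown field: " + name)
        setattr(episode, name, value)
    return episode
```

The workflow itself never touches the model. It only produces the JSON, and something like apply_changes() on the API side decides what to do with it.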

This way, the workflows can be separate processes run in isolation, that the Django API doesn’t technically need to be aware of (it just needs to know the fully-qualified name of the workflow, or even a URL to the workflow’s endpoint, AWS Lambda style). The workflows can be lighter as they just need to be Python scripts that take in an input and return a serialisable object.
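Here's one way the "fully-qualified name" idea could work, sketched with nothing but the standard library. The resolve() and run_workflow() helpers are hypothetical, and I'm using os.path.splitext as a stand-in workflow purely because it's importable anywhere; a real workflow would be one of my own scripts.

```python
import importlib
import json


def resolve(dotted_name):
    """Turn a name like 'package.module.function' into the callable itself."""
    module_name, _, attr = dotted_name.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, attr)


def run_workflow(dotted_name, payload):
    """Call the named workflow with its input and serialise what it returns."""
    workflow = resolve(dotted_name)
    result = workflow(payload)
    # json.dumps raises TypeError if the workflow returns something unserialisable.
    return json.dumps(result)


# Stand-in workflow: split a filename into its name and extension.
print(run_workflow("os.path.splitext", "episode.mp3"))
```

The API only ever holds on to the string; the import happens wherever the workflow actually runs.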

There are two problems for which I’m not yet sold on a solution: one is how to manage scheduled tasks (cron jobs), and the other is the podcast sites themselves. Cron jobs will probably end up being easier, but in an ideal world I’d love to be able to spin up a container that runs the job and then is destroyed, similar to how Heroku handles scheduled tasks. The harder problem is the frontend site for podcasts.

Ember and on-the-fly templates

The ability for Podiant users to edit their sites’ theme is currently in private beta, but it’s a really powerful feature that gives people immense flexibility. This is facilitated by a sort-of API and a lot of custom JavaScript. When a page is requested by a browser, a bunch of <script> tags are spat out into the HTML page, which contain all the templates needed to render the site. The server also gives the page a JavaScript object containing all the info needed to render the current template. Subsequent pages are requested via AJAX, so no new HTML is downloaded, only new JavaScript.

The templating is all done using Handlebars.js, which, happily, is the same library used by Ember.js. In Ember, however, those templates are compiled into JavaScript and converted into DOM elements, so I’ll need to find out if it’s possible to render templates on-the-fly instead of baking them into the app.

The game plan

So, after a day of running Ember.js’s development server in Docker, I have a piece of advice for you:

Don’t run Ember.js’s development server in Docker.

You’re welcome.

— Mark Steadman (@iamsteadman) March 6, 2018

I wanted to start with something fairly inconsequential so I could tease out some of the harder problems before diving deeper in, so as of today I have an Ember app running the marketing pages (what I call the brochure site) and a Dockerised API server running the Django REST Framework, which was absurdly easy to get going thanks to a cookiecutter template.

The Ember app talks to the API to get posts from the Podiant blog, and a list of podcasts for the directory pages. The API is based off the original codebase, but with everything except what is needed to run the API stripped out, and no database modifications (that way I’ll be able to build the dashboard and beta test it with users on the production database).

The next thing

There’s a bit of tidying to do with the directory, but the next big thing will be user management (signup, login and logout). I’d like to refactor the signup process so that users can sign up and create a podcast in what feels like one step. I expect I’ll be using Djoser for this, as it provides a RESTful backbone for authentication and signup.

The challenge will probably be securing the API such that it can run via the website but can’t be accessed directly. I have a very long road ahead of me, but I hope that by tackling some of the easier problems first - and trying to do every step right - I’ll be better set to tackle the harder problems in a few weeks. It’s a long road; I just really, really hope it’s worth it, because while it’s fun to play with new toys and development patterns, it’s got to be for something, and I’m banking on it meaning that Podiant can go further and do more amazing things in the future, with a robust, stable but flexible infrastructure. Wish me luck, and if you have any thoughts or you think there’s a better way to skin any of the above cats (ew), let me know.
