The MGDO Stack

November 20, 2022

Over the past year, I’ve refined a stack for my personal projects that has been productive and fun.

Makefile to produce a sqlite database
Github Actions as an ETL and scraping platform
Datasette as a public data warehouse
Observable for data analysis and visualisation

Makefile for a sqlite Database

For each dataset, I’ll make a repository that turns source data into a sqlite database with a single make command. The repositories follow this template.

csvkit, sqlite-utils, and csvs-to-sqlite are often the workhouses of the ETL code. Thanks Christopher Groskopf, James McKinney and Simon Willison!

Here are some examples:

Github Actions for ETL and Scraping

GitHub Actions is almost the perfect platform for running ETL jobs and web scraping. It has just about everything you could want.

Schedulable jobs
Execution on demand
Red / Green dashboard for job success
Email notifications if something goes wrong
Execution lives next to code
Serverless
Simple management of secrets
Storage of large artifacts
Parallelism
A large user base
Free! (For public repositories)

The only real limitation I’ve run into is that execution time for a single job is limited to six hours, which can be constraining for large scrapes. Getting around this can take some creativity. Often the best solution is to split the job into smaller bites and run many parallel jobs.

Another small challenge is what to do with the large artifacts produced by an ETL. What I do is manually create a release on the github repository, and then use this github action to stuff the artifacts in the release. I bet I could smooth this over if I wrote a custom Github Action but I haven’t tried yet.

The limit on artifact size attached to a release is 100Gb which has been quite enough so far.

Here’s how I set up the Github Actions script.

Private Repositories

For private jobs, GitHub you get 2000-3000 minutes of execution time for free a month depending on your account type, and then Github charges $0.008/per minute after that.

That can get expensive, but GitHub allows you to dispatch github action jobs on your servers. Azure spot instances + cirun.io makes intensive use of GitHub actions on private repositories affordable.

That GitHub is owned by Microsoft, and that I can pay for GitHub actions and also have an option to pay someone else for the server-time are all some comfort on persistence of the service.

Datasette as a public data warehouse

If you are building things for the web, you need to take extraordinary care to prevent users of your website from making arbitrary queries against your database. The core conceit of Simon Willison’s Datasette project is “What if you didn’t?”

Datasette allows unauthorized users to make arbitrary SELECT queries against sqlite databases, and that ends up being a really powerful thing to do.

I use it to collect all the sqlite databases that I build into a publicly accessible data warehouses. Folks can ask their own questions of the data, share queries, or download the entire databases.

To my mind, the most important feature of Datasette is that for any query, you can get the results back as JSON. This means the websites provides an JSON API that uses SQL directly. It’s amazing.

I have GitHub Actions that run nightly to collect all the databases and pushes the data and code to Google Cloudrun, a scale-to-zero platform. I have CloudFlare set up in front of that, so I’m able to host and serve and 10s of Gb of data a month for less than $5/month.

Here’s what the Github Actions file looks like for the labordata.bunkum.us warehouse.

Observable for data analysis and visualisation

Observable is a lyrical platform for writing JavaScript notebooks for data analysis and visualisation. It has excellent support for working with databases and Datasette instances (using the JSON API I mentioned above).

Many of this notebooks are updated automatically, as the GitHub actions creates updated databases, which are pulled into the Datasette warehouses.

Being able to do arbitrarily complicated SQL queries across multiple tables and then working with the the analysis and visualisation all on a reactive front-end is very, very to fast to build.

Here are some examples:

I’m a bit worried about the free lunch ending with Observable some day, but for now it’s a pleasure.