It's hard to express a fan-out data flow in Makefiles

March 13, 2024

We’ve been using Makefiles for our reproducible data builds at DataMade for years, and it’s been okay.

The data workflows that are hardest to express in Makefiles are fan-out flows, which are unfortunately very common.

Here’s a simple example of a fan-out (actually a fan-out and fan-in or scatter-gather flow): download a zip file that contains many CSVs, remove some unneeded lines from each CSV, and then stack all the processed CSVs into a single large file.
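To make the shape of the flow concrete, here is a minimal sketch of it as a plain Python script, outside any build system. The function name `scatter_gather` and the `skip_lines` parameter are made up for illustration, and the sketch assumes every CSV is UTF-8 and shares the same header once the unneeded lines are removed.

```python
import csv
import io
import zipfile


def scatter_gather(zip_path, out_path, skip_lines=3):
    """Drop the first `skip_lines` lines of each CSV in the archive, then stack them."""
    with zipfile.ZipFile(zip_path) as archive, open(out_path, "w", newline="") as out:
        # Fan-out: the list of CSVs is only known once the archive is opened.
        names = sorted(n for n in archive.namelist() if n.endswith(".csv"))
        writer = csv.writer(out)
        header_written = False
        for name in names:
            with archive.open(name) as raw:
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                lines = text.readlines()[skip_lines:]
            rows = list(csv.reader(lines))
            if not rows:
                continue
            # Fan-in: write the shared header once, then each file's data rows.
            if not header_written:
                writer.writerow(rows[0])
                header_written = True
            writer.writerows(rows[1:])
    return names
```

A script like this has no trouble with the fan-out, because it simply looks inside the archive at run time. The difficulty is declaring the same flow to a build system, so that it can skip work that is already up to date.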

In a Makefile, this is manageable if you know the names of all the files ahead of time, but if you don’t, you have to write a recursive Makefile. The $(wildcard ...) prerequisites are expanded when make reads the Makefile, so the data directory must already exist by the time make is invoked for complete.csv. Besides being a bit of a boggle, the Makefile no longer represents the real dependency between creating the data directory and creating the final data. Instead, you have to call targets in the right order in the PHONY target.

.PHONY: all
all :
	$(MAKE) data
	$(MAKE) complete.csv

# Stack all the trimmed CSVs into single file
complete.csv : $(patsubst data/%,%.trimmed,$(wildcard data/*.csv))
	csvstack $^ > $@

# Each CSV needs a few lines trimmed from the top of the file
%.csv.trimmed : data/%.csv
	tail -n +4 $< > $@

# Unzip a bunch of CSVs
data : data.zip
	unzip $< -d $@

Snakemake improves on this a bit with its checkpoint syntax. It is better than a recursive call, but here, too, we can’t clearly read off the whole chain of dependency relations.

import os

# Input function: evaluated only after the unzip checkpoint has run
def csv_dependencies(wildcards):
    output_dir = checkpoints.unzip.get(**wildcards).output[0]
    file_names = expand("{name}.csv.trimmed", 
                        name = glob_wildcards(os.path.join(output_dir, "{name}.csv")).name)
    return file_names

rule all:
    input: csv_dependencies
    output: "complete.csv"
    shell: "csvstack {input} > {output}"

rule:
    input: "data/{name}.csv"
    output: "{name}.csv.trimmed"
    shell: "tail -n +4 {input} > {output}"

checkpoint unzip:
    input: "data.zip"
    output: directory("data")
    shell: "unzip {input} -d {output}"

One reason it’s hard to express the dependency between the CSVs in the data directory and data.zip is that we ultimately need to resolve the files that complete.csv depends on. There can be many intermediate steps between fanning the data out into the data directory and fanning it back in to complete.csv, so we might end up creating a very long-distance reference where it would be better to have something more local.

Tupfiles’ bottom-up syntax is better for this. Assuming that the data directory already exists, a Tupfile might look like this:

: foreach data/*.csv |> tail -n +4 %f > %o |> %B.csv.trimmed
: *.csv.trimmed |> csvstack %f > %o |> complete.csv

I was not able to get tup running without setting up FUSE, a kernel extension, which I did not want to do. But from reading the docs, I don’t think there’s a simple way to express the extraction of the CSVs from the zip file. I’m not even sure it’s possible to run a single Tupfile command and build everything.

It makes sense that build systems that were originally and primarily designed for building programs do not handle fan-outs well. Fan-outs rarely show up when building programs.

But fan-outs are very common for data builds, and data builds are very common, and it would be very nice to have a syntax that handled them elegantly.

If you know of a build system that does this well, please let me know!

(I know Airflow and Prefect and their sistren exist. I’m looking for a build system that a reasonable person would actually use on a small, one-off data build.)

Addendum: The problem I’m describing here is a species of “dynamic dependency,” described well in the lovely paper, “Build Systems à la Carte: Theory and Practice”.
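As a rough illustration of what “dynamic dependency” means here, the toy below models each build task as a function that receives a `fetch` callback, loosely in the spirit of the paper’s task abstraction. It is not the paper’s code: `ARCHIVE`, `task_for`, `stack`, and `build` are hypothetical names, and an in-memory dictionary stands in for the zip file. The point is that `stack` cannot name its inputs until it has fetched the value of `data`.

```python
# Stand-in for data.zip: file name -> lines (three unneeded lines, header, data)
ARCHIVE = {
    "a.csv": ["junk", "junk", "junk", "id,v", "1,x"],
    "b.csv": ["junk", "junk", "junk", "id,v", "2,y"],
}


def stack(fetch):
    # First dependency: the listing produced by "unzipping".
    names = fetch("data")
    # Dynamic dependencies: chosen from the *value* of the first dependency.
    trimmed = [fetch(name + ".trimmed") for name in names]
    header = trimmed[0][0]
    return [header] + [row for lines in trimmed for row in lines[1:]]


def task_for(key):
    """Map a target name to the task that builds it."""
    if key == "data":
        return lambda fetch: sorted(ARCHIVE)
    if key.endswith(".csv.trimmed"):
        name = key[: -len(".trimmed")]
        return lambda fetch: ARCHIVE[name][3:]
    if key == "complete.csv":
        return stack
    raise KeyError(key)


def build(key, store):
    """Build `key` at most once, recursively building whatever it fetches."""
    if key not in store:
        store[key] = task_for(key)(lambda dep: build(dep, store))
    return store[key]
```

Calling `build("complete.csv", {})` discovers the two trimmed targets along the way. A build system whose dependency graph has to be fully known before anything runs has no place to put that discovery step, which is why make needs a second invocation and Snakemake needs a checkpoint.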