Salesforce is just a really weird web framework

2025-12-18T00:00:00+00:00

In the past year, I had to work with Salesforce for the first time as a long-time web developer.

It took me a while to understand what it actually was as a piece of technology, and when I did understand, I was horrified.

Salesforce says that it is a CRM (or Customer Relationship Management software), but that means almost nothing.

Salesforce comes by default with some core “objects” like “Account” (a business-like thing) and “Contact” (a person-like thing). There are also default relationships set up between the core objects, so that you can represent that a particular contact works for some business. There are also pages set up so that you can see lists of these objects, see the details of a particular object, and create new instances of objects.

These default settings presumably go some distance in helping capture the information that a sales team often wants when selling things to businesses.

However, that default functionality is not really what Salesforce is.

Here’s the thing. You can create new types of objects or change the fields that exist on the existing core objects. You can define new relationships between objects. You can, within broad limits, decide what information appears on the listing and detail pages for objects.

If you are a web developer, then this is just a variation of MVC, where objects are Models, pages are Views, and the Controller logic is smeared around different configuration locations.

What that means is that you can build web applications of arbitrary complexity within Salesforce, mainly through their web interface.

A web application of arbitrary complexity that’s not really under version control, which is not really fully testable, and which really only works within a single, proprietary environment. Yikes!

On the other hand, it’s serverless. Also, most of the security issues are Salesforce’s problems, not yours.

Politically, Salesforce has other advantages. For organizations, building the web application in Salesforce often only requires the unit to get authorization to purchase a single service, whereas building the application in a normal programming language and deploying it to normal servers would require coordinating and collaborating with the IT department. Additionally, the build within Salesforce can sometimes be characterized as OpEx instead of CapEx, which can sometimes be helpful.

So Salesforce is a way of building web applications without fully acknowledging that’s what you are doing. It’s an impressive technical achievement.

Snakemake for PDF text extraction is pretty pleasant

2024-03-22T00:00:00+00:00

For Chicago Councilmatic, we’ve wanted to experiment with using a large language model to write abstracts for the bills.

To do that, we needed the text of the legislation, which is published as PDFs and Microsoft Word files. Text extraction for the Word files is pretty easy, but text extraction for PDFs is not.

In our experience, the least maddening way to get text out of PDFs is to turn each page of the PDF into an image, use an OCR tool like tesseract to turn that image into text, and then recombine that text back into a single file. OCR is a compute intensive task, so we need to parallelize that task to get good throughput.
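The three steps above (split into page images, OCR each page, recombine) can be sketched in plain Python before reaching for a workflow tool. This is only a minimal sketch, assuming `pdftoppm` and `tesseract` are installed and on the path; the function names and the `workers` parameter are made up for illustration.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def pdftoppm_command(pdf_path, image_dir, dpi=150):
    """Command that renders each page of a PDF to image_dir/page-N.ppm."""
    return ["pdftoppm", "-r", str(dpi), str(pdf_path), str(Path(image_dir) / "page")]


def tesseract_command(image_path, dpi=150):
    """Command that OCRs one page image and prints the text to stdout."""
    return ["tesseract", "-l", "eng", "--dpi", str(dpi), str(image_path), "stdout"]


def ocr_page(image_path):
    """Run tesseract on a single page image and return the recognized text."""
    result = subprocess.run(
        tesseract_command(image_path), check=True, capture_output=True, text=True
    )
    return result.stdout


def pdf_to_text(pdf_path, image_dir, workers=4):
    """Split a PDF into page images, OCR the pages in parallel, rejoin the text."""
    Path(image_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(pdftoppm_command(pdf_path, image_dir), check=True)
    # pdftoppm zero-pads page numbers, so a plain sort keeps page order.
    pages = sorted(Path(image_dir).glob("page-*.ppm"))
    # Threads are enough here: the heavy work happens in the tesseract subprocesses.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        texts = pool.map(ocr_page, pages)  # map() preserves input order
        return "\n".join(texts)
```

A script like this redoes every page on every run, which is the part a build tool takes care of.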

I’ve written a data pipeline to do that before in a Makefile, and it was hard to write and even harder to read. This time, we wrote it using snakemake and it was much, much better.

Below is the heavily annotated Snakefile.

# snakemake defines a domain specific language (DSL), but 
# everything that it does not parse as part of that DSL, 
# it interprets as normal python.
import csv
import pathlib


def text_files(wildcards):
    """
    As the result of some process, we have a CSV with
    the urls of documents we need to download and
    process. So, we make a list of the text files that
    we will ultimately produce. If the original document
    is called "example.pdf," we will want to produce a
    text file called "example.pdf.txt".
    """

    with open('urls.csv') as f:
        reader = csv.DictReader(f)
        file_name = [row["url"] for row in reader]

    return expand("{file_name}.txt", file_name=file_name)


# This is the default target rule. Running Snakemake will try 
# to extract the text from every document in `urls.csv`
rule all:
    input: text_files


# We have to handle both docx and pdf documents. This is 
# the rule for the docx files. Notice the 
# wildcard_constraint which is just regex.
rule to_text_docx:
    output: "{source_name}.txt"
    input: "{source_name}"
    wildcard_constraints:
        source_name=r"[a-z0-9-]+\.docx"
    shell:
        """
        pandoc -i {input} -t plain > {output}
        """


def aggregate_texts(wildcards):
    """
    To process the pdfs, we will turn every page of
    the pdf into a separate image, OCR that image,
    and then recombine the text files. This function
    gets the names of the individual page files
    (which we can't know until we turn the pdf into a
    bunch of page-images) and turns those into the
    names of files we will use as dependencies for
    recombining into a single text file.

    Notice that there are *two* kinds of wildcards in
    this text transformation, which is very ugly to do
    within a Makefile.
    """

    image_directory = pathlib.Path(checkpoints.to_images.get(**wildcards).output[0])
    # The doubled braces survive the f-string as "{page_num}",
    # which expand() then fills in for every page.
    files = expand(
        f"text/{wildcards.source_name}/page-{{page_num}}.txt",
        page_num=glob_wildcards(image_directory / "page-{page_num}.ppm").page_num,
    )
    return sorted(files)


rule to_text_pdf:
    output: "{source_name}.txt"
    input: aggregate_texts
    wildcard_constraints:
        source_name=r"[a-z0-9-]+\.(pdf|PDF)"
    shell:
        """
        cat {input} > {output}
        """

rule tesseract:
    output: "text/{source_name}/page-{page_num}.txt"
    input: "images/{source_name}/page-{page_num}.ppm"
    shell:
        """
        mkdir -p text/{wildcards.source_name}
        tesseract -l eng --dpi 150 {input} text/{wildcards.source_name}/page-{wildcards.page_num} txt
        """

# This is the rule that actually turns the PDF into a 
# bunch of images. Notice that it is a "checkpoint" not 
# a "rule." This is how Snakemake allows you to do 
# dynamic dependencies. Also notice that the output is a 
# directory, which is a kind of target Makefiles do not 
# always handle well.
checkpoint to_images:
    output: directory("images/{source_name}/")
    input: "{source_name}"
    wildcard_constraints:
        source_name=r"[a-z0-9-]+\.(pdf|PDF)"
    shell:
        """
        mkdir {output}
        pdftoppm -r 150 {input} {output}/page
        """

This is still complex, but much clearer than the equivalent Makefile.

As of the posting date of this article, you should use a previous version of snakemake.

pip install snakemake==7.32.4 PuLP==2.3.1

The developers of snakemake recently completed a major refactor, and some of the checkpoint handling has had regressions. I’m sure it will be fixed soon.

It’s hard to express a fan-out data flow in Makefiles

2024-03-13T00:00:00+00:00

We’ve been using Makefiles for our reproducible data builds at DataMade for years, and it’s been okay.

The data workflows that are hardest to express in Makefiles are fan-out flows, which are unfortunately very common.

Here’s a simple example of a fan-out (actually a fan-out and fan-in or scatter-gather flow): download a zip file that contains many CSVs, remove some unneeded lines from each CSV, and then stack all the processed CSVs into a single large file.
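To make the shape of that flow concrete, here it is as an ordinary Python script using only the standard library. It is a minimal sketch: the function names are made up, and it assumes every CSV has three junk lines above a shared header row.

```python
import csv
import zipfile
from pathlib import Path


def unzip(zip_path, data_dir):
    """Fan-out: extract the archive and return the CSVs it contained."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(data_dir)
    return sorted(Path(data_dir).glob("*.csv"))


def trim(csv_path, skip=3):
    """Drop the first `skip` lines of one CSV and return the remaining lines."""
    with open(csv_path, newline="") as f:
        return f.readlines()[skip:]


def stack(csv_paths, out_path, skip=3):
    """Fan-in: write one header, then the data rows of every trimmed CSV."""
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        for i, path in enumerate(csv_paths):
            rows = list(csv.reader(trim(path, skip)))
            # Keep the header row from the first file only.
            writer.writerows(rows if i == 0 else rows[1:])
```

The script never needs to know the file names ahead of time, because `unzip` returns them at run time. That run-time discovery is exactly what a static dependency graph struggles to express.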

In a Makefile, this can be okay if you know the names of all the files ahead of time, but if you don’t, you have to write a recursive Makefile. Besides being a bit of a boggle, the Makefile no longer represents a real dependency between creating the data directory and creating the final data. Instead, you have to call targets in the right order in the PHONY target.

.PHONY: all
all :
	$(MAKE) data
	$(MAKE) complete.csv

# Stack all the trimmed CSVs into single file
complete.csv : $(patsubst data/%,%.trimmed,$(wildcard data/*.csv))
	csvstack $^ > $@

# Each CSV needs a few lines trimmed from the top of the file
%.csv.trimmed : data/%.csv
	tail -n +4 $< > $@

# Unzip a bunch of CSVs
data : data.zip
	unzip $< -d $@

Snakemake improves on this a bit with its checkpoint syntax. It is better than a recursive call, but here, too, we can’t clearly read off the whole chain of dependency relations.

import os


def csv_dependencies(wildcards):
    output_dir = checkpoints.unzip.get(**wildcards).output[0]
    file_names = expand("{name}.csv.trimmed", 
                        name = glob_wildcards(os.path.join(output_dir, "{name}.csv")).name)
    return file_names

rule all:
    input: csv_dependencies
    output: "complete.csv"
    shell: "csvstack {input} > {output}"

rule:
    input: "data/{name}.csv"
    output: "{name}.csv.trimmed"
    shell: "tail -n +4 {input} > {output}"

checkpoint unzip:
    input: "data.zip"
    output: directory("data")
    shell: "unzip {input} -d {output}"

One reason it’s hard to express the dependency between the CSVs in the data directory and data.zip is that we ultimately need to resolve the files that complete.csv depends on. There can be many intermediate steps between fanning out the data into the data directory and fanning it back in to complete.csv, and so we might be creating a very long-distance reference, where it would be better to have something more local.

Tupfiles’ bottom-up syntax is better for this. Assuming that the data directory already exists, a Tupfile might look like this:

: foreach data/*.csv |> tail -n +4 %f > %o |> %B.csv.trimmed
: *.csv.trimmed |> csvstack %f > %o |> complete.csv

I was not able to get tup running without setting up FUSE, a kernel extension, which I did not want to do. But from reading the docs, I don’t think there’s a simple way to express the extraction of the CSVs from the zipfile. I’m not even sure it’s possible to run a single Tupfile command and build everything.

It makes sense that build systems that were originally and primarily built for building programs do not handle fan-outs well. Fan-outs show up rarely when creating programs.

But fan-outs are very common for data builds, and data builds are very common, and it would be very nice to have a syntax that handled them elegantly.

If you know of a build system that does this well, please let me know!

(I know Airflow and Prefect and their sistren exist. I’m looking for a build system that a reasonable person would actually use on a small, one-off data build)

Addendum: The problem I’m describing here is a species of “dynamic dependency,” described well in the lovely paper, “Build Systems à la Carte: Theory and Practice”.

A sqlite data layer for dedupe?

2023-01-11T00:00:00+00:00

Right now, the dedupe library ultimately expects data to be represented as a stream of Python dictionaries. This design decision has made the library very flexible, since it does not need to know anything in particular about how the data is originally stored.

However, this design has two important costs. First, it substantially limits the places where dedupe can profitably use parallel processing to take advantage of multiple CPUs. Second, many of the operations of the library could be done much faster if it was able to know more about and therefore cooperate more effectively with the data layer.

If the library was built with the expectation that the data was stored in a sqlite database, the base library could significantly increase the scale of data that could be processed, into the tens of millions of records.

Benefits of a sqlite data layer

Costs of interprocess-communication

Blocking is the clearest example of a problem that should be easy to do in parallel, but one where we get no advantage from parallel processing with our current architecture.

Basically, blocking is applying a kind of hash function to thousands or millions of records where order does not matter. The way that we do this now is to apply the blocking function to a stream of records represented as Python dictionaries, effectively:

block_keys = map(blocking_function, data_stream)

which can be easily parallelized as

import multiprocessing

pool = multiprocessing.Pool(NUM_PROCESSES)

block_keys = pool.imap_unordered(blocking_function,
                                 data_stream)

But this ends up not being very useful, because of interprocess communication.

Basically, the multiprocessing imap works like this: the parent process pulls a chunk of data from the data stream, pickles it to a byte string, and sends the pickled data over a socket to a child process. The child process listens on the socket, unpickles the chunk, applies the blocking function, pickles the resulting block keys, and sends that byte string back to the parent process, which, finally, deserializes it.

If the actual application of the blocking function is computationally cheap, then all the benefits of having more than one core working on the problem are overwhelmed by the overhead of all that serializing and deserializing.

If the data is stored in a relational database, we could parallelize by having each process separately connect to the database, pull their own chunk of data from the database, calculate the block keys and write those block keys for the chunk to the database. While we still need to effectively serialize/deserialize in our communication with the database, the number of times we need to pay for that overhead can be hugely reduced and the per-call overhead will also typically be much smaller than pickling/unpickling.
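A minimal sketch of that worker model follows, using the standard library’s sqlite3 and multiprocessing modules. It assumes a `data` table with integer `record_id` and text `address` columns and an empty `block_keys` table. The function names, chunk size, and stand-in blocking rule are hypothetical, not dedupe’s API.

```python
import multiprocessing
import sqlite3


def blocking_function(address):
    """Stand-in blocking rule: the first seven characters of the address."""
    return address[:7]


def block_chunk(args):
    """Worker: open our own connection, block one id range, write the keys."""
    db_path, low, high = args
    con = sqlite3.connect(db_path, timeout=30)  # wait if another worker is writing
    try:
        rows = con.execute(
            "SELECT record_id, address FROM data WHERE record_id >= ? AND record_id < ?",
            (low, high),
        ).fetchall()
        keys = [(blocking_function(address), record_id) for record_id, address in rows]
        with con:  # one transaction per chunk
            con.executemany(
                "INSERT INTO block_keys (key, record_id) VALUES (?, ?)", keys
            )
        return len(keys)
    finally:
        con.close()


def block_in_parallel(db_path, max_id, chunk_size=10_000, processes=4):
    """Parent: hand out id ranges only; records never cross the process boundary."""
    chunks = [
        (db_path, low, low + chunk_size) for low in range(0, max_id + 1, chunk_size)
    ]
    with multiprocessing.Pool(processes) as pool:
        return sum(pool.imap_unordered(block_chunk, chunks))
```

The only things pickled here are three small values per chunk on the way out and one integer on the way back. sqlite allows a single writer at a time, so the workers still queue up briefly when they insert.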

Pushing operations to the data layer

If we can build on a relational database, then some of the work that the library is currently doing fairly slowly in Python could be done much more quickly in the data layer.

Blocking, again, is a clear example.

Many of our blocking functions are very simple. For example, take the first seven characters of the address field. Even with a good parallel processing model, it will be very hard for a Python solution to beat

INSERT INTO block_keys (key, record_id)
SELECT
    substring(address, 1, 7),
    record_id
FROM
    data;
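Here is the same idea run end to end through Python’s sqlite3 module, as a small sketch. The sample addresses are made up, and the final self-join is just one assumed way the block keys might be used to produce candidate pairs.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE data (record_id INTEGER PRIMARY KEY, address TEXT)")
con.execute("CREATE TABLE block_keys (key TEXT, record_id INTEGER)")
con.executemany(
    "INSERT INTO data (record_id, address) VALUES (?, ?)",
    [
        (1, "1060 West Addison St"),
        (2, "1060 West Addison Street"),
        (3, "233 S Wacker Dr"),
    ],
)

# substr() is SQLite's long-standing spelling; substring() is a newer alias.
con.execute(
    """
    INSERT INTO block_keys (key, record_id)
    SELECT substr(address, 1, 7), record_id FROM data
    """
)

# Records that share a block key become candidate pairs.
pairs = con.execute(
    """
    SELECT a.record_id, b.record_id
    FROM block_keys AS a
    JOIN block_keys AS b ON a.key = b.key AND a.record_id < b.record_id
    """
).fetchall()
```

No record is ever turned into a Python object here. The database reads, blocks, and writes entirely on its own side.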

Collateral benefits

While the lowest level dedupe methods operate over streams of dictionaries, the high level API assumes that the data is represented as a Python dictionary, necessarily stored in memory. This requirement means that the users of the high level API face a limit on how much data they can process, as it has to fit in memory.

Moving to an architecture that requires the data to be stored in a relational database would remove that restriction.

Downsides of a sqlite data layer

Type Inflexibility

Python objects are very flexible, and dedupe has taken advantage of that by having comparators that work over numbers, strings, tuples, and sets.

Relational databases do not have that flexibility. Core sqlite supports floats, integers, strings, and bytes, and that’s basically it.

If we use sqlite as the database layer, then we will either lose the ability to compare some types of objects, have to use type adaptation, or have to represent the data more indirectly.

Type adaptation is another layer of serialization and deserialization on top of the serialization from sqlite data to python objects. Type adaptations are written in Python and can introduce significant overhead.
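For a sense of what type adaptation looks like with Python’s sqlite3 module, here is a minimal sketch that stores sets and tuples as JSON text. The declared column types `json_set` and `json_tuple` are made-up names, and JSON is just one possible encoding.

```python
import json
import sqlite3

# Adapters: Python -> sqlite. They run in Python for every value written.
sqlite3.register_adapter(frozenset, lambda value: json.dumps(sorted(value)))
sqlite3.register_adapter(tuple, lambda value: json.dumps(list(value)))

# Converters: sqlite -> Python. They are chosen by the declared column type.
sqlite3.register_converter("json_set", lambda raw: frozenset(json.loads(raw)))
sqlite3.register_converter("json_tuple", lambda raw: tuple(json.loads(raw)))

con = sqlite3.connect(":memory:", detect_types=sqlite3.PARSE_DECLTYPES)
con.execute(
    "CREATE TABLE data (record_id INTEGER, tags json_set, name_parts json_tuple)"
)
con.execute(
    "INSERT INTO data VALUES (?, ?, ?)",
    (1, frozenset({"bakery", "cafe"}), ("Jane", "Doe")),
)
row = con.execute("SELECT tags, name_parts FROM data").fetchone()
```

Both lambdas are called once per value, in Python, which is where the overhead comes from.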

Collection objects like tuples, arrays, and sets can also be represented in sqlite as normalized tables. This strategy would significantly increase the complexity of the code of serializing and deserializing a record from python to sqlite and back. On the other hand, if the data is represented in a normalized form in the database then some of the dedupe operations over collections could be pushed down into the database layer.

Type adaptation is probably the best near-term strategy.

Table definition generation

If we want to keep something like our existing API, then we need code to automatically convert a Python dictionary of records to a sqlite table, which is going to require inferring a SQL table definition from the Python data.

This is complex. There is some reasonable prior art to follow from pandas or sqlite-utils, but we would need to write our own, and it’s a boring and bug-prone piece of functionality.
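To give a feel for the problem, here is a deliberately small sketch of type inference over a dictionary of records. It assumes the data looks like `{record_id: {field: value}}`. The widening rules (mixed int and float becomes REAL, any other mix becomes TEXT) are one arbitrary choice, not what pandas or sqlite-utils do.

```python
import sqlite3

# Python type -> sqlite column type. Anything else is rejected in this sketch.
SQLITE_TYPES = {int: "INTEGER", float: "REAL", str: "TEXT", bytes: "BLOB"}


def infer_columns(records):
    """Map each field name to a sqlite type by scanning every record."""
    columns = {}
    for record in records.values():
        for field, value in record.items():
            if value is None:
                columns.setdefault(field, None)  # type still unknown
                continue
            if type(value) not in SQLITE_TYPES:
                raise TypeError(f"no sqlite type for {field!r}: {type(value).__name__}")
            sql_type = SQLITE_TYPES[type(value)]
            seen = columns.get(field)
            if seen is None:
                columns[field] = sql_type
            elif seen != sql_type:
                # Mixed int/float widens to REAL; any other mix falls back to TEXT.
                mixed_numbers = {seen, sql_type} == {"INTEGER", "REAL"}
                columns[field] = "REAL" if mixed_numbers else "TEXT"
    # Fields that were None in every record default to TEXT.
    return {field: sql_type or "TEXT" for field, sql_type in columns.items()}


def create_table_sql(table, columns):
    """Build a CREATE TABLE statement with quoted identifiers."""
    body = ", ".join(f'"{name}" {sql_type}' for name, sql_type in columns.items())
    return f'CREATE TABLE "{table}" (record_id PRIMARY KEY, {body})'
```

Even this toy version has to make decisions about missing values and mixed types, and it does not yet touch tuples, sets, or field names that collide with `record_id`.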

If we started our library with an existing sqlite table or pandas data frame, then we could sidestep much of this, but that would be a very big departure from the existing API and would also impose limits on type flexibility.

Supporting multiple database engines

While I would recommend that the core library be written with sqlite as the database, many users will want a different database engine.

If we support that, then that will be another layer of complexity to manage. Something like sqlalchemy may be able to help some, but we are going to need to write engine-specific queries to take advantage of the different facilities of different engines.

The MGDO Stack

2022-11-20T00:00:00+00:00

Over the past year, I’ve refined a stack for my personal projects that has been productive and fun.

Makefile to produce a sqlite database
Github Actions as an ETL and scraping platform
Datasette as a public data warehouse
Observable for data analysis and visualisation

Makefile for a sqlite Database

For each dataset, I’ll make a repository that turns source data into a sqlite database with a single make command. The repositories follow this template.

csvkit, sqlite-utils, and csvs-to-sqlite are often the workhorses of the ETL code. Thanks Christopher Groskopf, James McKinney and Simon Willison!

Here are some examples:

Github Actions for ETL and Scraping

GitHub Actions is almost the perfect platform for running ETL jobs and web scraping. It has just about everything you could want.

Schedulable jobs
Execution on demand
Red / Green dashboard for job success
Email notifications if something goes wrong
Execution lives next to code
Serverless
Simple management of secrets
Storage of large artifacts
Parallelism
A large user base
Free! (For public repositories)

The only real limitation I’ve run into is that execution time for a single job is limited to six hours, which can be constraining for large scrapes. Getting around this can take some creativity. Often the best solution is to split the job into smaller bites and run many parallel jobs.
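One simple way to split a scrape into smaller bites is to give each parallel job a shard number and let it pick its own slice of the work. This is a minimal sketch; the function name is made up, and how the shard number reaches the script (for example, from a job matrix through an environment variable) is left as an assumption.

```python
def shard(items, index, total):
    """Return the items that job number `index` (0-based) of `total` should handle."""
    if not 0 <= index < total:
        raise ValueError("index must be in range(total)")
    # Striding spreads work evenly even when the list is sorted by size or date.
    return items[index::total]
```

Every job runs the same code against the same full list, and the slices never overlap, so no coordination between jobs is needed.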

Another small challenge is what to do with the large artifacts produced by an ETL. What I do is manually create a release on the github repository, and then use this github action to stuff the artifacts in the release. I bet I could smooth this over if I wrote a custom Github Action but I haven’t tried yet.

The limit on artifact size attached to a release is 100Gb which has been quite enough so far.

Here’s how I set up the Github Actions script.

Private Repositories

For private repositories, you get 2,000-3,000 minutes of execution time for free a month from GitHub, depending on your account type, and then GitHub charges $0.008 per minute after that.

That can get expensive, but GitHub allows you to dispatch GitHub Actions jobs on your own servers. Azure spot instances + cirun.io make intensive use of GitHub Actions on private repositories affordable.

That GitHub is owned by Microsoft, that I can pay for GitHub Actions, and that I have the option to pay someone else for the server time all give me some comfort about the persistence of the service.

Datasette as a public data warehouse

If you are building things for the web, you need to take extraordinary care to prevent users of your website from making arbitrary queries against your database. The core conceit of Simon Willison’s Datasette project is “What if you didn’t?”

Datasette allows unauthorized users to make arbitrary SELECT queries against sqlite databases, and that ends up being a really powerful thing to do.

I use it to collect all the sqlite databases that I build into publicly accessible data warehouses. Folks can ask their own questions of the data, share queries, or download the entire databases.

To my mind, the most important feature of Datasette is that for any query, you can get the results back as JSON. This means the website provides a JSON API that uses SQL directly. It’s amazing.
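In practice, using that API means building a URL with the SQL in the query string. Here is a minimal sketch with the standard library; the helper name is made up, and any database name passed to it is a placeholder for one the instance actually serves.

```python
from urllib.parse import urlencode


def datasette_json_url(base_url, database, sql, **params):
    """Build a URL that runs `sql` against a Datasette database and returns JSON rows."""
    # _shape=array asks Datasette for a plain JSON list of row objects.
    query = {"sql": sql, "_shape": "array", **params}
    return f"{base_url.rstrip('/')}/{database}.json?{urlencode(query)}"
```

Extra keyword arguments become query-string values, which is how Datasette fills in named parameters such as `:year` in the SQL. The resulting URL can be fetched with `urllib.request.urlopen` or from a browser.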

I have GitHub Actions that run nightly to collect all the databases and push the data and code to Google Cloud Run, a scale-to-zero platform. I have Cloudflare set up in front of that, so I’m able to host and serve 10s of GB of data a month for less than $5/month.

Here’s what the Github Actions file looks like for the labordata.bunkum.us warehouse.

Observable for data analysis and visualisation

Observable is a lyrical platform for writing JavaScript notebooks for data analysis and visualisation. It has excellent support for working with databases and Datasette instances (using the JSON API I mentioned above).

Many of these notebooks are updated automatically, as the GitHub Actions create updated databases, which are pulled into the Datasette warehouses.

Being able to do arbitrarily complicated SQL queries across multiple tables and then work with the analysis and visualisation, all on a reactive front-end, makes things very, very fast to build.

Here are some examples:

I’m a bit worried about the free lunch ending with Observable some day, but for now it’s a pleasure.

January 28, 2022 - Weeknotes

2022-01-28T00:00:00+00:00

Labor Data

I want to do a little analysis that decomposes the decline of overall union density into sectoral changes in overall employment and within-sector changes in union density.

The Bureau of Labor Statistics has the data on union density broken down by sector, over time. They even have a pretty nice web API to get that data in convenient format.

Unfortunately, the web api does not have CORS set up, so I can’t pull that data into Observable. This is, unfortunately, a common problem of many government web APIs.

If an API does not have CORS set up, then that effectively blocks any other website from using that data without complicated workarounds. It negates the main purpose of setting up a web API in the first place.

In an overwrought digression, I’m setting up a proxy to the BLS API that sets the right CORS headers so that data from the BLS can be pulled into other sites.

Prompted by a suggestion by David Eads, the proxy is a Cloudflare worker. It was pretty easy to get set up, because Cloudflare already had a recipe for making a CORS proxy.

I’m adapting this recipe to specialize it for the BLS site and to cache queries so that I can share it with others without feeling worried that the project will put too big a load on the BLS’s API.
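The two jobs of the proxy, adding CORS headers and caching responses, can be sketched in a few lines of Python. This is not the Cloudflare recipe, which is JavaScript. The class name, header choices, and one-hour cache lifetime are all assumptions for illustration.

```python
import time

CORS_HEADERS = {
    "Access-Control-Allow-Origin": "*",
    "Access-Control-Allow-Methods": "GET, POST, OPTIONS",
    "Access-Control-Allow-Headers": "Content-Type",
}


def with_cors(headers):
    """Return a copy of the upstream response headers with CORS headers added."""
    return {**headers, **CORS_HEADERS}


class CachingProxy:
    """Serve repeated requests from memory so the upstream API sees each one once."""

    def __init__(self, fetch, ttl_seconds=3600, clock=time.monotonic):
        self.fetch = fetch  # callable: url -> (headers, body)
        self.ttl = ttl_seconds
        self.clock = clock
        self.cache = {}

    def get(self, url):
        cached = self.cache.get(url)
        if cached and self.clock() - cached[0] < self.ttl:
            return cached[1]
        headers, body = self.fetch(url)
        response = (with_cors(headers), body)
        self.cache[url] = (self.clock(), response)
        return response
```

The cache is what makes the proxy safe to share: no matter how many pages embed the same query, the upstream API is asked at most once per cache lifetime.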

I have the guts of it working, and got the code on Github. I need to write a landing page, refactor the code a little bit, add some docs, and then I’ll be good to push it out.

January 21, 2022 - Weeknotes

2022-01-21T00:00:00+00:00

Dedupe

I did a lot of gardening of the dedupe library, and cut the 2.0.9 version of the library. The big enhancement in this release was refactoring the parallelization of scoring pairs.

Previously, worker processes pulled chunks of record pairs off a queue, scored them, and wrote the results to a memmapped numpy array. Each chunk got its own array backed by a separate file. Then, the worker would put the name of the file into a result queue.

A collector process would read those filenames from the result queue, open those files, and copy the content into a single, big memmapped numpy array.

I got rid of that collector process and now the scoring workers write directly to the same memmapped numpy array. The workers need to know where in the array they should write, and they communicate that through a shared memory value, with a built-in locking mechanism.
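The coordination piece can be sketched with a shared counter from the multiprocessing module. This is a minimal illustration of the idea, not the code in the dedupe library, and the name `reserve` is made up.

```python
import multiprocessing


def reserve(offset, n_rows):
    """Atomically claim `n_rows` consecutive rows; return the index of the first."""
    with offset.get_lock():
        start = offset.value
        offset.value += n_rows
    return start


# The parent creates one shared counter and passes it to every worker.
offset = multiprocessing.Value("q", 0)  # "q" = signed 64-bit integer
```

A worker that has scored a chunk would call `reserve(offset, len(scores))` and then write its scores into the shared array starting at the returned index. The lock is held only while the counter is bumped, not during the write itself.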

Python’s parallelization paradigm is based on multiple processes. Inter-process communication can often cost more than the benefits of using multiple cores, and this change gets rid of two points of inter-process communication.

It also, in my opinion, makes it easier to follow what’s going on.

That said, it probably won’t change performance hugely, since we were just communicating filepaths before and those are quite small. But still, it’s a nice simplification.

Musing on even less communication

The real win in performance would be to have the workers produce the record pairs themselves. I’ve been thinking about that for a while, but have not found an elegant way to do it.

The key problem is how to get workers access to the data. By default, we represent the record set as a Python dictionary. This dictionary can be quite big, and so we don’t want to make a copy for each process because that could take up too much memory.

For the scoring workers, this dictionary should be read-only. If we were creating processes through forks, then we could set up the workers so they all had access to the same dictionary (i.e. same object in memory).

However, forking is not available on Windows and is no longer the default on macOS, so on those platforms parent memory is not shared in this way.

So, if we can’t use shared memory, then the second best alternative is to hold the data in a database that each process can access independently.

There’s a lot that’s attractive about making that move. We are already using sqlite extensively in the library.

In order to not radically change the API, we would need to be able to load the records from a data dictionary into sqlite tables. The problem is that right now the fields of a record can be of any type, and sqlite does not have a very rich set of native types.

We have comparators for string, integer, and float fields. Sqlite has native support for those. We also have comparators for tuples and sets, and we allow people to create arbitrary comparators that could take arbitrary types.

The json1 extension might be a reasonable way of handling tuples and sets. I could also potentially handle this by normalizing these array-ish values into a separate table.

It might be acceptable to limit the types that custom comparators can handle to strings, integers, floats, and arrays of those types.

Hmm… maybe it wouldn’t be too bad.

If we took this step, it would open us up to moving some of the blocking and even record comparison into sqlite, which could be quite nice.

Remarkable 2

I asked twitter for recommendations for e-readers that handled pdfs of articles well. The resounding recommendation was for a Remarkable 2. I ordered one with my DataMade annual office-equipment stipend, and it came on Thursday. So far, I really like it. It’s a cool, in a McLuhan sense, piece of technology.