<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="/feed/by_tag/tech.xml" rel="self" type="application/atom+xml" /><link href="/" rel="alternate" type="text/html" /><updated>2026-01-23T04:14:55+00:00</updated><id>/feed/by_tag/tech.xml</id><entry><title type="html">Salesforce is just a really weird web framework</title><link href="/2025/12/18/what-is-salesforce.html" rel="alternate" type="text/html" title="Salesforce is just a really weird web framework" /><published>2025-12-18T00:00:00+00:00</published><updated>2025-12-18T00:00:00+00:00</updated><id>/2025/12/18/what-is-salesforce</id><content type="html" xml:base="/2025/12/18/what-is-salesforce.html"><![CDATA[<p>In the past year, I had to work with Salesforce for the first time as a long-time web developer.</p>

<p>It took me a while to understand what it actually was as a piece of technology, and when I did understand, I was horrified.</p>

<p>Salesforce says that it is a CRM (or Customer Relationship Management software), but that means almost nothing.</p>

<p>Salesforce comes by default with some core “objects” like “Account” (a business-like thing) and “Contact” (a person-like thing). There are also default relationships set up between the core objects, so that you can represent that a particular contact works for some business. There are also pages set up so that you can see lists of these objects, see the details of a particular object, and create new instances of objects.</p>

<p>These default settings presumably go some distance in helping capture the information that a sales team often wants when selling things to businesses.</p>

<p>However, that default functionality is not really what Salesforce is.</p>

<p>Here’s the thing. You can create new types of objects or change the fields that exist on the existing core objects. You can define new relationships between objects. You can, within broad limits, decide what information appears on the listing and detail pages for objects.</p>

<p>If you are a web developer, then this is just a variation of MVC, where objects are Models, pages are Views, and the Controller logic is smeared around different configuration locations.</p>

<p>What that means is that you can build web applications of arbitrary complexity within Salesforce, mainly through their web interface.</p>

<p>A web application of arbitrary complexity that’s not really under version control, which is not really fully testable, and which really only works within a single, proprietary environment. Yikes!</p>

<p>On the other hand, it’s serverless. Also, most of the security issues are Salesforce’s problems, not yours.</p>

<p>Politically, Salesforce has other advantages. For organizations, building the web application in Salesforce often only requires the unit to get authorization for purchasing a single service, whereas building the application in a normal programming language and deploying it to normal servers would require coordination and collaboration with the IT department. Additionally, the build within Salesforce can sometimes be characterized as OpEx instead of CapEx, which can sometimes be helpful.</p>

<p>So Salesforce is a way of building web applications without fully acknowledging that’s what you are doing. It’s an impressive technical achievement.</p>]]></content><author><name>Forest Gregg</name></author><category term="tech" /><summary type="html"><![CDATA[The horrifying but ultimately admirable truth about what Salesforce really is.]]></summary></entry><entry><title type="html">Snakemake for PDF text extraction is pretty pleasant</title><link href="/2024/03/22/snakemake-text-extraction.html" rel="alternate" type="text/html" title="Snakemake for PDF text extraction is pretty pleasant" /><published>2024-03-22T00:00:00+00:00</published><updated>2024-03-22T00:00:00+00:00</updated><id>/2024/03/22/snakemake-text-extraction</id><content type="html" xml:base="/2024/03/22/snakemake-text-extraction.html"><![CDATA[<p>For <a href="https://chicago.councilmatic.org/">Chicago Councilmatic</a>, we’ve wanted to experiment with using a
large language model to write abstracts for the bills.</p>

<p>To do that, we needed the text of the legislation, which is published
as PDFs and Microsoft Word files. Text extraction for the Word files
is pretty easy, but text extraction for PDFs is not.</p>

<p>In our experience, the least maddening way to get text out of PDFs is
to turn each page of the PDF into an image, use an OCR tool like
<code class="language-plaintext highlighter-rouge">tesseract</code> to turn that image into text, and then recombine that text
back into a single file. OCR is a compute intensive task, so we need
to parallelize that task to get good throughput.</p>
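<p>The split, OCR, and recombine steps can be sketched in plain Python. This is only an illustration and not the pipeline we actually ran: the function names are made up, it assumes <code class="language-plaintext highlighter-rouge">pdftoppm</code> and <code class="language-plaintext highlighter-rouge">tesseract</code> are on the path, and it asks tesseract to write to standard output instead of to a file.</p>

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def ordered_pages(image_dir):
    """Return the page images sorted by their numeric page suffix."""
    pages = Path(image_dir).glob("page-*.ppm")
    return sorted(pages, key=lambda p: int(p.stem.split("-")[-1]))


def split_pdf(pdf_path, image_dir):
    """Render every page of the PDF to image_dir/page-N.ppm."""
    image_dir = Path(image_dir)
    image_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["pdftoppm", "-r", "150", str(pdf_path), str(image_dir / "page")],
        check=True,
    )
    return ordered_pages(image_dir)


def ocr_page(image_path):
    """OCR one page image and return the recognized text."""
    result = subprocess.run(
        ["tesseract", str(image_path), "stdout", "-l", "eng", "--dpi", "150"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def extract_text(pdf_path, image_dir, workers=4):
    """Split the PDF, OCR the pages in parallel, and join the text."""
    pages = split_pdf(pdf_path, image_dir)
    # Threads are enough here: each worker only waits on a tesseract process.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, so pages stay in sequence.
        return "".join(pool.map(ocr_page, pages))
```

<p>What this sketch lacks is exactly what a build tool provides: it redoes every page on every run, and it cannot resume after a failure.</p>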

<p>I’ve written a data pipeline to do that before in a Makefile, and it
was hard to write and even harder to read. This time, we wrote it
using <a href="https://snakemake.readthedocs.io/en/stable/"><code class="language-plaintext highlighter-rouge">snakemake</code></a> and
it was much, much better.</p>

<p>Below is the heavily annotated <code class="language-plaintext highlighter-rouge">Snakefile</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># snakemake defines a domain specific language (DSL), but 
# everything that it does not parse as part of that DSL, 
# it interprets as normal python.
import csv
import pathlib


def text_files(wildcards):
    """
    As the result of some process, we have a CSV with
    the urls of documents we need to download and
    process. So, we make a list of the text files that
    we will ultimately produce. If the original document
    is called "example.pdf," we will want to produce a
    text file called "example.pdf.txt".
    """

    with open('urls.csv') as f:
        reader = csv.DictReader(f)
        file_name = [row["url"] for row in reader]

    return expand("{file_name}.txt", file_name=file_name)


# This is the default target rule. Running Snakemake will try
# to extract the text from every document in `urls.csv`
rule all:
    input: text_files


# We have to handle both docx and pdf documents. This is 
# the rule for the docx files. Notice the 
# wildcard_constraint which is just regex.
rule to_text_docx:
    output: "{source_name}.txt"
    input: "{source_name}"
    wildcard_constraints:
        source_name="[a-z0-9-]+\.docx"
    shell:
        """
        pandoc -i {input} -t plain &gt; {output}
        """


def aggregate_texts(wildcards):
    """
    To process the pdfs, we will turn every page of
    the pdf into a separate image, OCR that image,
    and then recombine the text files. This function
    gets the names of the individual page files
    (which we can't know until we turn the pdf into a
    bunch of page-images) and turns those into the
    names of files we will use as dependencies for
    recombining into a single text file.

    Notice that there are *two* kinds of wildcards in
    this text transformation, which is very ugly to do
    within a Makefile.
    """

    image_directory = pathlib.Path(checkpoints.to_images.get(**wildcards).output[0])
    files = expand(
        f"text/{wildcards.source_name}/page-{{page_num}}.txt",
        page_num=glob_wildcards(image_directory / "page-{page_num}.ppm").page_num,
    )
    return sorted(files)


rule to_text_pdf:
    output: "{source_name}.txt"
    input: aggregate_texts
    wildcard_constraints:
        source_name="[a-z0-9-]+\.(pdf|PDF)"
    shell:
        """
        cat {input} &gt; {output}
        """

rule tesseract:
    output: "text/{source_name}/page-{page_num}.txt"
    input: "images/{source_name}/page-{page_num}.ppm"
    shell:
        """
        mkdir -p text/{wildcards.source_name}
        tesseract -l eng --dpi 150 {input} text/{wildcards.source_name}/page-{wildcards.page_num} txt
        """

# This is the rule that actually turns the PDF into a 
# bunch of images. Notice that it is a "checkpoint" not 
# a "rule." This is how Snakemake allows you to do 
# dynamic dependencies. Also notice that the output is a 
# directory, which is a kind of target Makefiles do not 
# always handle well.
checkpoint to_images:
    output: directory("images/{source_name}/")
    input: "{source_name}"
    wildcard_constraints:
        source_name="[a-z0-9-]+\.(pdf|PDF)"
    shell:
        """
        mkdir {output}
        pdftoppm -r 150 {input} {output}/page
        """
</code></pre></div></div>

<p>This is still complex, but much clearer than the equivalent Makefile.</p>
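<p>If the checkpoint function is hard to picture, here is roughly what <code class="language-plaintext highlighter-rouge">aggregate_texts</code> computes once the page images exist, written without any Snakemake helpers. The function name is hypothetical and this is a sketch of the idea, not of Snakemake’s internals.</p>

```python
import re
from pathlib import Path


def page_text_targets(source_name, image_directory):
    """Map each existing page image to the text file we want for it."""
    pattern = re.compile(r"page-(.+)\.ppm")
    targets = []
    for image in Path(image_directory).iterdir():
        match = pattern.fullmatch(image.name)
        if match:
            # The captured group plays the role of the page_num wildcard.
            targets.append(f"text/{source_name}/page-{match.group(1)}.txt")
    return sorted(targets)
```

<p>The important part is that the list can only be computed after the images have been written, which is why the rule that creates them has to be a checkpoint.</p>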

<p>As of the posting date of this article, you should use a previous version
of snakemake.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip install snakemake==7.32.4 PuLP==2.3.1
</code></pre></div></div>

<p>The developers of snakemake recently completed a major refactor, and some of the
checkpoint handling has had regressions, I’m sure it will be fixed soon.</p>]]></content><author><name>Forest Gregg</name></author><category term="tech" /><summary type="html"><![CDATA[A Snakefile for writing a PDF text extraction pipeline is a lot nicer than an equivalent Makefile.]]></summary></entry><entry><title type="html">It’s hard to express a fan-out data flow in Makefiles</title><link href="/2024/03/13/makefiles-fan-out.html" rel="alternate" type="text/html" title="It’s hard to express a fan-out data flow in Makefiles" /><published>2024-03-13T00:00:00+00:00</published><updated>2024-03-13T00:00:00+00:00</updated><id>/2024/03/13/makefiles-fan-out</id><content type="html" xml:base="/2024/03/13/makefiles-fan-out.html"><![CDATA[<p>We’ve been using Makefiles for our reproducible data builds at
DataMade for years, and it’s been okay.</p>

<p>The data workflows that are hardest to express in Makefiles are
fan-out flows, which are unfortunately very common.</p>

<p>Here’s a simple example of a fan-out (actually a fan-out and fan-in or
scatter-gather flow): download a zip file that contains many CSVs,
remove some unneeded lines from each CSV, and then stack all the
processed CSVs into a single large file.</p>
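<p>To make the shape of the flow concrete, here it is as a plain Python sketch with no build system at all. The function names are made up for illustration, and it assumes every CSV has three junk lines before a shared header row.</p>

```python
import csv
import zipfile
from pathlib import Path


def unzip(zip_path, data_dir):
    """Fan out: extract the archive and list the CSVs it contained."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(data_dir)
    return sorted(Path(data_dir).glob("*.csv"))


def trim(csv_path, skip=3):
    """Drop the first `skip` lines of one CSV and return the rest."""
    with open(csv_path, newline="") as f:
        return f.readlines()[skip:]


def stack(csv_paths, out_path):
    """Fan in: write one header, then the rows of every trimmed CSV."""
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        header_written = False
        for path in csv_paths:
            rows = csv.reader(trim(path))
            header = next(rows, None)
            if header is None:
                continue  # nothing left after trimming
            if not header_written:
                writer.writerow(header)
                header_written = True
            writer.writerows(rows)
```

<p>Notice that <code class="language-plaintext highlighter-rouge">stack</code> can only be called with the list that <code class="language-plaintext highlighter-rouge">unzip</code> returns. That ordering is trivial in a script and awkward in a declarative build file.</p>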

<p>This can be okay if you know the names of all the files ahead of
time, but if you don’t, you have to write a recursive
Makefile. Besides being a bit of a boggle, the Makefile no longer
represents a real dependency between creating the data directory and
creating the final data. Instead, you have to call targets in the
right order in the <code class="language-plaintext highlighter-rouge">PHONY</code> target.</p>

<div class="language-makefile highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">.PHONY</span><span class="o">:</span> <span class="nf">all</span>
<span class="nl">all </span><span class="o">:</span>
	<span class="p">$(</span>MAKE<span class="p">)</span> data
	<span class="p">$(</span>MAKE<span class="p">)</span> complete.csv

<span class="c"># Stack all the trimmed CSVs into single file
</span><span class="nl">complete.csv </span><span class="o">:</span> <span class="nf">$(patsubst data/%</span><span class="p">,</span><span class="nf">%.trimmed</span><span class="p">,</span><span class="nf">$(wildcard data/*.csv))</span>
	csvstack <span class="nv">$^</span> <span class="o">&gt;</span> <span class="nv">$@</span>

<span class="c"># Each CSV needs a few lines trimmed from the top of the file
</span><span class="nl">%.csv.trimmed </span><span class="o">:</span> <span class="nf">data/%.csv</span>
	<span class="nb">tail</span> <span class="nt">-n</span> +4 <span class="nv">$&lt;</span> <span class="o">&gt;</span> <span class="nv">$@</span>

<span class="c"># Unzip a bunch of CSVs
</span><span class="nl">data </span><span class="o">:</span> <span class="nf">data.zip</span>
	unzip <span class="nv">$&lt;</span> <span class="nt">-d</span> <span class="nv">$@</span>
</code></pre></div></div>

<p><a href="https://snakemake.readthedocs.io/en/stable/">Snakemake</a> improves on
this a bit with its checkpoint syntax. It is better than a recursive
call, but here, too, we can’t clearly read off the whole chain
of dependency relations.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import os


def csv_dependencies(wildcards):
    output_dir = checkpoints.unzip.get(**wildcards).output[0]
    file_names = expand("{name}.csv.trimmed", 
                        name = glob_wildcards(os.path.join(output_dir, "{name}.csv")).name)
    return file_names

rule all:
    input: csv_dependencies
    output: "complete.csv"
    shell: "csvstack {input} &gt; {output}"

rule:
    input: "data/{name}.csv"
    output: "{name}.csv.trimmed"
    shell: "tail -n +4 {input} &gt; {output}"

checkpoint unzip:
    input: "data.zip"
    output: directory("data")
    shell: "unzip {input} -d {output}"
</code></pre></div></div>

<p>One reason it’s hard to express the dependency between the CSVs in the
data directory and <code class="language-plaintext highlighter-rouge">data.zip</code> is that we need to ultimately resolve
the files that <code class="language-plaintext highlighter-rouge">complete.csv</code> depends on. There can be many
intermediate steps between fanning out the data into the <code class="language-plaintext highlighter-rouge">data</code> directory and fanning it back in to <code class="language-plaintext highlighter-rouge">complete.csv</code>, and so we might create a very long-distance
reference, where it would be better to have something more local.</p>

<p><a href="https://gittup.org/tup/">Tupfiles’</a> bottom-up syntax is better for
this. Assuming that the data directory already exists, a Tupfile might
look like this</p>

<pre><code class="language-tupfile">: foreach data/*.csv |&gt; tail -n +4 %f &gt; %o |&gt; %B.csv.trimmed
: *.csv.trimmed |&gt; csvstack %f &gt; %o |&gt; complete.csv
</code></pre>

<p>I was not able to get <code class="language-plaintext highlighter-rouge">tup</code> running without setting up FUSE, a kernel
extension, which I did not want to do. But from reading the docs, I
don’t think there’s a simple way to express the extraction of the CSVs
from the zipfile. I’m not even sure it’s possible to run a single
Tupfile command and build everything.</p>

<p>It makes sense that build systems that were originally and primarily built for building
programs do not handle fan-outs well. Fan-outs show up rarely when creating programs.</p>

<p>But fan-outs are very common for data builds, and data builds are
very common, and it would be very nice to have a syntax that handled
them elegantly.</p>

<p>If you know of a build system that does this well, please let me know!</p>

<p>(I know Airflow and Prefect and their sistren exist. I’m looking for a
build system that a reasonable person would actually use on a small,
one-off data build)</p>

<p><strong>Addendum</strong>: The problem I’m describing here is a species of “dynamic
dependency,” described well in the lovely paper, <a href="https://www.microsoft.com/en-us/research/uploads/prod/2020/04/build-systems-jfp.pdf">“Build Systems à la
Carte: Theory and
Practice”</a>.</p>]]></content><author><name>Forest Gregg</name></author><category term="tech" /><summary type="html"><![CDATA[Makefiles, Snakefiles, and Tupfiles don't let you express fan-out data well.]]></summary></entry><entry><title type="html">A sqlite data layer for dedupe?</title><link href="/2023/01/11/dedupe-sqlite.html" rel="alternate" type="text/html" title="A sqlite data layer for dedupe?" /><published>2023-01-11T00:00:00+00:00</published><updated>2023-01-11T00:00:00+00:00</updated><id>/2023/01/11/dedupe-sqlite</id><content type="html" xml:base="/2023/01/11/dedupe-sqlite.html"><![CDATA[<p>Right now, the <a href="https://github.com/dedupeio/dedupe">dedupe library</a>
ultimately expects data to be represented as a stream of Python
dictionaries. This design decision has made the library very flexible,
since it does not need to know anything in particular about how the
data is originally stored.</p>

<p>However, this design has two important costs. First, it substantially
limits the places where dedupe can profitably use parallel processing
to take advantage of multiple CPUs. Second, many of the operations of
the library could be done much faster if it was able to know more
about and therefore cooperate more effectively with the data layer.</p>

<p>If the library was built with the expectation that the data was stored
in a sqlite database, the base library could significantly increase
the scale of data that could be processed, into the tens of millions of records.</p>

<h2 id="benefits-of-a-sqlite-data-layer">Benefits of a sqlite data layer</h2>

<h3 id="costs-of-interprocess-communication">Costs of interprocess-communication</h3>

<p>Blocking is the clearest example of a problem that should be easy to
do in parallel, but one where we get no advantage from parallel
processing with our current architecture.</p>

<p>Basically, blocking is applying a kind of hash function to thousands or
millions of records where order does not matter. The way that we do
this now is to apply the blocking function to a stream of records
represented as Python dictionaries, effectively:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">block_keys</span> <span class="o">=</span> <span class="nf">map</span><span class="p">(</span><span class="n">blocking_function</span><span class="p">,</span> <span class="n">data_stream</span><span class="p">)</span>
</code></pre></div></div>

<p>which can be easily parallelized as</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="n">multiprocessing</span>

<span class="n">pool</span> <span class="o">=</span> <span class="n">multiprocessing</span><span class="p">.</span><span class="nf">Pool</span><span class="p">(</span><span class="n">NUM_PROCESSES</span><span class="p">)</span>

<span class="n">block_keys</span> <span class="o">=</span> <span class="n">pool</span><span class="p">.</span><span class="nf">imap_unordered</span><span class="p">(</span><span class="n">blocking_function</span><span class="p">,</span> 
                                 <span class="n">data_stream</span><span class="p">)</span>
</code></pre></div></div>

<p>But this ends up not being very useful, because of interprocess communication.</p>

<p>Basically, the multiprocessing <code class="language-plaintext highlighter-rouge">imap</code> works like this: the parent
process will pull a chunk of the data from the data stream, pickle
the data to a byte-string, and send the pickled data over a socket to
a child process. The child process will listen on the socket, unpickle
the chunk of data, apply the blocking function, then pickle the resulting
block keys and communicate the bytestring back to the parent process,
which, finally, deserializes it.</p>

<p>If the actual application of the blocking function is computationally
cheap, then all the benefits of having more than one core working on
the problem are overwhelmed by the overhead of all that serializing and
deserializing.</p>
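<p>Here is a small sketch of that round trip, done in a single process so the serialization steps are easy to see. The blocking function and the record layout are stand-ins I made up for illustration. The real library’s predicates are more varied.</p>

```python
import pickle


def blocking_function(record):
    """A deliberately cheap block key: the first seven characters."""
    return record["address"][:7]


def round_trip(chunk):
    """Mimic what one chunk goes through on its way to a worker and back."""
    payload = pickle.dumps(chunk)       # parent serializes the records
    received = pickle.loads(payload)    # child deserializes them
    keys = [blocking_function(record) for record in received]
    reply = pickle.dumps(keys)          # child serializes the block keys
    # The parent deserializes the reply; also report the bytes shipped out.
    return pickle.loads(reply), len(payload)
```

<p>Four serialization steps surround a single string slice per record, so it is not surprising that the slicing is not the bottleneck.</p>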

<p>If the data is stored in a relational database, we could parallelize
by having each process separately connect to the database, pull their
own chunk of data from the database, calculate the block keys and
write those block keys for the chunk to the database. While we still
need to effectively serialize/deserialize in our communication with
the database, the number of times we need to pay for that overhead can
be hugely reduced and the per-call overhead will also typically be
much smaller than pickling/unpickling.</p>
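<p>A rough sketch of what that could look like with sqlite follows. The <code class="language-plaintext highlighter-rouge">data</code> and <code class="language-plaintext highlighter-rouge">block_keys</code> table layouts, the function names, and the id-range chunking are all assumptions for illustration. Nothing like this exists in the library today.</p>

```python
import sqlite3
from multiprocessing import Pool


def block_chunk(db_path, first_id, last_id):
    """Block the records with ids from first_id to last_id, inclusive."""
    # Each worker opens its own connection; only small arguments are pickled.
    con = sqlite3.connect(db_path, timeout=30)
    try:
        rows = con.execute(
            "SELECT record_id, address FROM data WHERE record_id BETWEEN ? AND ?",
            (first_id, last_id),
        ).fetchall()
        keys = [(address[:7], record_id) for record_id, address in rows]
        with con:  # one transaction, and one commit, per chunk
            con.executemany(
                "INSERT INTO block_keys (key, record_id) VALUES (?, ?)", keys
            )
    finally:
        con.close()
    return len(keys)


def block_in_parallel(db_path, chunks, processes=4):
    """Run block_chunk over (first_id, last_id) pairs in a process pool."""
    tasks = [(db_path, first_id, last_id) for first_id, last_id in chunks]
    with Pool(processes) as pool:
        return sum(pool.starmap(block_chunk, tasks))
```

<p>The only things crossing the process boundary are a path and two integers going out, and a count coming back. One caveat is that sqlite allows only one writer at a time, so the inserts still take turns.</p>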

<h3 id="pushing-operations-to-the-data-layer">Pushing operations to the data layer</h3>

<p>If we can build off a relational database, then some of the work that the
library is currently doing fairly slowly in Python could be done much
more quickly in the data layer.</p>

<p>Blocking, again, is a clear example.</p>

<p>Many of our blocking functions are very simple. For example, take the
first seven characters of the address field. Even with a good parallel
processing model, it will be very hard for a Python solution to beat</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">block_keys</span> <span class="p">(</span><span class="k">key</span><span class="p">,</span> <span class="n">record_id</span><span class="p">)</span>
<span class="k">SELECT</span>
    <span class="k">substring</span><span class="p">(</span><span class="n">address</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">7</span><span class="p">),</span>
    <span class="n">record_id</span>
<span class="k">FROM</span>
    <span class="k">data</span><span class="p">;</span>
</code></pre></div></div>
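<p>From Python, running that statement is a single call. In this sketch the function name is made up, the table layouts are assumed, and I use <code class="language-plaintext highlighter-rouge">substr</code>, which is the long-standing sqlite spelling of the function.</p>

```python
import sqlite3


def block_in_database(con):
    """Fill block_keys with the first seven characters of each address."""
    with con:  # commit on success, roll back on error
        con.execute(
            """
            INSERT INTO block_keys (key, record_id)
            SELECT substr(address, 1, 7), record_id
            FROM data
            """
        )
```

<p>No record is ever turned into a Python object here. The whole pass happens inside sqlite.</p>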

<h3 id="collateral-benefits">Collateral benefits</h3>

<p>While the lowest level dedupe methods operate over streams of
dictionaries, the high level API assumes that the data is represented as
a Python dictionary, necessarily stored in memory. This requirement
means that the users of the high level API face a limit in how much
data they can process as it has to fit in memory.</p>

<p>Moving to an architecture that requires the data to be stored in a
relational database would remove that restriction.</p>

<h2 id="downsides-of-a-sqlite-data-layer">Downsides of a sqlite data layer</h2>

<h3 id="type-inflexibility">Type Inflexibility</h3>

<p>Python objects are very flexible, and dedupe has taken advantage of
that by having comparators that work over numbers, strings, tuples,
and sets.</p>

<p>Relational databases do not have that flexibility. Core sqlite
supports floats, integers, strings, and bytes, and that’s basically
it.</p>

<p>If we use sqlite as the database layer, then we will either lose the
ability to compare some types of objects, have to use type
adaptation, or have to represent the data more indirectly.</p>

<p><a href="https://docs.python.org/3.10/library/sqlite3.html#how-to-adapt-custom-python-types-to-sqlite-values">Type adaptation</a>
is another layer of serialization and deserialization on top of the
serialization from sqlite data to python objects. Type adaptations are
written in Python and can introduce significant overhead.</p>
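<p>As a minimal sketch of type adaptation, here is one way a set of strings could be stored as JSON text and read back as a set. The <code class="language-plaintext highlighter-rouge">strset</code> type name and the helper names are my own invention. The <code class="language-plaintext highlighter-rouge">register_adapter</code> and <code class="language-plaintext highlighter-rouge">register_converter</code> hooks come from the standard library’s <code class="language-plaintext highlighter-rouge">sqlite3</code> module.</p>

```python
import json
import sqlite3


def adapt_set(value):
    """Python set to sqlite TEXT: a sorted JSON array."""
    return json.dumps(sorted(value))


def convert_set(blob):
    """sqlite bytes back to a Python set."""
    return set(json.loads(blob))


# Registration is global to the sqlite3 module.
sqlite3.register_adapter(set, adapt_set)
sqlite3.register_converter("strset", convert_set)


def connect(path=":memory:"):
    """Open a connection that converts columns declared as strset."""
    return sqlite3.connect(path, detect_types=sqlite3.PARSE_DECLTYPES)
```

<p>Every value that goes through these hooks pays for a JSON encode or decode in Python, which is the overhead I’m worried about.</p>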

<p>Collection objects like tuples, arrays, and sets can also be
represented in sqlite as normalized tables. This strategy would
significantly increase the complexity of the code of serializing and
deserializing a record from python to sqlite and back. On the other
hand, if the data is represented in a normalized form in the database
then some of the dedupe operations over collections could be pushed
down into the database layer.</p>
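<p>For instance, if a set-valued field lived in its own table, counting the elements two records share becomes a join. The <code class="language-plaintext highlighter-rouge">record_tags</code> table and both function names below are hypothetical. This is a sketch of the idea, not a proposal for the schema.</p>

```python
import sqlite3


def store_tags(con, record_id, tags):
    """Write one row per element of a record's set-valued field."""
    with con:
        con.executemany(
            "INSERT INTO record_tags (record_id, tag) VALUES (?, ?)",
            [(record_id, tag) for tag in sorted(tags)],
        )


def shared_tag_count(con, id_a, id_b):
    """Count the elements two records have in common, inside sqlite."""
    (count,) = con.execute(
        """
        SELECT count(*)
        FROM record_tags AS a
        JOIN record_tags AS b ON a.tag = b.tag
        WHERE a.record_id = ? AND b.record_id = ?
        """,
        (id_a, id_b),
    ).fetchone()
    return count
```

<p>The cost shows up in <code class="language-plaintext highlighter-rouge">store_tags</code>: one Python value becomes many rows, and reading a full record back means reassembling it from several tables.</p>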

<p>Type adaptation is probably the best near-term strategy.</p>

<h3 id="table-definition-generation">Table definition generation</h3>

<p>If we want to keep something like our existing API, then we need code
to automatically convert a Python dictionary of records to a sqlite
table, which is going to require inferring a SQL table definition from
the Python data.</p>

<p>This is complex. There is some reasonable prior art to follow from
<a href="https://github.com/pandas-dev/pandas/blob/8dab54d6573f7186ff0c3b6364d5e4dd635ff3e7/pandas/io/sql.py#L1845-L1986">pandas</a> or
<a href="https://github.com/simonw/sqlite-utils/blob/fc221f9b62ed8624b1d2098e564f525c84497969/sqlite_utils/db.py#L744">sqlite-utils</a>,
but we would need to write our own, and it’s a boring and bug-prone
piece of functionality.</p>
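<p>To give a feel for the problem, here is a deliberately naive sketch of the inference. It is not how pandas or sqlite-utils do it. The function names and the widening rules are assumptions I picked for illustration.</p>

```python
SQLITE_TYPES = {bool: "INTEGER", int: "INTEGER", float: "REAL", str: "TEXT", bytes: "BLOB"}


def quote(identifier):
    """Quote a table or column name for use in SQL."""
    return '"' + identifier.replace('"', '""') + '"'


def infer_create_table(table_name, records):
    """Guess a CREATE TABLE statement from a dict of record_id to record."""
    columns = {}
    for record in records.values():
        for field, value in record.items():
            if value is None:
                columns.setdefault(field, None)  # type still unknown
                continue
            sql_type = SQLITE_TYPES.get(type(value), "TEXT")
            previous = columns.get(field)
            if previous is None or previous == sql_type:
                columns[field] = sql_type
            elif {previous, sql_type} == {"INTEGER", "REAL"}:
                columns[field] = "REAL"  # widen mixed numbers
            else:
                columns[field] = "TEXT"  # give up on mixed types
    definitions = ", ".join(
        f"{quote(field)} {sql_type or 'TEXT'}" for field, sql_type in columns.items()
    )
    return f"CREATE TABLE {quote(table_name)} ({definitions})"
```

<p>Even this toy has to make judgment calls about missing values, mixed types, and unsupported objects, and each of those calls is a place for bugs to hide.</p>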

<p>If we started our library with an existing sqlite table or pandas data
frame, then we could sidestep much of this, but that would be a very
big departure from the existing API and would also impose limits
on type flexibility.</p>

<h3 id="supporting-multiple-database-engines">Supporting multiple database engines</h3>

<p>While I would recommend that the core library be written with sqlite as the database, many users will want a different database engine.</p>

<p>If we support that, then that will be another layer of complexity to manage. Something like sqlalchemy may be able to help some, but we are going to need to write engine-specific queries to take advantage of the different facilities of different engines.</p>]]></content><author><name></name></author><category term="tech" /><summary type="html"><![CDATA[Thinking through the benefits and costs of sqlite as a data layer for the dedupe library]]></summary></entry><entry><title type="html">The MGDO Stack</title><link href="/2022/11/20/mgdo-stack.html" rel="alternate" type="text/html" title="The MGDO Stack" /><published>2022-11-20T00:00:00+00:00</published><updated>2022-11-20T00:00:00+00:00</updated><id>/2022/11/20/mgdo-stack</id><content type="html" xml:base="/2022/11/20/mgdo-stack.html"><![CDATA[<p>Over the past year, I’ve refined a stack for my personal projects that has been productive and fun.</p>

<ol>
  <li>Makefile to produce a sqlite database</li>
  <li>Github Actions as an ETL and scraping platform</li>
  <li>Datasette as a public data warehouse</li>
  <li>Observable for data analysis and visualisation</li>
</ol>

<h2 id="makefile-for-a-sqlite-database">Makefile for a sqlite Database</h2>
<p>For each dataset, I’ll make a repository that turns source data into a sqlite database with a single <code class="language-plaintext highlighter-rouge">make</code> command. The repositories <a href="https://github.com/fgregg/warehouse-etl">follow this template</a>.</p>

<p><a href="https://csvkit.readthedocs.io/en/latest/">csvkit</a>, <a href="https://sqlite-utils.datasette.io/en/stable/">sqlite-utils</a>, and <a href="https://pypi.org/project/csvs-to-sqlite/">csvs-to-sqlite</a> are often the workhorses of the ETL code. Thanks <a href="https://twitter.com/onyxfish">Christopher Groskopf</a>, <a href="http://www.jamespetermckinney.com/">James McKinney</a> and <a href="https://fedi.simonwillison.net/@simon">Simon Willison</a>!</p>

<p>Here are some examples:</p>
<ul>
  <li><a href="https://github.com/fgregg/ilcampaigncash">Illinois Campaign Finance Database</a></li>
  <li><a href="https://github.com/Chicago-Data-Collaborative-Schools/locations-boundaries">Boundaries and Locations of Chicago Public Schools</a></li>
</ul>

<h2 id="github-actions-for-etl-and-scraping">Github Actions for ETL and Scraping</h2>
<p><a href="https://docs.github.com/en/actions/learn-github-actions/understanding-github-actions">GitHub Actions</a> is almost the perfect platform for running ETL jobs and web scraping. It has just about everything you could want.</p>

<ul>
  <li>Schedulable jobs</li>
  <li>Execution on demand</li>
  <li>Red / Green dashboard for job success</li>
  <li>Email notifications if something goes wrong</li>
  <li>Execution lives next to code</li>
  <li>Serverless</li>
  <li>Simple management of secrets</li>
  <li>Storage of large artifacts</li>
  <li>Parallelism</li>
  <li>A large user base</li>
  <li>Free! (For public repositories)</li>
</ul>

<p>The only real limitation I’ve run into is that execution time for a single job is limited to six hours, which can be constraining for large scrapes. Getting around this can take some creativity. Often the best solution is to split the job into smaller bites and run many parallel jobs.</p>
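<p>One simple way to cut a scrape into bites is to give every parallel job the same list of work and its own index. The helper below is a hypothetical sketch. How the index reaches the script (for example, from a workflow matrix) is up to you.</p>

```python
def shard(items, n_shards, shard_index):
    """Return the slice of items that job number shard_index should handle."""
    # Striding keeps shards balanced even when n_shards does not divide evenly.
    return items[shard_index::n_shards]
```

<p>Every item lands in exactly one shard, so the jobs never overlap and nothing gets skipped.</p>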

<p>Another small challenge is what to do with the large artifacts produced by an ETL. What I do is manually create a release on the github repository, and then use <a href="https://github.com/WebFreak001/deploy-nightly">this github action</a> to stuff the artifacts in the release. I bet I could smooth this over if I wrote a custom Github Action but I haven’t tried yet.</p>

<p>The limit on artifact size attached to a release is 100Gb which has been quite enough so far.</p>

<p>Here’s how I set up <a href="https://github.com/fgregg/warehouse-etl/blob/main/.github/workflows/build.yml">the Github Actions script</a>.</p>

<h3 id="private-repositories">Private Repositories</h3>
<p>For private repositories, you get 2000-3000 minutes of execution time for free a month, depending on your account type, and then GitHub charges $0.008 per minute after that.</p>

<p>That can get expensive, but GitHub allows you to dispatch github action jobs on your servers. Azure spot instances + <a href="https://cirun.io/">cirun.io</a> makes intensive use of GitHub actions on private repositories affordable.</p>

<p>That GitHub is owned by Microsoft, that I can pay for GitHub Actions, and that I also have an option to pay someone else for the server time are all some comfort about the persistence of the service.</p>

<h2 id="datasette-as-a-public-data-warehouse">Datasette as a public data warehouse</h2>
<p>If you are building things for the web, you need to take <a href="https://en.wikipedia.org/wiki/SQL_injection">extraordinary care to prevent users of your website from making arbitrary queries</a> against your database. The core conceit of Simon Willison’s <a href="https://datasette.io/">Datasette</a> project is “What if you didn’t?”</p>

<p>Datasette allows unauthorized users to make arbitrary <code class="language-plaintext highlighter-rouge">SELECT</code> queries against sqlite databases, and that ends up being a really powerful thing to do.</p>

<p>I use it to collect all the sqlite databases that I build into <a href="https://labordata.bunkum.us">publicly</a> <a href="https://puddle.bunkum.us">accessible</a> <a href="https://data.thefoiabakery.org">data warehouses</a>. Folks can ask their own questions of the data, share queries, or download the entire databases.</p>

<p>To my mind, the most important feature of Datasette is that for any query, you can get the results back as JSON. This means the website provides a JSON API that uses SQL directly. It’s amazing.</p>
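<p>Here is a small sketch of how a client might build one of those JSON URLs. The helper name is made up, and the database name you pass in has to match whatever the Datasette instance actually serves.</p>

```python
from urllib.parse import urlencode


def datasette_query_url(base_url, database, sql):
    """Build the URL that returns a SQL query's rows as a JSON array."""
    # _shape=array asks Datasette for a plain list of row objects.
    query = urlencode({"sql": sql, "_shape": "array"})
    return f"{base_url.rstrip('/')}/{database}.json?{query}"
```

<p>Fetching that URL with any HTTP client gives you rows you can feed straight into a chart or another script.</p>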

<p>I have GitHub Actions that run nightly to collect all the databases and push the data and code to Google Cloud Run, a scale-to-zero platform. I have Cloudflare set up in front of that, so I’m able to host and serve 10s of GB of data a month for less than $5/month.</p>

<p>Here’s what the <a href="https://github.com/labordata/warehouse/blob/main/.github/workflows/build.yml">Github Actions file</a> looks like for the <a href="https://labordata.bunkum.us">labordata.bunkum.us</a> warehouse.</p>

<h2 id="observable-for-data-analysis-and-visualisation">Observable for data analysis and visualisation</h2>
<p><a href="https://observablehq.com">Observable</a> is a lyrical platform for writing JavaScript notebooks for data analysis and visualisation. It has excellent support for working with databases and Datasette instances (using the JSON API I mentioned above).</p>

<p>Many of these notebooks are updated automatically, as the GitHub
Actions create updated databases, which are pulled into the Datasette
warehouses.</p>

<p>Being able to do arbitrarily complicated SQL queries across multiple tables and then work with the analysis and visualisation all on a reactive front-end makes things very, very fast to build.</p>

<p>Here are some examples:</p>

<ul>
  <li><a href="https://observablehq.com/@fgregg/distribution-of-days-from-filing-to-first-election">Distribution of days from filing to election</a></li>
  <li><a href="https://observablehq.com/d/1f3c5386c65501bf">CPS and Illinois report different graduation rates for Chicago high schools</a></li>
  <li><a href="https://observablehq.com/@fgregg/new-contracts-reported-by-anti-labor-consultants-in-lm-20-fi">New contracts reported by anti-labor consultants in LM-20 Filings</a></li>
</ul>

<p>I’m a bit worried about the free lunch ending with Observable some
day, but for now it’s a pleasure.</p>]]></content><author><name></name></author><category term="tech" /><summary type="html"><![CDATA[Small pieces loosely joined]]></summary></entry><entry><title type="html">January 28, 2022 - Weeknotes</title><link href="/2022/01/28/weeknotes.html" rel="alternate" type="text/html" title="January 28, 2022 - Weeknotes" /><published>2022-01-28T00:00:00+00:00</published><updated>2022-01-28T00:00:00+00:00</updated><id>/2022/01/28/weeknotes</id><content type="html" xml:base="/2022/01/28/weeknotes.html"><![CDATA[<h3 id="labor-data">Labor Data</h3>
<p>I want to do a little analysis that decomposes the decline of overall
union density into sectoral changes in overall employment and
within-sector changes in union density.</p>
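
<p>A minimal sketch of one way to do that decomposition, assuming a standard shift-share approach with midpoint weights (this is an illustration, not necessarily the method I will end up using). Overall density is the employment-share-weighted sum of sector densities, so its change splits exactly into a between-sector part (shares change) and a within-sector part (densities change). The function name and the dictionary layout are hypothetical.</p>

```python
def decompose_density_change(shares_0, density_0, shares_1, density_1):
    """Split the change in overall union density into two parts.

    shares_*  : dict mapping sector to its share of total employment
    density_* : dict mapping sector to its union density
    Returns (between, within); between + within equals the total change.
    """
    between = 0.0
    within = 0.0
    for sector in shares_0:
        d_share = shares_1[sector] - shares_0[sector]
        d_density = density_1[sector] - density_0[sector]
        mean_share = (shares_0[sector] + shares_1[sector]) / 2
        mean_density = (density_0[sector] + density_1[sector]) / 2
        # Employment moving between sectors, at the average density.
        between += d_share * mean_density
        # Density changing inside a sector, at the average share.
        within += mean_share * d_density
    return between, within
```

<p>Using midpoint averages as weights means there is no leftover interaction term, so the two parts always add up to the total change.</p>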

<p>The Bureau of Labor Statistics has the <a href="https://www.bls.gov/webapps/legacy/cpslutab3.htm">data on union density broken
down by sector, over
time</a>. They even
have a pretty nice <a href="https://www.bls.gov/bls/api_features.htm">web API</a>
to get that data in a convenient format.</p>

<p>Unfortunately, the web API does not have
<a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/CORS">CORS</a> set up,
so I can’t pull that data into Observable. This is, unfortunately, a
common problem of many government web APIs.</p>

<p>If an API does not have CORS set up, then that effectively blocks any
other website from using that data, without complicated workarounds. That
negates the main purpose of setting up a web API in the first place.</p>

<p>In an overwrought digression, I’m setting up a <a href="https://bls-api.bunkum.us">proxy to the BLS
API</a> that sets the right CORS headers so that
data from the BLS can be pulled into other sites.</p>

<p>Prompted by a <a href="https://twitter.com/eads/status/1486027015861985282">suggestion by David Eads</a>, the proxy is a <a href="https://workers.cloudflare.com/">Cloudflare worker</a>. It was pretty easy to get set up, because Cloudflare already had a recipe
for making a <a href="https://developers.cloudflare.com/workers/examples/cors-header-proxy">CORS proxy</a>.</p>

<p>I’m adapting this recipe to specialize it for the BLS site and to
cache queries so that I can share it with others without feeling
worried that the project will put too big a load on the BLS’s API.</p>
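
<p>The real proxy is a Cloudflare worker, but the idea can be sketched in a few lines of Python. This is only an illustration: <code>proxy</code>, <code>fetch_upstream</code>, and the dictionary cache are hypothetical names, and the header values are one permissive choice, not necessarily what the worker sends.</p>

```python
# Headers that tell a browser any origin may read the response.
CORS_HEADERS = {
    "Access-Control-Allow-Origin": "*",
    "Access-Control-Allow-Methods": "GET, POST, OPTIONS",
}


def proxy(url, fetch_upstream, cache):
    """Return (status, headers, body) for url with CORS headers added.

    fetch_upstream(url) must return a (status, headers, body) tuple.
    cache is any dict-like object; a repeated query is answered from it,
    so the upstream API is only hit once per distinct URL.
    """
    if url not in cache:
        cache[url] = fetch_upstream(url)
    status, headers, body = cache[url]
    # Copy the upstream headers and layer the CORS headers on top.
    return status, {**headers, **CORS_HEADERS}, body
```

<p>A real cache would also need an expiry policy, which this sketch leaves out.</p>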

<p>I have the guts of it working, and <a href="https://github.com/fgregg/bls-proxy">put the code on
GitHub</a>. I need to write a
landing page, refactor the code a little bit, add some docs, and then
I’ll be good to push it out.</p>]]></content><author><name></name></author><category term="tech" /><summary type="html"><![CDATA[Weeknotes for January 28, 2022]]></summary></entry><entry><title type="html">January 21, 2022 - Weeknotes</title><link href="/2022/01/21/weeknotes.html" rel="alternate" type="text/html" title="January 21, 2022 - Weeknotes" /><published>2022-01-21T00:00:00+00:00</published><updated>2022-01-21T00:00:00+00:00</updated><id>/2022/01/21/weeknotes</id><content type="html" xml:base="/2022/01/21/weeknotes.html"><![CDATA[<h4 id="dedupe">Dedupe</h4>
<p>I did a <a href="https://github.com/dedupeio/dedupe/issues?page=2&amp;q=is%3Aissue+is%3Aclosed+closed%3A2022-01-15..2022-01-23">lot of gardening of the dedupe
library</a>,
and cut the 2.0.9 version of the library. The big enhancement in this
release was refactoring the <a href="https://github.com/dedupeio/dedupe/pull/936/files#diff-0af8d57e51708aa45e057ec83aa026a76f6750db803a41edf86054c80e54cc34">parallelization of scoring
pairs</a>.</p>

<p>Previously, worker processes pulled chunks of record pairs off a
queue, scored them, and wrote the results to a memmapped numpy
array. Each chunk got its own array backed by a separate file. Then,
the worker would put the name of the file into a result queue.</p>

<p>A collector process would read those filenames from the result queue,
open those files, and copy the content into a single, big memmapped
numpy array.</p>

<p>I got rid of that collector process and now the scoring workers write
directly to the same memmapped numpy array. The workers need to know
where in the array they should write, and they communicate that
through a shared memory value, with a built-in locking mechanism.</p>
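
<p>A minimal sketch of that coordination, assuming a standard-library stand-in for the memmapped numpy array (an anonymous <code>mmap</code> viewed as doubles). The function names are hypothetical and this is not the actual dedupe code. Each worker claims a slice by advancing a shared counter under its lock, then writes into that slice without holding the lock.</p>

```python
import array
import mmap
import multiprocessing


def make_shared_scores(n_pairs):
    """Allocate an anonymous memory map and view it as n_pairs doubles."""
    buffer = mmap.mmap(-1, n_pairs * 8)
    return memoryview(buffer).cast("d")


def reserve(next_free, n):
    """Atomically claim n slots and return the start index of the slice."""
    with next_free.get_lock():
        start = next_free.value
        next_free.value += n
    return start


def write_scores(scores, next_free, chunk_scores):
    """Write one chunk of scores into its own reserved slice."""
    chunk = array.array("d", chunk_scores)
    start = reserve(next_free, len(chunk))
    # No lock needed here: no other worker can be handed this slice.
    scores[start:start + len(chunk)] = chunk
    return start
```

<p>Here <code>next_free</code> would be created once in the parent as <code>multiprocessing.Value("q", 0)</code> and handed to every worker. The lock is held only long enough to bump the counter, so workers rarely wait on each other.</p>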

<p>Python’s parallelization paradigm is based on multiple
processes. Inter-process communication can often cost more than the
benefits of using multiple cores, and this change gets rid of two points of
inter-process communication.</p>

<p>It also, in my opinion, makes it easier to
follow what’s going on.</p>

<p>That said, it probably won’t change performance hugely, since we
were just communicating filepaths before and those are quite small.
But still, it’s a nice simplification.</p>

<h4 id="musing-on-even-less-communication">Musing on even less communication</h4>

<p>The real win in performance would be to have the workers produce the 
record pairs themselves. I’ve been thinking about that for a while, but
have not found an elegant way to do it.</p>

<p>The key problem is how to get workers access to the data. By default, we
represent the record set as a Python dictionary. This dictionary
can be quite big, and so we don’t want to make a copy for each
process because that could take up too much memory.</p>

<p>For the scoring workers, this dictionary should be read-only. If we
were <a href="https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods">creating processes through
forks</a>,
then we could set up the workers so they all had access to the same
dictionary (i.e. same object in memory).</p>

<p>However, forking is not available on Windows and is no longer the default start method on Mac OS, where it is considered unsafe, so we can’t count on sharing parent memory in this way.</p>

<p>So, if we can’t use shared memory, then the second best alternative is
to hold the data in a database that each process can access
independently.</p>

<p>There’s a lot that’s attractive about making that move. We are already
using sqlite extensively in the library.</p>

<p>In order to not radically change the API, we would need to be able to
load the records from a data dictionary into sqlite tables. The
problem is that right now the fields of a record can be of any type,
and sqlite does not have a very rich set of native types.</p>

<p>We have comparators for string, integer, and float fields. Sqlite
has native support for those. We also have comparators for tuples and
sets and allow people to create arbitrary comparators that could take
arbitrary types.</p>

<p>The <a href="https://www.sqlite.org/json1.html"><code class="language-plaintext highlighter-rouge">json1</code></a> extension might be a
reasonable way of handling tuples and sets. I could also potentially 
handle this by normalizing these array-ish values into a separate table.</p>
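
<p>A rough sketch of the JSON option, assuming sqlite’s JSON functions are available in the build that Python links against. The table layout, field names, and function names are all hypothetical. Strings and integers go into native columns, while a set or tuple field is stored as a JSON array that <code>json_each</code> can unpack inside a query.</p>

```python
import json
import sqlite3


def load_records(conn, records):
    """Copy a data dictionary of records into a sqlite table."""
    conn.execute(
        "CREATE TABLE records "
        "(record_id TEXT PRIMARY KEY, name TEXT, age INTEGER, tags TEXT)"
    )
    rows = [
        # Sets and tuples have no native sqlite type, so store a JSON array.
        (record_id, rec["name"], rec["age"], json.dumps(sorted(rec["tags"])))
        for record_id, rec in records.items()
    ]
    conn.executemany("INSERT INTO records VALUES (?, ?, ?, ?)", rows)


def records_with_tag(conn, tag):
    """Return ids of records whose tags array contains the given tag."""
    query = """
        SELECT record_id
        FROM records, json_each(records.tags)
        WHERE json_each.value = ?
        ORDER BY record_id
    """
    return [row[0] for row in conn.execute(query, (tag,))]
```

<p>Sorting before serializing gives sets a stable representation, at the cost of losing the original order for tuples, so a real version would need to treat the two differently.</p>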

<p>It might be acceptable to limit the types that custom comparators can
handle to strings, integers, floats, and arrays of those types.</p>

<p>Hmm… maybe it wouldn’t be too bad.</p>

<p>If we took this step, it would open us up to moving some of the
blocking and even record comparison into sqlite, which could be quite
nice.</p>

<h3 id="remarkable-2">Remarkable 2</h3>
<p><a href="https://twitter.com/forestgregg/status/1482503176934891521">I asked twitter for recommendations for e-readers that handled pdfs
of articles
well</a>. The
resounding recommendation was for a Remarkable 2. I ordered one with
my DataMade annual office-equipment stipend, and it came on
Thursday. So far, I really like it. It’s a cool, in a McLuhan
sense, piece of technology.</p>]]></content><author><name></name></author><category term="tech" /><summary type="html"><![CDATA[Weeknotes for January 21, 2022]]></summary></entry></feed>