Blarg

2025-07-09

Welcome

This is Blarg, version 3 (probably). It used to be an extremely complicated contraption involving auto-generated Makefiles. But I think even if I write several entries a day for the rest of my life, I’ll never have more than 100,000 files to deal with, and it just doesn’t seem worth the complexity. So now, I think a better way to Blarg is to just have a Jupyter notebook that builds the whole thing in a forward pass. Focus on writing, not perfect efficiency!

Why not Quarto? Well, because I want to learn. I want the output to be as simple, minimal, and understandable as possible, and usually the best way to do that is by removing tools from the stack instead of adding more.

Quarto’s popularity for technical publishing looks like it’s really the growing popularity of Pandoc — or, more fairly, Quarto is making Pandoc easy enough to use that more people are using it. Why not just use Pandoc directly?

Globals

Here is some stuff we’ll need:

import json
import os
from contextlib import chdir
from mimetypes import add_type, guess_type
from pathlib import Path
from shutil import copyfile
from subprocess import PIPE, run
from typing import Iterator
from urllib.parse import quote
from urllib.request import urlopen

import bibtexparser
import pandas as pd
import toolz.curried as tz
from pandas import DataFrame, Series, Timestamp
from slugify import slugify
from tqdm.auto import tqdm

Root URL of the site. In general, the build process attempts to use relative URIs everywhere. This is currently only used for the feed generation.

SITEURL = "https://danielgrady.net"

Support files for Pandoc: filters and templates.

PANDOCDATA = str(Path.cwd() / "pandoc-data")
PANDOCDATA

The path to the hierarchy of source files.

ROOT = Path.cwd().parent.parent
ROOT

BibLaTeX bibliography.

REF = ROOT / "ref/references.bib"

Directory to write out the built site.

OUT = ROOT / ".build"
OUT.mkdir(exist_ok=True)
OUT

Directory for cache files.

CACHE = ROOT / ".cache"
CACHE.mkdir(exist_ok=True)
CACHE

Path to use for automatically generated BibLaTeX entries.

REFAUTO = CACHE / "references-auto.bib"

Blarg ignores files and directories with leading dots, and it also ignores the following top-level directories:

IGNOREDIRS = [str(ROOT / p) for p in ["ref", "template"]]

Indexing

The first step is to get a comprehensive index of all the source files. Create a table of every file, and then split the table into separate indexes for documents and other files, called “assets.”

Identify documents using IANA media types. This puts all the logic around file extensions and such into one place.

add_type("text/markdown", ".md")
add_type("text/markdown", ".mdown")
add_type("text/markdown", ".markdown")
add_type("text/x-org", ".org")
add_type("application/ipynb+json", ".ipynb")

This dictionary maps the IANA mediatypes that Blarg considers to be "documents" to the Pandoc reader format string to use for parsing the document.

DOCUMENT_MEDIATYPES = {
    "text/markdown": "markdown+wikilinks_title_after_pipe",
    "application/ipynb+json": "ipynb+wikilinks_title_after_pipe",
    "text/x-org": "org",
}
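
As a quick check of the registrations above, guess_type should now resolve document extensions to mediatypes that appear in this table. (A minimal sketch; the file names are made up, only the extensions matter.)

for name in ["entry.md", "analysis.ipynb", "todo.org", "figure.png"]:
    mediatype, _ = guess_type(name)
    print(name, "->", mediatype, "->", DOCUMENT_MEDIATYPES.get(mediatype, "not a document"))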

Index all input files

“Indexing” an individual file just means looking up file-level metadata from the filesystem and guessing the file’s media type with the standard library’s mimetypes module.

def load_file_metadata(p: Path) -> dict:
    """
    General metadata for a file

    Fields are named as in `stat`:

    - st_birthtime: when the file was created
    - st_atime: last access time
    - st_mtime: file contents modified
    - st_ctime: on macOS, file metadata modified
    """
    stat = p.stat()
    mediatype, compression = guess_type(p)
    result = {
        "path": p,
        "mediatype": mediatype,
        "compression": compression,
        "size": stat.st_size,
        "st_birthtime": stat.st_birthtime,
        "st_atime": stat.st_atime,
        "st_mtime": stat.st_mtime,
        "st_ctime": stat.st_ctime,
    }
    return result

Index all files under the root, ignoring directories and files with leading dots.

def files_under(root: Path) -> Iterator[Path]:
    """
    Yield paths to files in the hierarchy at `root`

    Yield only files, not directories

    Ignore files and directories with a leading dot
    """
    # This relies on a weird but documented and recommended behavior - modify the list of subdirs
    # inside the loop to inform `os.walk` to avoid certain subdirectories.
    for directory, subdirs, files in os.walk(root):
        if directory in IGNOREDIRS:
            # Skip this directory entirely, and don't descend into anything below it.
            subdirs.clear()
            continue

        # Prune hidden subdirectories in place so `os.walk` does not descend into them,
        # and drop hidden files from consideration.
        subdirs[:] = [p for p in subdirs if not p.startswith(".")]
        files = [p for p in files if not p.startswith(".")]

        dp = Path(directory)
        for file in files:
            yield dp.joinpath(file)
def index_tree(root: Path) -> DataFrame:
    """
    Create an index of files under ``root``

    Get filesystem metadata for each file, as well as inferred mimetypes and compression
    """
    idx = list()
    for p in files_under(root):
        idx.append(load_file_metadata(p))
    idx = DataFrame(idx)
    idx.insert(0, "relpath", idx["path"].apply(lambda p: p.relative_to(root)))
    return idx
idx = index_tree(ROOT)

Ignore certain kinds of files.

mask = idx["path"].apply(lambda p: p.suffix in (".canvas", ".pxm"))
idx = idx[~mask].set_index("path", drop=False).sort_index().copy()
idx.head(5)

Create asset and document indexes

Files are either assets or documents.

Assets are just copied to the site directory, with some slight modification to their parent path.

A document has additional, arbitrary metadata from the file’s frontmatter, and Blarg also infers or adjusts some metadata on top of that.

Documents can contain hyperlinks (pointing internally or externally), wiki links (pointing internally, resolved by fuzzy search), and citations.

Citations are identified with cite keys. A cite key is a URI, and might be listed in the bibliography.

is_doc = idx["mediatype"].isin(DOCUMENT_MEDIATYPES)

Yes. This is a good name.

assidx = idx[~is_doc].copy()
docidx = idx[is_doc].copy()

Documents have all the same indexing information as assets, and get other stuff in addition.

def load_document_metadata(p: Path, mediatype: str) -> dict:
    """
    Get metadata for a document

    This loads the information the document records about itself. The filesystem has other things to
    say about the file containing the document, not handled here.

    This function uses Pandoc to extract YAML front matter, and also a mapping that includes all
    cite keys, URL link targets, and eventually other things.

    The trick to making this work is using a Pandoc template that contains nothing except the
    `meta-json` template variable.
    """
    # fmt: off
    args = [
        "pandoc",
        "--from", DOCUMENT_MEDIATYPES[mediatype],
        "--to", "commonmark", "--standalone",
        "--data-dir", PANDOCDATA,
        "--template", "metadata.pandoctemplate",
        "--lua-filter", "analyze-document.lua",
        str(p),
    ]
    # fmt: on
    proc = run(args, check=True, stdout=PIPE)
    frontmatter = json.loads(proc.stdout)
    docmap = frontmatter["docmap"]
    del frontmatter["docmap"]
    for _, stuff in docmap.items():
        stuff["order"] = int(stuff["order"])
        stuff["level"] = int(stuff["level"])
    result = {"fm": frontmatter, "docmap": docmap}
    return result
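
For orientation, the return value for a single document looks roughly like this (a hypothetical sketch; the exact fields are whatever analyze-document.lua emits):

# Hypothetical shape of load_document_metadata's return value:
{
    "fm": {"title": "Some entry", "published": "2025-07-09"},
    "docmap": {
        "some-heading-id": {
            "order": 1,
            "level": 2,
            "cites": ["somekey2024"],
            "links": ["https://example.com"],
            "wikilinks": ["Another note"],
        },
    },
}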
docidx.iloc[0]
entry = docidx.iloc[0]
load_document_metadata(entry["path"], entry["mediatype"])
tmp = {p: None for p in docidx["path"]}
for _, entry in tqdm(docidx.iterrows(), total=len(docidx)):
    p = entry["path"]
    mt = entry["mediatype"]
    tmp[p] = load_document_metadata(p, mt)
docmeta = Series(tmp)
docidx["frontmatter"] = docmeta.apply(lambda d: d["fm"])
docidx["docmap"] = docmeta.apply(lambda d: d["docmap"])

At this point, docidx includes filesystem metadata, all of the document's frontmatter (if any), and a document map.

docidx.head(5)

Process the indexes

Next, calculate several pieces of derived metadata: publication timestamps, titles, site paths, and a map of wiki-link targets.

(NB The Atom specification works the other way around with respect to timestamps: updated is required, published is optional.)

Note that, in an earlier iteration of this notebook, log entries and notes were more clearly distinguished. Now they are (or should be) exactly the same, just different places to put things. The dates for log entries come from the front matter and the filesystem, not from the path to the entry.

There may be other fields present in the document front matter that will be rendered in the final output based on the template, for example subtitle.

pd.set_option('future.no_silent_downcasting', True)
# Convert, or assume, all timestamps to Pacific time

fmdates = docidx["frontmatter"].apply(
    lambda d: Timestamp(d["published"], tz="US/Pacific") if "published" in d else None
)

# This is a very annoying feature of Pandas. In `Timestamp.fromtimestamp(x)`, x is always an
# absolute POSIX timestamp. Calling the function like that returns a timezone-*naive* Timestamp, but
# where `x` has been converted to display in the running system's *local* time. Calling
# `Timestamp.fromtimestamp(x, tz=TZ)` returns a timezone-*aware* Timestamp, with x converted to that
# timezone.
fsdates = docidx["st_birthtime"].apply(
    lambda x: Timestamp.fromtimestamp(x, tz="US/Pacific")
)

tmp = fmdates.combine_first(fsdates)
tmp = pd.to_datetime(tmp)

docidx["published"] = tmp
shorttitles = docidx["path"].apply(lambda p: p.stem)
fmtitles = docidx["frontmatter"].apply(lambda d: d.get("title"))
titles = fmtitles.combine_first(shorttitles)
docidx["title"] = titles
docidx["shorttitle"] = shorttitles

Generate a "site path" for every asset and document. The site path is the absolute path to the resource, as accessed via HTTP. The actual output file will be at site path + "index.html".

For assets, the site path and the output file path are the same.

For documents, the normal case is: drop the file extension and slugify each component of the relative path.

There are two special cases for documents: a file named index, and a file whose name repeats its parent directory’s name. In both cases the final path component is dropped.

This accommodates the “Directory-based notes should repeat the directory name” convention, and the older index-file convention.

Relative path components are processed with slugify to get clean URL slugs.

Document metadata may override the generated slug, which will replace the final component of the site path.

For regular (non-document) files, all path components except the filename are slugified. I think this will handle the common case of support files that are stored as siblings of the document.

def relpath2sitepath(p: Path, is_document=True):
    if is_document:
        p = p.with_suffix("")
        parts = p.parts
        if (parts[-1] == "index") or (len(parts) > 1 and parts[-2] == parts[-1]):
            parts = parts[:-1]
        parts = tuple(slugify(pt) for pt in parts)
    else:
        parts = p.parts
        parts = tuple(slugify(pt) for pt in parts[:-1]) + (parts[-1],)
    sitepath = Path().joinpath(*parts)
    return sitepath
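
A few worked examples with made-up paths, showing how source paths become site paths:

for p, is_doc in [
    (Path("log/2025/My Note.md"), True),                # normal case: slugify every component
    (Path("projects/Blarg/Blarg.md"), True),            # name repeats the directory: drop the final component
    (Path("log/2025/index.md"), True),                  # index file: drop the final component
    (Path("projects/My Project/figure 1.png"), False),  # asset: keep the filename untouched
]:
    print(p, "->", relpath2sitepath(p, is_document=is_doc))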
docidx["sitepath"] = docidx["relpath"].apply(relpath2sitepath)
docidx["outpath"] = docidx["sitepath"].apply(lambda p: p.joinpath("index.html"))

assidx["sitepath"] = assidx["relpath"].apply(lambda p: relpath2sitepath(p, is_document=False))
assidx["outpath"] = assidx["sitepath"].copy()
docidx.head(5)
assidx.head(5)
# fmt: off
wikilink_targets = tz.pipe(
    docidx["docmap"],                   # Start with a list of all document maps; each maps header ID -> metadata
    tz.map(lambda dm: dm.values()),     # Extract just the metadata
    tz.concat,                          # Flatten the list of lists
    tz.map(tz.get("wikilinks")),        # Extract the wikilinks used under every heading
    tz.filter(None),                    # Remove empty sets
    tz.concat,                          # Flatten again
    tz.map(lambda s: s.split("#")[0]),  # Remove Obsidian-style heading references
    set,                                # Deduplicate
    list, Series,
)
# fmt: on

TODO: Should extend this to allow for wikilinks with absolute paths, for example:

wikilink_targets[lambda df: df.str.startswith("log/")]

TODO: And handle the case of references to static assets:

wikilink_targets[lambda df: df.str.contains("pdf")]

Create a mapping that goes from all the possible targets of wiki-style links to the corresponding sitepath in the output. The possible wiki-style link targets are the filename stems of all documents.

tmp_mapping = (
    docidx.join(docidx["relpath"].apply(lambda p: p.stem).rename("wikilink_target"))
    .drop_duplicates(subset=["wikilink_target", "sitepath"])
    .set_index("wikilink_target")["sitepath"]
    .sort_index()
)
tmp_mapping

I'm assuming that I've uniquely named all files.

assert tmp_mapping.index.is_unique
wikilink_map = wikilink_targets.map(tmp_mapping)
wikilink_map.index = wikilink_targets.values
wikilink_map = wikilink_map.dropna().sort_index().apply(str)
with open(CACHE / "wikilink-map.json", "w", encoding="UTF-8") as f:
    json.dump(wikilink_map.to_dict(), f)
wikilink_map

Generated stuff

Master bibliography and reference map

Every mentioned cite key or URL needs associated metadata.

First, find all the cite keys with manually prepared entries.

library = bibtexparser.parse_file(REF)
known_citekeys = [e.key for e in library.entries]
for e in library.entries:
    if "ids" in e:
        known_citekeys.append(e.get("ids").value)
known_citekeys = frozenset(known_citekeys)
len(known_citekeys)

Second, create a table that maps every mention of a cite key or URL to the sitepath + fragment where it's mentioned.

def docmap2mentions(d):
    result = []
    for fragment, data in d.items():
        for citekey in data["cites"]:
            result.append((fragment, citekey, "cite"))
        for link in data["links"]:
            result.append((fragment, link, "link"))
    return result
refmap = tz.pipe(
    docidx.iterrows(),
    tz.map(tz.get(1)),
    tz.map(lambda row: [(row["sitepath"], ) + t for t in docmap2mentions(row["docmap"])]),
    tz.concat,
    list,
    lambda lst: pd.DataFrame(lst, columns=["sitepath", "fragment", "uri", "type"])
)

refmap.sample(5, random_state=42)

Find all the mentioned cite keys that don't have a manually written entry.

mentioned_citekeys = frozenset(refmap[lambda df: df["type"].eq("cite")]["uri"])
len(mentioned_citekeys)
missing_keys = mentioned_citekeys - known_citekeys
missing_keys

Get bibliographic info for every missing cite key using Wikipedia's instance of Citoid, or the arXiv API directly. (Citoid does not seem to support arXiv article IDs.)

CITOID = "https://en.wikipedia.org/api/rest_v1/data/citation/bibtex/{query}"
ARXIV = "https://arxiv.org/bibtex/{query}"


def get_bibentry(query: str):
    url = ARXIV if query.startswith("arxiv:") else CITOID
    url = url.format(query=quote(query, safe=""))
    try:
        with urlopen(url) as f:
            data = f.read()
        result = data.decode("UTF-8")
        result = result.strip()
    except Exception:
        result = None
    return result
tmp = {k: get_bibentry(k) for k in tqdm(missing_keys)}
tmp
tmp2 = []
for k, v in tmp.items():
    if v is None:
        # The lookup failed; leave this key unresolved rather than crashing the build.
        continue
    library = bibtexparser.parse_string(v)
    for e in library.entries:
        e.key = k
        if k.startswith("arxiv:"):
            # arXiv-only publications are just puffed-up blog posts; don't dignify them.
            e.entry_type = "online"
        tmp2.append(e)

newlib = bibtexparser.Library(tmp2)
bibtexparser.write_file(str(REFAUTO), newlib)

Calendar

data = pd.date_range("2010-01-01", "2025-12-31")
data = pd.DataFrame({"date": data}, index=data).assign(
    year=data.year,
    month=data.month,
    day=data.day,
)
data = data.join(data["date"].dt.isocalendar().rename(columns=lambda s: f"week_{s}"))

data["Week"] = data["week_week"].copy()
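# Normalize ISO week numbers across calendar-year boundaries: early-January days that ISO
# assigns to the previous year's last week get Week 0, and late-December days that ISO assigns
# to the next year's week 1 get Week 54, so they appear at the start and end of their calendar
# year in the pivoted calendar instead of wrapping around.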
data.loc[data["year"] > data["week_year"], "Week"] = 0
data.loc[data["year"] < data["week_year"], "Week"] = 54

data
WEEKDAYS = {1: "M", 2: "T", 3: "W", 4: "R", 5: "F", 6: "S", 7: "U"}
KNOWN_DATES = frozenset(docidx["published"].dt.date)

def format_date(dt):
    if dt.date() in KNOWN_DATES:
        return f'<a href="/log/{dt.year}/{dt.date()}">{dt.day}</a>'
    else:
        return str(dt.day)
disp = (
    data.set_index(["year", "week_day", "Week"])["date"]
    .unstack()
    .sort_index(ascending=[False, True])
)

classes = pd.DataFrame(data="", index=disp.index.copy(), columns=disp.columns.copy())

# Anywhere the month to the left is not the same as the current month, add a class
m = disp.map(lambda dt: dt.month).ffill(axis=1).bfill(axis=1)
mask = m != m.shift(1, axis=1)
mask.loc[:, 0] = False  # Ignore the first column
classes[mask] = classes[mask] + "month-change-left "

# Anywhere the month above is not the same as the current month, except for Mondays, add a class
mask = m != m.shift(1)
mask.loc[(slice(None), 1), :] = False
classes[mask] = classes[mask] + "month-change-above "

sty = disp.style
sty.index.names = ["", ""]
sty.columns.name = ""
sty.format(format_date, na_rep="")
sty.format_index(lambda x: WEEKDAYS[x], axis=0, level=1)
sty.set_td_classes(classes)
sty.set_table_attributes('class="masterlog"')

None
def write_calendar(root: Path, calhtml: str):
    outpath = root / "log" / "index.html"
    outpath.parent.mkdir(exist_ok=True, parents=True)
    # fmt: off
    cmd = [
        "pandoc",
        "--from", "html", "--to", "html5", "--standalone", "--wrap", "none",
        "--data-dir", PANDOCDATA,
        "--mathjax",
        "--metadata", "title=Log",
        "--metadata", "date=" + Timestamp.now().date().isoformat(),
        "--output", str(outpath),
        "-"
    ]
    # fmt: on
    proc = run(cmd, input=calhtml.encode("UTF-8"), check=True)
    return proc
write_calendar(OUT, sty.to_html())

Update the site directory

“Incremental updates”: Most of the content managed by Blarg is one-to-one — one source file goes to one site path. In the barest and cheapest of nods to efficiency, Blarg checks for the existence of the target output file and compares modification times; if the output exists and is newer than the input, skip it. I think this does actually save real-world time, because rendering an entry involves a Pandoc subprocess, which is more time-consuming than a stat call.

(NB In an earlier version the check was using the output file's st_birthtime, but birth time is not updated if a file is overwritten in place, leading to a situation where the check should have skipped a file but did not.)

# Copy the site-wide CSS from the Pandoc templates directory to the site root. It lives in the
# Pandoc templates directory to prevent Pandoc from using its default CSS when generating HTML.
copyfile(Path(PANDOCDATA) / "templates" / "styles.css", OUT / "styles.css")

for _, entry in assidx.iterrows():
    outpath: Path = OUT / entry["outpath"]
    if outpath.exists() and entry["st_mtime"] <= outpath.stat().st_mtime:
        continue
    else:
        outpath.parent.mkdir(exist_ok=True, parents=True)
        copyfile(entry["path"], outpath)
def write_document_under(root: Path, doc: dict):
    outpath = root / doc["outpath"]
    outpath.parent.mkdir(exist_ok=True, parents=True)
    # fmt: off
    cmd = [
        "pandoc",
        "--from", DOCUMENT_MEDIATYPES[doc["mediatype"]],
        "--to", "html5", "--standalone", "--wrap", "none",

        "--data-dir", PANDOCDATA, "--mathjax",

        "--citeproc", "--bibliography", str(REFAUTO), "--bibliography", str(REF),
        "--csl", "chicago-fullnote-bibliography-short-title-subsequent.csl",

        "--filter", "blargify.py",
        "--lua-filter", "diagram.lua",

        "--extract-media=.",

        "--metadata", f"title={doc['title']}",
        "--metadata", f"date={doc['published'].date()}",
        # "--metadata", f"editlink={doc['editlink']}",

        "--output", str(outpath.name), str(doc["path"])
    ]
    # fmt: on
    with chdir(outpath.parent):
        proc = run(cmd, check=True)
    return proc
for _, entry in tqdm(docidx.iterrows(), total=len(docidx)):
    outpath: Path = OUT / entry["outpath"]
    if outpath.exists() and (entry["st_mtime"] < outpath.stat().st_mtime):
        continue
    else:
        write_document_under(OUT, entry)

Make an Atom feed

https://validator.w3.org/feed/docs/atom.html

FEED_HEADER = f"""\
<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://danielgrady.net</id>
	<title>Daniel Grady’s web log</title>
	<subtitle>∇⋅∇𝒴</subtitle>
	<author>
		<name>Daniel Grady</name>
		<uri>https://danielgrady.net</uri>
	</author>
	<link href="https://danielgrady.net/atom.xml" rel="self"/>
	<link href="https://danielgrady.net" rel="alternate"/>
	<logo>https://danielgrady.net/favicon.ico</logo>
	<updated>{Timestamp.now(tz='US/Pacific').isoformat(timespec='seconds')}</updated>
"""

TODO Add the actual content of the entries to the feed.

ENTRY_TEMPLATE = """
	<entry>
		<id>{uri}</id>
		<title>{title}</title>
		<link rel="alternate" href="{uri}"/>
		<published>{published}</published>
		<updated>{updated}</updated>
	</entry>
"""
feeditems = docidx[lambda df: ~df["sitepath"].eq(Path("."))]
feeditems = feeditems.sort_values("published", ascending=False)

feed = FEED_HEADER

for _, entry in feeditems.iterrows():
    tmp = ENTRY_TEMPLATE.format(
        uri=f"{SITEURL}/{entry['sitepath']}",
        title=entry["title"],
        published=entry["published"].isoformat(timespec="seconds"),
        updated=entry["published"].isoformat(timespec="seconds"),
    )
    feed += tmp

feed += "</feed>"
with open(OUT / "atom.xml", "w", encoding="UTF-8") as f:
    f.write(feed)