Files
vim-ale/CAT-RULES.md
AJ ONeal 28dab7dade feat: complete classification of all 116 packages (169,867 rows)
- Add asset_filter/asset_exclude conf keys for shared-repo packages
- Split hugo/hugo-extended: exclude/require "extended" in asset name
- Add macosx, ia32, .snap, .appx classifier patterns
- Fix zig Platform.Size JSON string type (was int64, upstream sends string)
- Filter install scripts, cosign keys, compat.json as meta-assets
- Add riscv64, loong64, armv5, mipsle, mips64le to buildmeta

Full classification produces 169,867 distributable rows across 116 packages.
2026-03-10 00:27:57 -06:00

5.1 KiB

Classification Rules & Learnings

Tracking classifier decisions and edge cases discovered during batch processing.

Batch 1 (go, node, hugo, caddy, pathman)

Vocabulary

All values in the CSV use buildmeta canonical names:

  • Arch: x86_64 (not amd64), aarch64 (not arm64), x86 (not 386/i386)
  • Format: .tar.gz (with leading dot), matching buildmeta.Format constants
  • OS: darwin (not mac/macos), dragonfly (not dragonflybsd)

Classifier Additions

From this batch, these patterns were added to the generic classifier:

OS:

  • mac (word boundary) → darwin (caddy uses mac_amd64)
  • openbsd, netbsd, dragonfly(?:bsd)?, plan9 → new OS types
  • .deb/.rpm → infer Linux when OS undetectable from filename

Arch:

  • 386 (word boundary) → x86 (Go naming convention)
  • 32bit/64bit (no hyphen) → x86/x86_64 (Hugo naming)
  • arm7 → armv7 (old caddy naming)
  • armhf → armv7 (Debian convention)
  • armv5 → new arch type
  • universal → universal2 (Hugo fat binaries)
  • riscv64, loong64, mipsle, mips64le → new arch types

Per-Source Normalizers

Each upstream API uses different naming. Normalizers convert to buildmeta vocabulary:

Source "amd64" "arm64" "arm"
Go API amd64→x86_64 arm64→aarch64 -
Node dist x64→x86_64 arm64→aarch64 -
Zig x86_64 (same) aarch64 (same) -
HashiCorp amd64→x86_64 arm64→aarch64 arm→armv6
Julia x86_64 (same) aarch64 (same) -
Chrome x64→x86_64 arm64→aarch64 -
MariaDB x86_64 (same) aarch64 (same) -

Meta-Asset Filtering

Skipped patterns: checksums (.sha256, .sha512, .md5, checksums.txt), signatures (.sig, .asc, .pem), SBOMs (.sbom, .spdx, .sigstore), source archives (_src.tar.gz), and buildable-artifact.

Edge Cases (Accepted)

  • Caddy v2 beta bare binaries (caddy2_beta12_macos) — no arch in filename, shows empty
  • Hugo macOS-all — means universal but only 2 files, not worth special-casing
  • Hugo extended editions — detected via extended in filename, tracked in extra column? (TODO: not yet)
  • Node "odd major = beta" heuristic — v15, v17, v19, v21, v23 are "current" not LTS
  • Go version prefix: stripped go from go1.23.61.23.6 for clean parsing

Batch 2 (zig, flutter, chromedriver, terraform, julia, iterm2, mariadb, gpg, serviceman, aliasman)

Zig Fetcher Fix

The zig upstream API returns "size" as a JSON string, not a number. Changed Platform.Size from int64 to json.Number to avoid unmarshal failures. Also changed Platforms tag from json:"-" to json:"platforms,omitempty" so platform data is preserved in cache.

Source-Only Packages

serviceman and aliasman have GitHub releases with empty assets:[]. These are source-only repos that install via go install or script download, not binary releases. The classifier correctly produces 0 distributables for them — they don't belong in the binary CSV.

Flutter Arch Detection

Early Flutter releases (pre-2020) had no arch-specific builds — single platform SDK. No arch in filename → empty arch in CSV. This is correct; the installer would default to x86_64 on supported platforms.

Batch 3 (25 packages: arc through gitdeploy)

New Classifier Patterns

  • macosx → darwin (syncthing uses macosx)
  • ia32 → x86 (dart-sass uses ia32)
  • .snap format → Linux-only
  • .appx format added for PowerShell

New Meta-Asset Filters

  • .pub (cosign keys)
  • install.sh, install.ps1 (install scripts)
  • compat.json (syncthing metadata)

Batch 4 (62 remaining packages) + Full Run

Hugo/Hugo-Extended Split

hugo-extended shares the same GitHub repo as hugo. Added asset_filter and asset_exclude conf keys to split them:

  • hugo/releases.conf: asset_exclude = extended (6,354 assets)
  • hugo-extended/releases.conf: asset_filter = extended (2,193 assets)

User direction: "hugo-extended should be a separate release. I believe the README covered this. I think it should have been the default."

Remaining Empty-Field Patterns (Per-Installer Territory)

These have empty OS or arch from the generic classifier and need per-installer config to resolve:

  • Git-for-Windows: Git-2.x.x-32-bit.tar.bz2 — no OS in filename, always Windows
  • CMake: HP-UX, IRIX targets — exotic/dead platforms
  • Dashcore: old naming conventions
  • Old PowerShell .msi files — no arch in filename
  • Bare binaries (ollama-darwin, caddy2_beta12_macos) — no arch info

Full Results

169,867 distributable rows across 116 packages. 3 packages produce 0 rows: serviceman, aliasman (source-only), duckdns.sh.

TODO

  • Consider whether bare binaries (no format extension) should get a format marker
  • Per-installer configs for packages with known-but-undetectable OS/arch
  • arm32 classification: leave to per-installer unless pattern emerges
  • arm32 is vague — may mean armv6 or armv7. Leave as per-installer responsibility unless a distinct pattern emerges (user direction 2026-03-10)