Files
vim-ale/CAT-RULES.md
AJ ONeal 28dab7dade feat: complete classification of all 116 packages (169,867 rows)
- Add asset_filter/asset_exclude conf keys for shared-repo packages
- Split hugo/hugo-extended: exclude/require "extended" in asset name
- Add macosx, ia32, .snap, .appx classifier patterns
- Fix zig Platform.Size JSON string type (was int64, upstream sends string)
- Filter install scripts, cosign keys, compat.json as meta-assets
- Add riscv64, loong64, armv5, mipsle, mips64le to buildmeta

Full classification produces 169,867 distributable rows across 116 packages.
2026-03-10 00:27:57 -06:00

131 lines
5.1 KiB
Markdown

# Classification Rules & Learnings
Tracking classifier decisions and edge cases discovered during batch processing.
## Batch 1 (go, node, hugo, caddy, pathman)
### Vocabulary
All values in the CSV use buildmeta canonical names:
- Arch: `x86_64` (not amd64), `aarch64` (not arm64), `x86` (not 386/i386)
- Format: `.tar.gz` (with leading dot), matching buildmeta.Format constants
- OS: `darwin` (not mac/macos), `dragonfly` (not dragonflybsd)
### Classifier Additions
From this batch, these patterns were added to the generic classifier:
**OS:**
- `mac` (word boundary) → darwin (caddy uses `mac_amd64`)
- `openbsd`, `netbsd`, `dragonfly(?:bsd)?`, `plan9` → new OS types
- `.deb`/`.rpm` → infer Linux when OS undetectable from filename
**Arch:**
- `386` (word boundary) → x86 (Go naming convention)
- `32bit`/`64bit` (no hyphen) → x86/x86_64 (Hugo naming)
- `arm7` → armv7 (old caddy naming)
- `armhf` → armv7 (Debian convention)
- `armv5` → new arch type
- `universal` → universal2 (Hugo fat binaries)
- `riscv64`, `loong64`, `mipsle`, `mips64le` → new arch types
### Per-Source Normalizers
Each upstream API uses different naming. Normalizers convert to buildmeta vocabulary:
| Source | "amd64" | "arm64" | "arm" |
|-----------|-------------|-------------|---------|
| Go API | amd64→x86_64 | arm64→aarch64 | - |
| Node dist | x64→x86_64 | arm64→aarch64 | - |
| Zig | x86_64 (same)| aarch64 (same)| - |
| HashiCorp | amd64→x86_64 | arm64→aarch64 | arm→armv6 |
| Julia | x86_64 (same)| aarch64 (same)| - |
| Chrome | x64→x86_64 | arm64→aarch64 | - |
| MariaDB | x86_64 (same)| aarch64 (same)| - |
### Meta-Asset Filtering
Skipped patterns: checksums (.sha256, .sha512, .md5, checksums.txt),
signatures (.sig, .asc, .pem), SBOMs (.sbom, .spdx, .sigstore),
source archives (`_src.tar.gz`), and `buildable-artifact`.
### Edge Cases (Accepted)
- Caddy v2 beta bare binaries (`caddy2_beta12_macos`) — no arch in filename, shows empty
- Hugo `macOS-all` — means universal but only 2 files, not worth special-casing
- Hugo extended editions — detected via `extended` in filename, tracked in `extra` column? (TODO: not yet)
- Node "odd major = beta" heuristic — v15, v17, v19, v21, v23 are "current" not LTS
- Go version prefix: stripped `go` from `go1.23.6``1.23.6` for clean parsing
## Batch 2 (zig, flutter, chromedriver, terraform, julia, iterm2, mariadb, gpg, serviceman, aliasman)
### Zig Fetcher Fix
The zig upstream API returns `"size"` as a JSON string, not a number.
Changed `Platform.Size` from `int64` to `json.Number` to avoid unmarshal failures.
Also changed `Platforms` tag from `json:"-"` to `json:"platforms,omitempty"` so
platform data is preserved in cache.
### Source-Only Packages
serviceman and aliasman have GitHub releases with empty `assets:[]`. These are
source-only repos that install via `go install` or script download, not binary
releases. The classifier correctly produces 0 distributables for them — they
don't belong in the binary CSV.
### Flutter Arch Detection
Early Flutter releases (pre-2020) had no arch-specific builds — single
platform SDK. No arch in filename → empty arch in CSV. This is correct;
the installer would default to x86_64 on supported platforms.
## Batch 3 (25 packages: arc through gitdeploy)
### New Classifier Patterns
- `macosx` → darwin (syncthing uses `macosx`)
- `ia32` → x86 (dart-sass uses `ia32`)
- `.snap` format → Linux-only
- `.appx` format added for PowerShell
### New Meta-Asset Filters
- `.pub` (cosign keys)
- `install.sh`, `install.ps1` (install scripts)
- `compat.json` (syncthing metadata)
## Batch 4 (62 remaining packages) + Full Run
### Hugo/Hugo-Extended Split
hugo-extended shares the same GitHub repo as hugo. Added `asset_filter` and
`asset_exclude` conf keys to split them:
- `hugo/releases.conf`: `asset_exclude = extended` (6,354 assets)
- `hugo-extended/releases.conf`: `asset_filter = extended` (2,193 assets)
User direction: "hugo-extended should be a separate release. I believe the
README covered this. I think it should have been the default."
### Remaining Empty-Field Patterns (Per-Installer Territory)
These have empty OS or arch from the generic classifier and need per-installer
config to resolve:
- Git-for-Windows: `Git-2.x.x-32-bit.tar.bz2` — no OS in filename, always Windows
- CMake: HP-UX, IRIX targets — exotic/dead platforms
- Dashcore: old naming conventions
- Old PowerShell `.msi` files — no arch in filename
- Bare binaries (ollama-darwin, caddy2_beta12_macos) — no arch info
### Full Results
169,867 distributable rows across 116 packages.
3 packages produce 0 rows: serviceman, aliasman (source-only), duckdns.sh.
### TODO
- Consider whether bare binaries (no format extension) should get a format marker
- Per-installer configs for packages with known-but-undetectable OS/arch
- `arm32` classification: leave to per-installer unless pattern emerges
- `arm32` is vague — may mean armv6 or armv7. Leave as per-installer responsibility
unless a distinct pattern emerges (user direction 2026-03-10)