GitHub has two archive formats:
- legacy: codeload.github.com/.../legacy.tar.gz/... → Owner-Repo-Hash/
- current: github.com/.../archive/refs/tags/TAG.tar.gz → repo-version/
The API's tarball_url redirects to the legacy format. Node.js follows
this redirect. The current format is cleaner: predictable filenames
(repo-version.tar.gz), consistent directory names (repo-version/),
and standard github.com URLs.
Verified: aliasman-1.1.2.tar.gz extracts to aliasman-1.1.2/ which
matches the install script glob (mv ./*aliasman*/aliasman ...).
Use Owner-Repo-Tag naming (e.g. BeyondCodeBootcamp-aliasman-v1.1.2.tar.gz)
and direct codeload.github.com URLs instead of api.github.com tarball_url.
This matches the Node.js behavior for source-only packages (aliasman,
duckdns.sh, serviceman) where the extracted directory name matters for
install script globbing (mv ./*aliasman*/ ...).
Remaining difference: Node.js follows the redirect and takes the git
short-hash suffix (-0-g{hash}) from the Content-Disposition header. Go
uses the tag name directly. Both resolve to the same archive content.
- Add -src.{tar.gz,tar.xz,zip} pattern to isMetaAsset (alongside _src.)
- Set os=posix_2017, arch=* on source archives (no-binary-asset releases)
  instead of leaving them empty. These are shell scripts/vim plugins that
  work on any POSIX system.
- Remove "source" Extra tag from source archives (os/arch tells the story)
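The -src/_src rule from the first bullet could look roughly like the sketch below. hasSrcSuffix is a hypothetical helper, not the actual isMetaAsset, and the suffix-only, case-insensitive matching is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// srcExts lists the archive extensions named in the commit message.
var srcExts = []string{".tar.gz", ".tar.xz", ".zip"}

// hasSrcSuffix reports whether name ends in "-src" or "_src" followed by
// one of srcExts. Hypothetical helper; the real isMetaAsset may differ.
func hasSrcSuffix(name string) bool {
	lower := strings.ToLower(name)
	for _, ext := range srcExts {
		base, ok := strings.CutSuffix(lower, ext)
		if !ok {
			continue
		}
		if strings.HasSuffix(base, "-src") || strings.HasSuffix(base, "_src") {
			return true
		}
	}
	return false
}

func main() {
	// Illustrative filenames, not taken from a real release.
	names := []string{"tool-1.2.3-src.tar.gz", "tool_src.zip", "tool-linux-amd64.tar.gz"}
	for _, name := range names {
		fmt.Printf("%-26s meta=%v\n", name, hasSrcSuffix(name))
	}
}
```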
Add fetch + classify functions for all custom source types:
- chromedist (chromedriver): Chrome for Testing JSON index
- flutterdist (flutter): Google Storage per-OS release indexes
- golang (go): golang.org/dl JSON API
- gpgdist (gpg): SourceForge RSS scraping
- hashicorp (terraform): releases.hashicorp.com product index
- iterm2dist (iterm2): HTML scraping of downloads page
- juliadist (julia): S3 versions.json with platform files
- mariadbdist (mariadb): two-step REST API (majors → releases)
- zigdist (zig): mixed-schema JSON with platform keys
All 9 fetcher packages already existed in internal/releases/ but
were not wired into webicached's fetchRaw/classifyPackage switches.
Now all 103 packages produce classified cache output.
- cmd/comparecache: compares Go cache vs Node.js LIVE_cache at filename
  level, categorizes differences (meta-filtering, version depth, source
  tarballs, unsupported sources, real asset differences)
- COMPARISON.md: per-package checklist with 91 live packages categorized
- webicached: add -no-fetch flag to classify from existing raw data only
- GO_WEBI.md: update Phase 1 checkboxes for completed items
Combines fetch + classify + write into one pipeline:
1. Reads releases.conf to discover packages
2. Fetches raw upstream data to rawcache
3. Classifies assets (OS, arch, libc, format)
4. Applies config transforms (exclude, version prefix strip)
5. Writes to fsstore in Node.js-compatible _cache/ format
Supports github, nodedist, gittag, and gitea sources. Other sources
(golang, zigdist, flutter, etc.) are skipped with a log message —
they'll be added as needed.
Can run as a one-shot (-once) or periodic daemon (-interval 15m).
Replace conf.Get("key") and conf.Source() calls with direct struct
field access (conf.Owner, conf.Repo, conf.TagPrefix, conf.BaseURL,
conf.Source) and conf.Extra["key"] for non-standard keys.
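For illustration, direct field access might look like the sketch below. The field names come from the message above. The field types, the describe helper, and the "example_key" key are assumptions.

```go
package main

import "fmt"

// Conf mirrors the fields named in the commit message. Field types are
// assumptions; the real struct may carry more fields.
type Conf struct {
	Source    string
	Owner     string
	Repo      string
	TagPrefix string
	BaseURL   string
	Extra     map[string]string // non-standard keys
}

// describe is a hypothetical consumer that reads fields directly instead
// of going through a string-keyed getter.
func describe(conf Conf) string {
	s := fmt.Sprintf("%s:%s/%s", conf.Source, conf.Owner, conf.Repo)
	if conf.TagPrefix != "" {
		s += " tag_prefix=" + conf.TagPrefix
	}
	// "example_key" is a made-up non-standard key. Reading from a nil
	// map is safe in Go and simply reports ok == false.
	if v, ok := conf.Extra["example_key"]; ok {
		s += " example_key=" + v
	}
	return s
}

func main() {
	conf := Conf{
		Source: "github",
		Owner:  "BeyondCodeBootcamp",
		Repo:   "aliasman",
		Extra:  map[string]string{"example_key": "value"},
	}
	fmt.Println(describe(conf))
}
```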
Add cmd/uaparse — analyzes User-Agent strings from webi.sh logs,
deduplicates by (os, arch, libc), extracts platform hints (cloud
provider, container runtime, distro), and flags malformed UAs.
Fix uadetect issues discovered by running against 2,186 live UAs:
- Msys/MINGW/Cygwin now correctly detected as Windows (was Linux)
- FreeBSD detection added
- s390x and riscv64 arch detection added
- WSL libc no longer falsely detected as MSVC ("microsoft" in kernel
  version string was triggering the MSVC check)
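A minimal sketch of the ordering behind the first and last fixes: Windows POSIX layers are matched before the generic Linux check, and MSVC is derived from the detected OS rather than from a "microsoft" substring. detectOS and isMSVC are hypothetical names, and the sample strings are illustrative rather than real log entries.

```go
package main

import (
	"fmt"
	"strings"
)

// detectOS guesses the OS from a User-Agent string. Hypothetical sketch,
// not the real uadetect code.
func detectOS(ua string) string {
	l := strings.ToLower(ua)
	switch {
	case strings.Contains(l, "msys"),
		strings.Contains(l, "mingw"),
		strings.Contains(l, "cygwin"):
		return "windows"
	case strings.Contains(l, "freebsd"):
		return "freebsd"
	case strings.Contains(l, "darwin"):
		return "darwin"
	case strings.Contains(l, "linux"):
		return "linux"
	}
	return ""
}

// isMSVC keys off the detected OS (an assumption of this sketch), so a
// WSL kernel string containing "microsoft" is not classified as MSVC.
func isMSVC(ua string) bool {
	return detectOS(ua) == "windows"
}

func main() {
	samples := []string{
		"MINGW64_NT-10.0-19045 x86_64",
		"Linux 5.15.90.1-microsoft-standard-WSL2 x86_64",
		"FreeBSD 13.2-RELEASE amd64",
	}
	for _, ua := range samples {
		fmt.Printf("%-48s os=%-8s msvc=%v\n", ua, detectOS(ua), isMSVC(ua))
	}
}
```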
.tgz is a legitimate archive format (used by ollama darwin releases).
Remove it from the meta-asset filter and add a .tgz → .tar.gz mapping
in detectFormat.
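The .tgz mapping could be expressed as a suffix table, as in the sketch below. It reuses the detectFormat name, but the table, the return values, and the lookup order are assumptions rather than the actual implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// formats maps filename suffixes to a canonical format. Both ".tgz" and
// ".tar.gz" resolve to ".tar.gz". The list is illustrative, not complete.
var formats = []struct{ suffix, format string }{
	{".tar.gz", ".tar.gz"},
	{".tgz", ".tar.gz"},
	{".tar.xz", ".tar.xz"},
	{".zip", ".zip"},
}

// detectFormat returns the canonical format for name, or "" if unknown.
func detectFormat(name string) string {
	lower := strings.ToLower(name)
	for _, f := range formats {
		if strings.HasSuffix(lower, f.suffix) {
			return f.format
		}
	}
	return ""
}

func main() {
	// Illustrative filenames.
	for _, name := range []string{"ollama-darwin.tgz", "tool.tar.gz", "notes.txt"} {
		fmt.Printf("%-20s format=%q\n", name, detectFormat(name))
	}
}
```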
internal/resolve: picks the best release for a platform query.
Handles arch compatibility fallbacks (Rosetta 2, Windows ARM64
emulation, amd64 micro-arch levels), format preferences, variant
filtering (prefers base over rocm/jetpack GPU variants), and
universal (arch-less) binaries.
cmd/e2etest: fetches releases for goreleaser, ollama, and node,
classifies them, resolves for 9 test queries across linux/darwin/
windows x86_64/arm64, then compares against the live webi.sh API.
Results: 8/9 exact match, 1 warn where the Go resolver is more
correct than the live API (ollama arm64 base vs jetpack variant).
Edge cases fixed during development:
- .tgz is a valid archive format (not npm metadata)
- Empty arch in filename = universal binary (ranked below native)
- GPU variants (rocm, jetpack) ranked below base binaries
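The ranking rules above can be sketched as a fallback chain over architectures with base builds preferred inside each step. This is not the internal/resolve code: Asset, archFallbacks, and pick are hypothetical, and placing universal builds ahead of emulated ones is an assumption (the message only says universal ranks below native).

```go
package main

import "fmt"

// Asset is a simplified classified release file (hypothetical type).
type Asset struct {
	Name    string
	OS      string
	Arch    string // "" means universal (arch-less)
	Variant string // "" means base build; e.g. "rocm", "jetpack"
}

// archFallbacks lists acceptable arches in preference order: native,
// then universal, then x86_64 under emulation (Rosetta 2 on darwin,
// ARM64 emulation on windows).
func archFallbacks(goos, arch string) []string {
	chain := []string{arch, ""}
	if arch == "aarch64" && (goos == "darwin" || goos == "windows") {
		chain = append(chain, "x86_64")
	}
	return chain
}

// pick returns the first asset matching the chain, preferring a base
// build over a GPU variant within the same arch.
func pick(assets []Asset, goos, arch string) (Asset, bool) {
	for _, want := range archFallbacks(goos, arch) {
		var variant *Asset
		for i := range assets {
			a := &assets[i]
			if a.OS != goos || a.Arch != want {
				continue
			}
			if a.Variant == "" {
				return *a, true
			}
			if variant == nil {
				variant = a
			}
		}
		if variant != nil {
			return *variant, true
		}
	}
	return Asset{}, false
}

func main() {
	// Illustrative asset names.
	assets := []Asset{
		{Name: "app-linux-arm64-jetpack6.tgz", OS: "linux", Arch: "aarch64", Variant: "jetpack"},
		{Name: "app-linux-arm64.tgz", OS: "linux", Arch: "aarch64"},
		{Name: "app-darwin-amd64.tgz", OS: "darwin", Arch: "x86_64"},
	}
	if a, ok := pick(assets, "linux", "aarch64"); ok {
		fmt.Println("linux/aarch64 ->", a.Name)
	}
	if a, ok := pick(assets, "darwin", "aarch64"); ok {
		fmt.Println("darwin/aarch64 ->", a.Name)
	}
}
```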
When a GitHub release has no binary assets, fall back to tarball_url and
zipball_url. These are source distributions (platform-independent), marked
with extra=source.
- serviceman: 12 distributables (6 releases × tar.gz + zip)
- aliasman: 8 distributables (4 releases × tar.gz + zip)
- duckdns.sh: 6 distributables (3 releases × tar.gz + zip)
Total: 170,213 rows across 116 packages (no more zeros).
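A sketch of the source-archive fallback, assuming a simplified release shape. tarball_url and zipball_url are fields of the GitHub releases API; the Release and Dist types, the distributables helper, and the example URLs are hypothetical.

```go
package main

import "fmt"

// Release holds the subset of a GitHub release used here. TarballURL and
// ZipballURL correspond to the API's tarball_url and zipball_url.
type Release struct {
	TagName    string
	TarballURL string
	ZipballURL string
	Assets     []string // binary asset URLs, simplified to strings
}

// Dist is one downloadable file (hypothetical type).
type Dist struct {
	URL    string
	Format string
	Extra  string
}

// distributables returns the binary assets when there are any. Otherwise
// it falls back to the two source archives, marked extra=source.
func distributables(r Release) []Dist {
	if len(r.Assets) > 0 {
		out := make([]Dist, 0, len(r.Assets))
		for _, u := range r.Assets {
			out = append(out, Dist{URL: u})
		}
		return out
	}
	return []Dist{
		{URL: r.TarballURL, Format: ".tar.gz", Extra: "source"},
		{URL: r.ZipballURL, Format: ".zip", Extra: "source"},
	}
}

func main() {
	r := Release{
		TagName:    "v1.0.0",
		TarballURL: "https://example.invalid/tarball/v1.0.0", // placeholder URL
		ZipballURL: "https://example.invalid/zipball/v1.0.0", // placeholder URL
	}
	for _, d := range distributables(r) {
		fmt.Printf("%s %s extra=%s\n", d.Format, d.URL, d.Extra)
	}
}
```

Each release without binary assets yields two rows, which lines up with the per-package counts above (for example, 6 releases × 2 formats = 12).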
- .app.zip and .dmg formats now imply darwin when the filename names no OS
- Filter .tgz (npm packages) and .d.ts (TypeScript defs) as meta-assets
- Reduces bun false positives by 64, deno by 294
- Add cmd/classify: reads raw cached releases and produces a CSV of all
  distributables with sortable version columns (ver_major/minor/patch/pre)
- Export rawcache.ActivePath() for use by cmd/classify
- Add OS detection: openbsd, netbsd, dragonflybsd, plan9, mac→darwin
- Add arch detection: armv5, armhf→armv7, arm7→armv7, 386→x86,
  32bit/64bit (no hyphen), universal→universal2, riscv64, loong64,
  mipsle, mips64le
- Infer Linux from .deb/.rpm format when OS not in filename
- Add .deb and .rpm as recognized formats
- Normalize all per-source values to buildmeta vocabulary (x86_64, aarch64)
- Filter source archives and buildable-artifact meta-assets
- Add CAT-RULES.md tracking classifier learnings
- Add CATEGORIZED.md and LINKS.md for reference
Batch 1 tested: go, node, hugo, caddy, pathman (35,919 rows)
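The alias rules listed above amount to lookup tables plus a format-based OS fallback. The sketch below uses only mappings named in this message, except amd64→x86_64 and arm64→aarch64, which are assumed from the buildmeta vocabulary. The function names are hypothetical.

```go
package main

import "fmt"

// archAliases maps detected spellings to canonical names.
var archAliases = map[string]string{
	"armhf":     "armv7",
	"arm7":      "armv7",
	"386":       "x86",
	"universal": "universal2",
	"amd64":     "x86_64",  // assumption: implied by the buildmeta vocabulary
	"arm64":     "aarch64", // assumption: implied by the buildmeta vocabulary
}

// osAliases maps detected OS spellings to canonical names.
var osAliases = map[string]string{
	"mac": "darwin",
}

// normalizeArch returns the canonical arch, or the input if no alias exists.
func normalizeArch(arch string) string {
	if canon, ok := archAliases[arch]; ok {
		return canon
	}
	return arch
}

// normalizeOS returns the canonical OS. When the filename named no OS,
// .deb and .rpm packages are taken to imply linux.
func normalizeOS(os, format string) string {
	if os == "" && (format == ".deb" || format == ".rpm") {
		return "linux"
	}
	if canon, ok := osAliases[os]; ok {
		return canon
	}
	return os
}

func main() {
	fmt.Println(normalizeArch("armhf"), normalizeArch("386"), normalizeArch("riscv64"))
	fmt.Println(normalizeOS("mac", ".zip"), normalizeOS("", ".deb"))
}
```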
New fetcher packages:
- chromedist: Chrome for Testing API (googlechromelabs.github.io)
- gpgdist: SourceForge RSS for GPG macOS
- mariadbdist: MariaDB downloads REST API
New releases.conf files for:
- GitHub: aliasman, awless, duckdns.sh, hugo-extended, kubens, rg, postgres
- gittag: vim-commentary, vim-zig
- gitea: pathman
- chromedist: chromedriver
- gpgdist: gpg
- mariadbdist: mariadb
- nodedist: node
Alias support (alias_of key):
- golang → go, dashd → dashcore, psql → postgres, zig.vim → vim-zig
- Aliases skip fetching and share cache with their target
Every package with a releases.js now has a releases.conf (except the
dead macos package). fetchraw dispatches to all 13 source types.
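Alias handling can be pictured as a name lookup performed before fetching, so that an alias reads its target's cache. The sketch below assumes the alias_of values have already been loaded into a map. resolveAlias and the cycle guard are hypothetical.

```go
package main

import "fmt"

// resolveAlias follows alias → target links until it reaches a name that
// is not an alias. The seen set guards against accidental cycles.
func resolveAlias(aliases map[string]string, name string) string {
	seen := map[string]bool{}
	for {
		target, ok := aliases[name]
		if !ok || seen[name] {
			return name
		}
		seen[name] = true
		name = target
	}
}

func main() {
	// The alias pairs listed in the commit message.
	aliases := map[string]string{
		"golang":  "go",
		"dashd":   "dashcore",
		"psql":    "postgres",
		"zig.vim": "vim-zig",
	}
	for _, name := range []string{"golang", "psql", "go"} {
		fmt.Printf("%s -> %s\n", name, resolveAlias(aliases, name))
	}
}
```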
New fetcher packages:
- golang: golang.org/dl/?mode=json&include=all
- zigdist: ziglang.org/download/index.json
- flutterdist: Google Storage per-OS release indexes
- iterm2dist: scrapes iterm2.com/downloads.html
- hashicorp: releases.hashicorp.com/{product}/index.json
- juliadist: julialang-s3.julialang.org/bin/versions.json
Each follows the same iter.Seq2 pattern as the existing nodedist/github
fetchers. Added releases.conf files for all six packages and wired them
into cmd/fetchraw.
Fixed latest-version detection for sources that return unordered data
(hashicorp, zigdist, juliadist) by comparing all versions with lexver
instead of taking the first stable one found.
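The fix boils down to a max-scan over every version instead of an early return. The sketch below uses a simple dotted-number comparison as a stand-in for lexver, whose API is not shown here, and treats any version containing "-" as a pre-release. Both are assumptions.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parse splits "v1.10.2" into [1 10 2]. Non-numeric parts count as 0 in
// this sketch.
func parse(v string) []int {
	parts := strings.Split(strings.TrimPrefix(v, "v"), ".")
	nums := make([]int, len(parts))
	for i, p := range parts {
		n, _ := strconv.Atoi(p)
		nums[i] = n
	}
	return nums
}

// less reports whether version a sorts before version b numerically.
func less(a, b string) bool {
	pa, pb := parse(a), parse(b)
	for i := 0; i < len(pa) || i < len(pb); i++ {
		var x, y int
		if i < len(pa) {
			x = pa[i]
		}
		if i < len(pb) {
			y = pb[i]
		}
		if x != y {
			return x < y
		}
	}
	return false
}

// latestStable scans every version rather than returning the first
// stable one, so unordered upstream data still yields the newest.
func latestStable(versions []string) string {
	latest := ""
	for _, v := range versions {
		if strings.Contains(v, "-") {
			continue // treated as a pre-release in this sketch
		}
		if latest == "" || less(latest, v) {
			latest = v
		}
	}
	return latest
}

func main() {
	fmt.Println(latestStable([]string{"1.9.0", "1.10.2", "1.11.0-rc1", "1.2.0"}))
}
```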
The discover() function now skips directories starting with _ (like
_example, _webi, _common) so infrastructure dirs aren't treated as
packages to fetch.
Discovers packages by globbing {confDir}/*/releases.conf. Adding a
new package is now just creating a conf file — no Go code changes.
Dispatches to the right fetcher based on source= (github, nodedist).
- rawcache: add Merge() that skips unchanged releases, logs added/
  changed events to an append-only JSONL audit log with SHA-256
- rawcache: drop .json extension from filenames — raw cache stores
  opaque bytes (upstream may be JSON, CSV, XML, or bespoke)
- fetchraw: add all 68 GitHub packages, use Merge instead of Put
- fetchraw: log format shows +added ~changed =skipped
Use Put to write directly into the active slot instead of BeginRefresh. Existing
releases are skipped (Has check), new ones are added, _latest is
only updated if the candidate is newer. Safe to run repeatedly —
backports and delayed releases accumulate without losing history.
Fetches complete release histories from upstream APIs and stores
them in rawcache. Supports GitHub (with pagination, auth, monorepo
tag prefix filtering) and Node.js dist API (official + unofficial
as separate caches to avoid version collisions).
Tested with: node-official (834), node-unofficial (387),
hugo (365), caddy (134), monorel (3).