feat(ops): implement surgical V2 garbage collection and VIP inheritance protocols

2026-07-28 09:32:20 +00:00 · 2026-05-17 19:24:49 +02:00
parent c0d50fb0ca
commit 0f1f86528c
2 changed files with 9 additions and 4 deletions
--- a/GEMINI.md
+++ b/GEMINI.md
@@ -23,7 +23,9 @@ This file contains the accumulated instructions and long-term vision for the aut
    - **Explicit Language Tagging**: All non-English resources in the V2 Portal MUST be explicitly tagged (e.g., `[SPANISH CONTENT]`, `[FRENCH CONTENT]`) at the end of the entry to inform global users before they navigate.
    - **English-First Exceptions**: Global software projects (even if created by Spanish speakers) that use English as their primary interface should be curated entirely in English. Native preservation is for localized content like blogs, videos, and guides.
 11. **Workflow-Config Synchronization**: The GitHub Actions curation workflow form (`agentic_cron.yml`) MUST remain perfectly synchronized with the curation sources configuration file (`data/curation_sources.yaml`). Any addition, removal, or renaming of topics/categories in the configuration file requires a corresponding update to the workflow's input fields (checkboxes) to ensure users can toggle those sources manually. This maintains consistency between data-driven sources and the UI trigger.
-12. **V2 Elite Maintenance**: The Nubenetes V2 (Agentic Elite) edition is a derived view of the V1 archive. It is managed via the `src/v2_optimizer.py` script and stored in the `v2-docs/` directory. The `agentic_v2_builder.yml` workflow synchronizes V2 automatically whenever V1 (`docs/`) is updated (manually or via bot). Standard curation and cleaning workflows must always target the `docs/` directory as the primary source of truth.
+12. **V2 Elite Maintenance**: The Nubenetes V2 (Agentic Elite) edition is a derived view of the V1 archive. It is managed via the `src/v2_optimizer.py` script and stored in the `v2-docs/` directory.
+    - **Surgical Cleanup**: The optimizer MUST perform surgical garbage collection in `v2-docs/` after each run, deleting only orphaned files that are no longer part of the current site architecture.
+    - **Synchronization**: V2 is updated automatically whenever V1 (`docs/`) changes. Standard curation always targets V1 as the source of truth.
 13. **Detailed Logging for V2**: When running the V2 Optimizer, agents MUST use unbuffered logging and detailed output messages. If the optimizer returns '0 links kept', the agent MUST investigate the logs to determine if it was due to AI selection or a parsing/API error.
 14. **Persistent V2 Caching**: The V2 Optimizer MUST use a persistent cache file (`data/centralized YAML inventory`) to store AI evaluations (year, quality, category). This is mandatory to minimize API costs and ensure execution speed across 15k+ links.
 15. **GitHub Metadata Enrichment**: For all `github.com` resources, the bot MUST attempt to fetch real-time metadata (stars, last commit) using the GitHub API. This data must be included in the V2 rendering to provide current context.
@@ -77,8 +79,9 @@ This file contains the accumulated instructions and long-term vision for the aut
    - **Resilient Fallback**: Automatically transition between models and API keys upon encountering 404 (Unsupported) or 429 (Rate Limit) errors.

 27. **Special Assets Management (V1 & V2)**: High-value files defined in [`data/special_assets.yaml`](data/special_assets.yaml) require specialized handling:
+    - **VIP Status Inheritance**: During project consolidation (semantic dedup), if any link instance originates from a Special Asset, the consolidated entry MUST inherit the protected `is_special` status.
    - **High-Precision Reorganization (V1)**: These files MUST use nested semantic grouping (## and ###) to organize links without ever deleting technically valid content.
-    - **Exhaustive Inclusion (V2)**: Unlike standard categories, V2 pages for Special Assets MUST include 100% of the ALIVE links from V1.
+    - **Exhaustive Inclusion (V2)**: Unlike standard categories, V2 pages for Special Assets MUST include 100% of the ALIVE links from V1, bypassing standard impact filters.
    - **AI Curation Discovery**: The discovery engine MUST actively search for new high-quality curation sources (e.g., "Awesome" repos) and suggest them for inclusion in `curation_sources.yaml`.
 28. **Sophisticated V2 Knowledge Architecture**: The V2 Portal MUST be structured like an advanced O'Reilly technical book:
    - **Deep Hierarchical Classification**: Resources MUST be organized using the `hierarchy` metadata field (a list of up to 10 strings: Area > Topic > Subtopics). This structure is mandatory for both V1 reorganization and V2 generation to ensure perfect consistency.
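
Note on the Surgical Cleanup rule added to mandate 12: the diff does not show how `src/v2_optimizer.py` implements it, so the following is only a minimal sketch of the idea, assuming the optimizer records every Markdown file it writes during a run. The function name `prune_orphans` and the `valid_files` parameter are hypothetical and do not come from the repository.

```python
from pathlib import Path
from typing import Iterable, List


def prune_orphans(output_dir: Path, valid_files: Iterable[Path]) -> List[Path]:
    """Delete Markdown files under output_dir that are not in valid_files.

    Hypothetical helper: valid_files is assumed to be the set of paths the
    current run generated. Everything else ending in .md is an orphan.
    """
    # Resolve both sides so relative and absolute paths compare equal.
    keep = {p.resolve() for p in valid_files}
    removed = []
    for md in sorted(output_dir.rglob("*.md")):
        if md.resolve() not in keep:
            md.unlink()          # remove only the orphaned file
            removed.append(md)
    return removed
```

Non-Markdown files and every tracked file are left alone, which is what separates this approach from wiping `v2-docs/` and regenerating it from scratch.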
--- a/README.md
+++ b/README.md
@@ -308,7 +308,8 @@ To maximize economic efficiency, all AI agents follow a **Database-First** appro

 ### 6.3. Database Lifecycle and Hygiene
 To maintain a high-performance "Single Source of Truth", Nubenetes implements automated hygiene protocols:
- **Incremental Self-Correction**: The engine autonomously identifies "suspicious" resources in the database (e.g., deep technical links that have defaulted to generic homepages or "About" sections). During standard maintenance runs, these links are prioritized for re-validation and the **Universal Rescue Protocol**, allowing the system to repair past precision errors incrementally without requiring a full `FORCE_FULL_CHECK`.
+- **Surgical Asset Pruning (V2)**: The V2 generation engine follows a "Zero-Zombie" policy. Instead of aggressive mass deletion, it tracks all valid dimension files and surgically prunes only the orphaned Markdown files in `v2-docs/` that are no longer part of the current architecture.
+- **Incremental Self-Correction**: The engine autonomously identifies "suspicious" resources in the database (e.g., deep technical links that have defaulted to generic homepages). During standard maintenance runs, these links are prioritized for re-validation and the **Universal Rescue Protocol**, allowing the system to repair past precision errors incrementally without requiring a full `FORCE_FULL_CHECK`.
 - **Physical File Synchronization**: During the health check cycle, the engine performs **surgical line-by-line updates** on the V1 Markdown files. Dead links are physically removed, and permanent redirections (301/302) are updated to their **Canonical URLs**, ensuring the repository remains clean and low-latency.
 - **Semantic Drift Detection**: Using **SHA256 Content Fingerprinting**, the system monitors for silent updates. If resource content changes significantly, it is flagged for AI re-evaluation to refresh its summary and impact score.
 - **GitHub Branch Auto-Heal**: If a deep link returns a 404, the engine automatically attempts to rescue it by migrating the path from `master` to `main`. Verified revivals are automatically updated in the V1 archive.
@@ -450,7 +451,8 @@ graph TD
 ```

 ### 7.6. Strategic Benefits
- **Incremental Self-Correction**: The engine proactively repairs historical precision errors (such as generic redirects) during standard maintenance cycles, ensuring the archive's quality improves over time without the need for exhaustive re-runs.
+- **VIP Status Inheritance**: During deep semantic deduplication, if ANY instance of a technical project originates from a **Special Asset** (VIP file), the consolidated authoritative entry inherits that protected status. This ensures that critical links from foundations or awesome lists are never filtered out by impact thresholds during project consolidation.
+- **Incremental Self-Correction**: The engine proactively repairs historical precision errors (such as generic redirects) during standard maintenance cycles, ensuring the archive's quality improves over time.
 - **Content-URL Precision Standard (Mandate 31)**: AI agents automatically detect **Generic Redirects** (e.g., deep technical links redirecting to home pages). For ALL resources, the system triggers a **Universal Rescue Protocol**, using Gemini to find the specific content's new location on the destination domain. Only if no technical equivalent is found is the link removed, ensuring technical coherence and zero misinformation across site migrations (e.g., Nginx to F5).
 - **Universal Title and TOC Standards (Mandate 30)**: All technical titles and indices are programmatically sanitized to remove emojis and ampersands, ensuring 100% robust internal Markdown links and cross-platform rendering stability.
 - **Platinum Lifecycle Management**: The system implements advanced data engineering fields including **SHA256 Content Fingerprinting** (to detect silent content drift), **Health Reliability Scoring** (0-100 EMA), and **Source Provenance Tracking**.
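
Note on VIP Status Inheritance: the rule can be illustrated with a small sketch. Only the `is_special` flag is named in the diff. The `LinkInstance` record, the `project_key` dedup key, the `impact` score, and the `passes_filter` threshold check are assumptions made for illustration and are not taken from the project's code.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class LinkInstance:
    url: str
    project_key: str          # hypothetical key identifying the same project
    impact: int               # hypothetical impact score
    is_special: bool = False  # True if the link came from a Special Asset


def consolidate(instances: Iterable[LinkInstance]) -> List[LinkInstance]:
    """Merge instances per project; the merged entry inherits is_special."""
    merged: Dict[str, LinkInstance] = {}
    for inst in instances:
        cur = merged.get(inst.project_key)
        if cur is None:
            merged[inst.project_key] = LinkInstance(
                inst.url, inst.project_key, inst.impact, inst.is_special
            )
            continue
        # Keep the highest-impact URL as the authoritative one.
        if inst.impact > cur.impact:
            cur.url, cur.impact = inst.url, inst.impact
        # Protected status is sticky: any special instance marks the entry.
        cur.is_special = cur.is_special or inst.is_special
    return list(merged.values())


def passes_filter(entry: LinkInstance, min_impact: int) -> bool:
    """Special entries bypass the impact threshold."""
    return entry.is_special or entry.impact >= min_impact
```

Combining the flag with a logical OR makes the result independent of the order in which instances are merged, so a project keeps its protection even when the authoritative URL comes from a non-special file.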