Compare commits


12 Commits
v5.0.1 ... main

Author SHA1 Message Date
Paige Patton
71bd34b020 adding better logging for when scenario file can't be found (#1203)
Signed-off-by: Paige Patton <prubenda@redhat.com>
2026-03-27 13:47:49 -04:00
Paige Patton
6da7c9dec6 adding governance template from cncf (#926)
Signed-off-by: Paige Patton <prubenda@redhat.com>
2026-03-27 09:33:00 -04:00
Tullio Sebastiani
4d5aea146d Run method fixes (#1202)
* kubevirt plugin fixes

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* managed_cluster plugin fixes

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* unit tests fix

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

---------

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2026-03-27 14:31:19 +01:00
Yashasvi Yadav
62f500fb2e feat: add GCP zone outage rollback support (#1200)
Add rollback functionality for GCP zone outage scenarios following the
established rollback pattern (Service Hijacking, PVC, Syn Flood).

- Add @set_rollback_context_decorator to run()
- Set rollback callable before stopping nodes with base64/JSON encoded data
- Add rollback_gcp_zone_outage() static method with per-node error handling
- Fix missing poll_interval argument in starmap calls
- Add unit tests for rollback and run methods

Closes #915

Signed-off-by: YASHASVIYADAV30 <yashasviydv30@gmail.com>
Co-authored-by: Paige Patton <64206430+paigerube14@users.noreply.github.com>
2026-03-26 14:42:45 -04:00
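The base64/JSON encoding pattern this commit describes can be sketched roughly as follows. The helper names here are illustrative, not actual krkn functions; in the real scenario the encoded string is stored via `RollbackContent(resource_identifier=...)` and decoded inside `rollback_gcp_zone_outage()`:

```python
import base64
import json


def encode_rollback_data(nodes, timeout, kube_check):
    """Pack the node list and settings into a base64-wrapped JSON string."""
    payload = {"nodes": nodes, "timeout": timeout, "kube_check": kube_check}
    return base64.b64encode(json.dumps(payload).encode("utf-8")).decode("utf-8")


def decode_rollback_data(encoded):
    """Recover the rollback payload from its base64/JSON form."""
    return json.loads(base64.b64decode(encoded.encode("utf-8")).decode("utf-8"))


encoded = encode_rollback_data(["node-a", "node-b"], 180, True)
print(decode_rollback_data(encoded)["nodes"])  # ['node-a', 'node-b']
```

Wrapping the JSON in base64 keeps the payload a single opaque string, which is convenient when the rollback context only carries one `resource_identifier` field.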
Arpit Raj
ec241d35d6 fix: improve logging reliability and code quality (#1199)
- Fix typo 'wating' -> 'waiting' in scenario wait log message
- Replace print() with logging.debug() for pod metrics in prometheus client
- Replace star import with explicit imports in utils/__init__.py
- Remove unnecessary global declaration in main()
- Log VM status exceptions at ERROR level with exception details

Include unit tests in tests/test_logging_and_code_quality.py covering all fixes.

Signed-off-by: 1PoPTRoN <vrxn.arp1traj@gmail.com>
Co-authored-by: Paige Patton <64206430+paigerube14@users.noreply.github.com>
2026-03-26 13:08:56 -04:00
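The `print()`-to-`logging.debug()` change replaces eager string concatenation with lazy `%`-style formatting that respects log levels and handlers. A minimal sketch (the named logger and buffer are only for demonstration; the prometheus client uses the module-level `logging.debug`):

```python
import io
import logging

# Route records to a buffer so the change is observable.
buffer = io.StringIO()
logger = logging.getLogger("prometheus_client_demo")
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler(buffer))

metric = {"pod_name": "etcd-0", "timestamp": "2026-03-26"}

# Before: print('adding pod' + str(metric)) -- always formats eagerly and
# writes to stdout, bypassing log levels and handlers.
# After: lazy %-style formatting -- the message string is only built if a
# handler actually emits the record.
logger.debug("adding pod %s", metric)

print(buffer.getvalue().strip())
```

Besides respecting the configured level, the lazy form avoids paying the `str(metric)` cost when DEBUG output is disabled.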
Arpit Raj
59e10d5a99 fix: bind exception variable in except handlers to prevent NameError (#1198)
Signed-off-by: 1PoPTRoN <vrxn.arp1traj@gmail.com>
Co-authored-by: Paige Patton <64206430+paigerube14@users.noreply.github.com>
2026-03-26 09:43:37 -04:00
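The fix is the standard Python 3 `as e` binding: without it, referencing `e` inside the handler raises `NameError` and masks the original failure. A minimal sketch mirroring the handler shape from this PR (the tuple `(RuntimeError, Exception)` is redundant since `Exception` covers `RuntimeError`, but it matches the code being fixed):

```python
def run_scenario():
    """Stand-in for a scenario step that fails; the name is illustrative."""
    raise RuntimeError("boom")


try:
    run_scenario()
except (RuntimeError, Exception) as e:  # `as e` binds the exception for use below
    # Without the `as e` binding, this line would raise NameError.
    message = f"scenario failed with exception: {e}"

print(message)  # scenario failed with exception: boom
```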
Paige Patton
c8aa959df2 controller -> detailed (#1201)
Signed-off-by: Paige Patton <prubenda@redhat.com>
2026-03-26 08:47:06 -04:00
Paige Patton
3db5e1abbe no rebuild image (#1197)
Signed-off-by: Paige Patton <prubenda@redhat.com>
2026-03-20 12:54:45 -04:00
Paige Patton
1e699c6cc9 different quay users (#1196)
Signed-off-by: Paige Patton <prubenda@redhat.com>
2026-03-20 17:30:42 +01:00
Paige Patton
0ebda3e101 test multi platform (#1194)
Signed-off-by: Paige Patton <prubenda@redhat.com>
2026-03-20 11:09:33 -04:00
Tullio Sebastiani
8a5be0dd2f Resiliency Score krknctl compatibility fixes (#1195)
* added console log of the resiliency score when mode is "detailed"

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* base image krknctl input

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

resiliency score flag

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* removed json print in run_krkn.py

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* unit test fix

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

---------

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2026-03-20 11:09:07 -04:00
Tullio Sebastiani
62dadfe25c Resiliency Score krknctl compatibility fixes (#1195)
* added console log of the resiliency score when mode is "detailed"

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* base image krknctl input

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

resiliency score flag

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* removed json print in run_krkn.py

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* unit test fix

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

---------

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2026-03-20 11:08:56 -04:00
22 changed files with 1092 additions and 148 deletions

View File

@@ -6,48 +6,117 @@ on:
jobs:
build:
-runs-on: ubuntu-latest
+runs-on: ${{ matrix.runner }}
strategy:
matrix:
include:
- platform: amd64
runner: ubuntu-latest
- platform: arm64
runner: ubuntu-24.04-arm
steps:
- name: Check out code
uses: actions/checkout@v3
- name: Build the Docker images
if: startsWith(github.ref, 'refs/tags')
run: |
./containers/compile_dockerfile.sh
docker build --no-cache -t quay.io/krkn-chaos/krkn containers/ --build-arg TAG=${GITHUB_REF#refs/tags/}
docker tag quay.io/krkn-chaos/krkn quay.io/redhat-chaos/krkn
docker tag quay.io/krkn-chaos/krkn quay.io/krkn-chaos/krkn:${GITHUB_REF#refs/tags/}
docker tag quay.io/krkn-chaos/krkn quay.io/redhat-chaos/krkn:${GITHUB_REF#refs/tags/}
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Test Build the Docker images
-if: ${{ github.event_name == 'pull_request' }}
+if: github.event_name == 'pull_request'
run: |
./containers/compile_dockerfile.sh
docker build --no-cache -t quay.io/krkn-chaos/krkn containers/ --build-arg PR_NUMBER=${{ github.event.pull_request.number }}
- name: Login in quay
docker buildx build --no-cache \
--platform linux/${{ matrix.platform }} \
-t quay.io/krkn-chaos/krkn \
-t quay.io/redhat-chaos/krkn \
containers/ \
--build-arg PR_NUMBER=${{ github.event.pull_request.number }}
- name: Login to krkn-chaos quay
if: startsWith(github.ref, 'refs/tags')
run: docker login quay.io -u ${QUAY_USER} -p ${QUAY_TOKEN}
env:
QUAY_USER: ${{ secrets.QUAY_USERNAME }}
QUAY_TOKEN: ${{ secrets.QUAY_PASSWORD }}
- name: Push the KrknChaos Docker images
uses: docker/login-action@v3
with:
registry: quay.io
username: ${{ secrets.QUAY_USERNAME }}
password: ${{ secrets.QUAY_PASSWORD }}
- name: Build and push krkn-chaos images
if: startsWith(github.ref, 'refs/tags')
run: |
docker push quay.io/krkn-chaos/krkn
docker push quay.io/krkn-chaos/krkn:${GITHUB_REF#refs/tags/}
- name: Login in to redhat-chaos quay
if: startsWith(github.ref, 'refs/tags/v')
run: docker login quay.io -u ${QUAY_USER} -p ${QUAY_TOKEN}
env:
QUAY_USER: ${{ secrets.QUAY_USER_1 }}
QUAY_TOKEN: ${{ secrets.QUAY_TOKEN_1 }}
- name: Push the RedHat Chaos Docker images
./containers/compile_dockerfile.sh
TAG=${GITHUB_REF#refs/tags/}
docker buildx build --no-cache \
--platform linux/${{ matrix.platform }} \
--provenance=false \
-t quay.io/krkn-chaos/krkn:latest-${{ matrix.platform }} \
-t quay.io/krkn-chaos/krkn:${TAG}-${{ matrix.platform }} \
containers/ \
--build-arg TAG=${TAG} \
--push --load
- name: Login to redhat-chaos quay
if: startsWith(github.ref, 'refs/tags')
run: |
docker push quay.io/redhat-chaos/krkn
docker push quay.io/redhat-chaos/krkn:${GITHUB_REF#refs/tags/}
uses: docker/login-action@v3
with:
registry: quay.io
username: ${{ secrets.QUAY_USER_1 }}
password: ${{ secrets.QUAY_TOKEN_1 }}
- name: Push redhat-chaos images
if: startsWith(github.ref, 'refs/tags')
run: |
TAG=${GITHUB_REF#refs/tags/}
docker tag quay.io/krkn-chaos/krkn:${TAG}-${{ matrix.platform }} quay.io/redhat-chaos/krkn:${TAG}-${{ matrix.platform }}
docker tag quay.io/krkn-chaos/krkn:${TAG}-${{ matrix.platform }} quay.io/redhat-chaos/krkn:latest-${{ matrix.platform }}
docker push quay.io/redhat-chaos/krkn:${TAG}-${{ matrix.platform }}
docker push quay.io/redhat-chaos/krkn:latest-${{ matrix.platform }}
manifest:
runs-on: ubuntu-latest
needs: build
if: startsWith(github.ref, 'refs/tags')
steps:
- name: Login to krkn-chaos quay
uses: docker/login-action@v3
with:
registry: quay.io
username: ${{ secrets.QUAY_USERNAME }}
password: ${{ secrets.QUAY_PASSWORD }}
- name: Create and push KrknChaos manifests
run: |
TAG=${GITHUB_REF#refs/tags/}
docker manifest create quay.io/krkn-chaos/krkn:${TAG} \
quay.io/krkn-chaos/krkn:${TAG}-amd64 \
quay.io/krkn-chaos/krkn:${TAG}-arm64
docker manifest push quay.io/krkn-chaos/krkn:${TAG}
docker manifest create quay.io/krkn-chaos/krkn:latest \
quay.io/krkn-chaos/krkn:latest-amd64 \
quay.io/krkn-chaos/krkn:latest-arm64
docker manifest push quay.io/krkn-chaos/krkn:latest
- name: Login to redhat-chaos quay
uses: docker/login-action@v3
with:
registry: quay.io
username: ${{ secrets.QUAY_USER_1 }}
password: ${{ secrets.QUAY_TOKEN_1 }}
- name: Create and push RedHat Chaos manifests
run: |
TAG=${GITHUB_REF#refs/tags/}
docker manifest create quay.io/redhat-chaos/krkn:${TAG} \
quay.io/redhat-chaos/krkn:${TAG}-amd64 \
quay.io/redhat-chaos/krkn:${TAG}-arm64
docker manifest push quay.io/redhat-chaos/krkn:${TAG}
docker manifest create quay.io/redhat-chaos/krkn:latest \
quay.io/redhat-chaos/krkn:latest-amd64 \
quay.io/redhat-chaos/krkn:latest-arm64
docker manifest push quay.io/redhat-chaos/krkn:latest
- name: Rebuild krkn-hub
if: startsWith(github.ref, 'refs/tags')
uses: redhat-chaos/actions/krkn-hub@main
with:
QUAY_USER: ${{ secrets.QUAY_USERNAME }}

View File

@@ -1,83 +1,148 @@
# Krkn Project Governance
Krkn is a chaos and resiliency testing tool for Kubernetes that injects deliberate failures into clusters to validate their resilience under turbulent conditions. This governance document explains how the project is run.
- [Values](#values)
- [Community Roles](#community-roles)
- [Becoming a Maintainer](#becoming-a-maintainer)
- [Removing a Maintainer](#removing-a-maintainer)
- [Meetings](#meetings)
- [CNCF Resources](#cncf-resources)
- [Code of Conduct](#code-of-conduct)
- [Security Response Team](#security-response-team)
- [Voting](#voting)
- [Modifying this Charter](#modifying-this-charter)
The governance model adopted here is heavily influenced by a set of CNCF projects, especially drew
reference from [Kubernetes governance](https://github.com/kubernetes/community/blob/master/governance.md).
*For similar structures some of the same wordings from kubernetes governance are borrowed to adhere
to the originally construed meaning.*
## Values
## Principles
Krkn and its leadership embrace the following values:
- **Open**: Krkn is open source community.
- **Welcoming and respectful**: See [Code of Conduct](https://github.com/cncf/foundation/blob/master/code-of-conduct.md).
- **Transparent and accessible**: Work and collaboration should be done in public.
Changes to the Krkn organization, Krkn code repositories, and CNCF related activities (e.g.
level, involvement, etc) are done in public.
- **Merit**: Ideas and contributions are accepted according to their technical merit
and alignment with project objectives, scope and design principles.
* **Openness**: Communication and decision-making happens in the open and is discoverable for future reference. As much as possible, all discussions and work take place in public forums and open repositories.
* **Fairness**: All stakeholders have the opportunity to provide feedback and submit contributions, which will be considered on their merits.
* **Community over Product or Company**: Sustaining and growing our community takes priority over shipping code or sponsors' organizational goals. Each contributor participates in the project as an individual.
* **Inclusivity**: We innovate through different perspectives and skill sets, which can only be accomplished in a welcoming and respectful environment.
* **Participation**: Responsibilities within the project are earned through participation, and there is a clear path up the contributor ladder into leadership positions.
## Community Roles
Krkn uses a tiered contributor model. Each level comes with increasing responsibilities and privileges.
### Contributor
Anyone can become a contributor by participating in discussions, reporting bugs, or submitting code or documentation.
**Responsibilities:**
- Adhere to the [Code of Conduct](CODE_OF_CONDUCT.md)
- Report bugs and suggest new features
- Contribute high-quality code and documentation
### Member
Members are active contributors who have demonstrated a solid understanding of the project's codebase and conventions.
**Responsibilities:**
- Review pull requests for correctness, quality, and adherence to project standards
- Provide constructive and timely feedback to contributors
- Ensure contributions are well-tested and documented
- Work with maintainers to support a smooth release process
### Maintainer
Maintainers are responsible for the overall health and direction of the project. They have write access to the [project GitHub repository](https://github.com/krkn-chaos/krkn) and can merge patches from themselves or others. The current maintainers are listed in [MAINTAINERS.md](./MAINTAINERS.md).
Maintainers collectively form the **Maintainer Council**, the governing body for the project.
A maintainer is not just someone who can make changes — they are someone who has demonstrated the ability to collaborate with the team, get the right people to review code and docs, contribute high-quality work, and follow through to fix issues.
**Responsibilities:**
- Set the technical direction and vision for the project
- Manage releases and ensure stability of the main branch
- Make decisions on feature inclusion and project priorities
- Mentor contributors and help grow the community
- Resolve disputes and make final decisions when consensus cannot be reached
### Owner
Owners have administrative access to the project and are the final decision-makers.
**Responsibilities:**
- Manage the core team of maintainers
- Set the overall vision and strategy for the project
- Handle administrative tasks such as managing the repository and other resources
- Represent the project in the broader open-source community
## Becoming a Maintainer
To become a Maintainer you need to demonstrate the following:
- **Commitment to the project:**
- Participate in discussions, contributions, code and documentation reviews for 3 months or more
- Perform reviews for at least 5 non-trivial pull requests
- Contribute at least 3 non-trivial pull requests that have been merged
- Ability to write quality code and/or documentation
- Ability to collaborate effectively with the team
- Understanding of how the team works (policies, processes for testing and code review, etc.)
- Understanding of the project's codebase and coding and documentation style
A new Maintainer must be proposed by an existing Maintainer by sending a message to the [maintainer mailing list](mailto:krkn.maintainers@gmail.com). A simple majority vote of existing Maintainers approves the application. Nominations will be evaluated without prejudice to employer or demographics.
Maintainers who are approved will be granted the necessary GitHub rights and invited to the [maintainer mailing list](mailto:krkn.maintainers@gmail.com).
## Removing a Maintainer
Maintainers may resign at any time if they feel they will not be able to continue fulfilling their project duties.
Maintainers may also be removed for inactivity, failure to fulfill their responsibilities, violating the Code of Conduct, or other reasons. Inactivity is defined as a period of very low or no activity in the project for a year or more, with no definite schedule to return to full Maintainer activity.
A Maintainer may be removed at any time by a 2/3 vote of the remaining Maintainers.
Depending on the reason for removal, a Maintainer may be converted to **Emeritus** status. Emeritus Maintainers will still be consulted on some project matters and can be rapidly returned to Maintainer status if their availability changes.
## Meetings
Maintainers are expected to participate in the public developer meeting, which occurs **once a month via Zoom**. Meeting details (link, agenda, and notes) are posted in the [#krkn channel on Kubernetes Slack](https://kubernetes.slack.com/messages/C05SFMHRWK1) prior to each meeting.
Maintainers will also hold closed meetings to discuss security reports or Code of Conduct violations. Such meetings should be scheduled by any Maintainer on receipt of a security issue or CoC report. All current Maintainers must be invited to such closed meetings, except for any Maintainer who is accused of a CoC violation.
## CNCF Resources
Any Maintainer may suggest a request for CNCF resources, either on the [mailing list](mailto:krkn.maintainers@gmail.com) or during a monthly meeting. A simple majority of Maintainers approves the request. The Maintainers may also choose to delegate working with the CNCF to non-Maintainer community members, who will then be added to the [CNCF's Maintainer List](https://github.com/cncf/foundation/blob/main/project-maintainers.csv) for that purpose.
## Code of Conduct
Krkn follows the [CNCF Code of Conduct](https://github.com/cncf/foundation/blob/master/code-of-conduct.md).
Here is an excerpt:
> As contributors and maintainers of this project, and in the interest of fostering an open and welcoming community, we pledge to respect all people who contribute through reporting issues, posting feature requests, updating documentation, submitting pull requests or patches, and other activities.
> As contributors and maintainers of this project, and in the interest of fostering an open and welcoming community, we pledge to respect all people who contribute through reporting issues, posting feature requests, updating documentation, submitting pull requests or patches, and other activities.
## Maintainer Levels
Code of Conduct violations by community members will be discussed and resolved on the [private maintainer mailing list](mailto:krkn.maintainers@gmail.com). If a Maintainer is directly involved in the report, two Maintainers will instead be designated to work with the CNCF Code of Conduct Committee in resolving it.
### Contributor
Contributors contribute to the community. Anyone can become a contributor by participating in discussions, reporting bugs, or contributing code or documentation.
## Security Response Team
#### Responsibilities:
The Maintainers will appoint a Security Response Team to handle security reports. This committee may consist of the Maintainer Council itself. If this responsibility is delegated, the Maintainers will appoint a team of at least two contributors to handle it. The Maintainers will review the composition of this team at least once a year.
Be active in the community and adhere to the Code of Conduct.
The Security Response Team is responsible for handling all reports of security holes and breaches according to the [security policy](SECURITY.md).
Report bugs and suggest new features.
To report a security vulnerability, please follow the process outlined in [SECURITY.md](SECURITY.md) rather than filing a public GitHub issue.
Contribute high-quality code and documentation.
## Voting
While most business in Krkn is conducted by "[lazy consensus](https://community.apache.org/committers/lazyConsensus.html)", periodically the Maintainers may need to vote on specific actions or changes. Any Maintainer may demand a vote be taken.
### Member
Members are active contributors to the community. Members have demonstrated a strong understanding of the project's codebase and conventions.
Votes on general project matters may be raised on the [maintainer mailing list](mailto:krkn.maintainers@gmail.com) or during a monthly meeting. Votes on security vulnerabilities or Code of Conduct violations must be conducted exclusively on the [private maintainer mailing list](mailto:krkn.maintainers@gmail.com) or in a closed Maintainer meeting, in order to prevent accidental public disclosure of sensitive information.
#### Responsibilities:
Most votes require a **simple majority** of all Maintainers to succeed, except where otherwise noted. Two-thirds majority votes mean at least two-thirds of all existing Maintainers.
Review pull requests for correctness, quality, and adherence to project standards.
| Action | Required Vote |
|--------|--------------|
| Adding a new Maintainer | Simple majority |
| Removing a Maintainer | 2/3 majority |
| Approving CNCF resource requests | Simple majority |
| Modifying this charter | 2/3 majority |
Provide constructive and timely feedback to contributors.
## Modifying this Charter
Ensure that all contributions are well-tested and documented.
Work with maintainers to ensure a smooth and efficient release process.
### Maintainer
Maintainers are responsible for the overall health and direction of the project. They are long-standing contributors who have shown a deep commitment to the project's success.
#### Responsibilities:
Set the technical direction and vision for the project.
Manage releases and ensure the stability of the main branch.
Make decisions on feature inclusion and project priorities.
Mentor other contributors and help grow the community.
Resolve disputes and make final decisions when consensus cannot be reached.
### Owner
Owners have administrative access to the project and are the final decision-makers.
#### Responsibilities:
Manage the core team of maintainers and approvers.
Set the overall vision and strategy for the project.
Handle administrative tasks, such as managing the project's repository and other resources.
Represent the project in the broader open-source community.
# Credits
Sections of this document have been borrowed from [Kubernetes governance](https://github.com/kubernetes/community/blob/master/governance.md)
Changes to this Governance document and its supporting documents may be approved by a 2/3 vote of the Maintainers.

View File

@@ -15,7 +15,7 @@ For detailed description of the roles, see [Governance](./GOVERNANCE.md) page.
| Pradeep Surisetty | [psuriset](https://github.com/psuriset) | psuriset@redhat.com | Owner |
| Paige Patton | [paigerube14](https://github.com/paigerube14) | prubenda@redhat.com | Maintainer |
| Tullio Sebastiani | [tsebastiani](https://github.com/tsebastiani) | tsebasti@redhat.com | Maintainer |
| Yogananth Subramanian | [yogananth-subramanian](https://github.com/yogananth-subramanian) | ysubrama@redhat.com |Maintainer |
| Yogananth Subramanian | [yogananth-subramanian](https://github.com/yogananth-subramanian) | ysubrama@redhat.com | Maintainer |
| Sahil Shah | [shahsahil264](https://github.com/shahsahil264) | sahshah@redhat.com | Member |
@@ -32,3 +32,64 @@ The roles are:
* Maintainer: A contributor who is responsible for the overall health and direction of the project.
* Owner: A contributor who has administrative ownership of the project.
## Maintainer Levels
### Contributor
Contributors contribute to the community. Anyone can become a contributor by participating in discussions, reporting bugs, or contributing code or documentation.
#### Responsibilities:
Be active in the community and adhere to the Code of Conduct.
Report bugs and suggest new features.
Contribute high-quality code and documentation.
### Member
Members are active contributors to the community. Members have demonstrated a strong understanding of the project's codebase and conventions.
#### Responsibilities:
Review pull requests for correctness, quality, and adherence to project standards.
Provide constructive and timely feedback to contributors.
Ensure that all contributions are well-tested and documented.
Work with maintainers to ensure a smooth and efficient release process.
### Maintainer
Maintainers are responsible for the overall health and direction of the project. They are long-standing contributors who have shown a deep commitment to the project's success.
#### Responsibilities:
Set the technical direction and vision for the project.
Manage releases and ensure the stability of the main branch.
Make decisions on feature inclusion and project priorities.
Mentor other contributors and help grow the community.
Resolve disputes and make final decisions when consensus cannot be reached.
### Owner
Owners have administrative access to the project and are the final decision-makers.
#### Responsibilities:
Manage the core team of maintainers and approvers.
Set the overall vision and strategy for the project.
Handle administrative tasks, such as managing the project's repository and other resources.
Represent the project in the broader open-source community.
## Email
If you'd like to contact the krkn maintainers about a specific issue you're having, please reach out to us at krkn.maintainers@gmail.com.

View File

@@ -56,7 +56,7 @@ kraken:
- scenarios/kubevirt/kubevirt-vm-outage.yaml
resiliency:
-resiliency_run_mode: standalone # Options: standalone, controller, disabled
+resiliency_run_mode: standalone # Options: standalone, detailed, disabled
resiliency_file: config/alerts.yaml # Path to SLO definitions, will resolve to performance_monitoring: alert_profile: if not specified
cerberus:

View File

@@ -558,5 +558,31 @@
"separator": ",",
"default": "False",
"required": "false"
},
{
"name": "resiliency-score",
"short_description": "Enable resiliency score calculation",
"description": "The system outputs a detailed resiliency score as a single-line JSON object, facilitating easy aggregation across multiple test scenarios.",
"variable": "RESILIENCY_SCORE",
"type": "boolean",
"required": "false"
},
{
"name": "disable-resiliency-score",
"short_description": "Disable resiliency score calculation",
"description": "Disable resiliency score calculation",
"variable": "DISABLE_RESILIENCY_SCORE",
"type": "boolean",
"required": "false"
},
{
"name": "resiliency-file",
"short_description": "Resiliency Score metrics file",
"description": "Custom Resiliency score file",
"variable": "RESILIENCY_FILE",
"type": "file",
"required": "false",
"mount_path": "/home/krkn/resiliency-file.yaml"
}
]

View File

@@ -251,7 +251,7 @@ def metrics(
for k,v in pod.items():
metric[k] = v
metric['timestamp'] = str(datetime.datetime.now())
-print('adding pod' + str(metric))
+logging.debug("adding pod %s", metric)
metrics_list.append(metric.copy())
for affected_node in scenario["affected_nodes"]:
metric_name = "affected_nodes_recovery"

View File

@@ -306,7 +306,7 @@ class Resiliency:
prom_cli: Pre-configured KrknPrometheus instance.
total_start_time: Start time for the full test window.
total_end_time: End time for the full test window.
-run_mode: "controller" or "standalone" mode.
+run_mode: "detailed" or "standalone" mode.
Returns:
(detailed_report)
@@ -320,7 +320,7 @@ class Resiliency:
)
detailed = self.get_detailed_report()
-if run_mode == "controller":
+if run_mode == "detailed":
# krknctl expects the detailed report on stdout in a special format
try:
detailed_json = json.dumps(detailed)
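The "special format" krknctl expects is simply the detailed report serialized as a single-line JSON object on stdout, so it can be grepped and aggregated across runs. A minimal sketch (the report keys here are hypothetical; the real structure comes from `get_detailed_report()`):

```python
import json

# Hypothetical detailed report; the actual keys are produced by
# Resiliency.get_detailed_report() and are not shown in this diff.
detailed = {"resiliency_score": 0.92, "scenarios_run": 3}

detailed_json = json.dumps(detailed)  # compact, single-line serialization
print(detailed_json)
```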

View File

@@ -1,4 +1,5 @@
import logging
import os
import time
from abc import ABC, abstractmethod
from krkn_lib.models.telemetry import ScenarioTelemetry
@@ -86,6 +87,16 @@ class AbstractScenarioPlugin(ABC):
scenario_telemetry.scenario = scenario_config
scenario_telemetry.scenario_type = self.get_scenario_types()[0]
scenario_telemetry.start_timestamp = time.time()
if not os.path.exists(scenario_config):
logging.error(
f"scenario file not found: '{scenario_config}' -- "
f"check that the path is correct relative to the working directory: {os.getcwd()}"
)
failed_scenarios.append(scenario_config)
scenario_telemetry.exit_status = 1
scenario_telemetry.end_timestamp = time.time()
scenario_telemetries.append(scenario_telemetry)
continue
parsed_scenario_config = telemetry.set_parameters_base64(
scenario_telemetry, scenario_config
)
@@ -147,7 +158,7 @@ class AbstractScenarioPlugin(ABC):
failed_scenarios.append(scenario_config)
scenario_telemetries.append(scenario_telemetry)
cerberus.publish_kraken_status(start_time,end_time)
-logging.info(f"wating {wait_duration} before running the next scenario")
+logging.info(f"waiting {wait_duration} before running the next scenario")
time.sleep(wait_duration)
return failed_scenarios, scenario_telemetries

View File

@@ -1,8 +1,7 @@
import logging
import time
-from typing import Dict, Any, Optional
+from typing import Dict, Any
import random
import re
import yaml
from kubernetes.client.rest import ApiException
from krkn_lib.k8s import KrknKubernetes
@@ -35,7 +34,6 @@ class KubevirtVmOutageScenarioPlugin(AbstractScenarioPlugin):
self,
run_uuid: str,
scenario: str,
-krkn_config: dict[str, any],
lib_telemetry: KrknTelemetryOpenshift,
scenario_telemetry: ScenarioTelemetry,
) -> int:
@@ -60,7 +58,7 @@ class KubevirtVmOutageScenarioPlugin(AbstractScenarioPlugin):
return 0
except Exception as e:
logging.error(f"KubeVirt VM Outage scenario failed: {e}")
-log_exception(e)
+log_exception(str(e))
return 1
def init_clients(self, k8s_client: KrknKubernetes):
@@ -143,7 +141,7 @@ class KubevirtVmOutageScenarioPlugin(AbstractScenarioPlugin):
except Exception as e:
logging.error(f"Error executing KubeVirt VM outage scenario: {e}")
-log_exception(e)
+log_exception(str(e))
return self.pods_status
def validate_environment(self, vm_name: str, namespace: str) -> bool:
@@ -243,7 +241,7 @@ class KubevirtVmOutageScenarioPlugin(AbstractScenarioPlugin):
except Exception as e:
logging.error(f"Error deleting VMI {vm_name}: {e}")
-log_exception(e)
+log_exception(str(e))
self.pods_status.unrecovered.append(self.affected_pod)
return 1
@@ -304,7 +302,7 @@ class KubevirtVmOutageScenarioPlugin(AbstractScenarioPlugin):
except Exception as e:
logging.error(f"Error recreating VMI {vm_name}: {e}")
-log_exception(e)
+log_exception(str(e))
return 1
else:
logging.error(f"Failed to recover VMI {vm_name}: No original state captured and auto-recovery did not occur")
@@ -312,5 +310,5 @@ class KubevirtVmOutageScenarioPlugin(AbstractScenarioPlugin):
except Exception as e:
logging.error(f"Unexpected error recovering VMI {vm_name}: {e}")
-log_exception(e)
+log_exception(str(e))
return 1

View File

@@ -1,5 +1,4 @@
import logging
import time
import yaml
from krkn_lib.k8s import KrknKubernetes
@@ -28,7 +27,6 @@ class ManagedClusterScenarioPlugin(AbstractScenarioPlugin):
)
if managedcluster_scenario["actions"]:
for action in managedcluster_scenario["actions"]:
-start_time = int(time.time())
try:
self.inject_managedcluster_scenario(
action,
@@ -44,6 +42,7 @@ class ManagedClusterScenarioPlugin(AbstractScenarioPlugin):
return 1
else:
return 0
+return 0
def inject_managedcluster_scenario(
self,

View File

@@ -36,7 +36,7 @@ class TimeActionsScenarioPlugin(AbstractScenarioPlugin):
)
if len(not_reset) > 0:
logging.info("Object times were not reset")
-except (RuntimeError, Exception):
+except (RuntimeError, Exception) as e:
logging.error(
f"TimeActionsScenarioPlugin scenario {scenario} failed with exception: {e}"
)

View File

@@ -1,3 +1,5 @@
+import base64
+import json
import logging
import time
@@ -13,11 +15,15 @@ from krkn_lib.telemetry.ocp import KrknTelemetryOpenshift
from krkn.scenario_plugins.abstract_scenario_plugin import AbstractScenarioPlugin
from krkn_lib.utils import get_yaml_item_value
from krkn.rollback.config import RollbackContent
from krkn.rollback.handler import set_rollback_context_decorator
from krkn.scenario_plugins.node_actions.aws_node_scenarios import AWS
from krkn.scenario_plugins.node_actions.gcp_node_scenarios import gcp_node_scenarios
class ZoneOutageScenarioPlugin(AbstractScenarioPlugin):
@set_rollback_context_decorator
def run(
self,
run_uuid: str,
@@ -40,7 +46,9 @@ class ZoneOutageScenarioPlugin(AbstractScenarioPlugin):
if cloud_type.lower() == "gcp":
affected_nodes_status = AffectedNodeStatus()
self.cloud_object = gcp_node_scenarios(kubecli, kube_check, affected_nodes_status)
-self.node_based_zone(scenario_config, kubecli)
+result = self.node_based_zone(scenario_config, kubecli)
+if result != 0:
+return result
affected_nodes_status = self.cloud_object.affected_nodes_status
scenario_telemetry.affected_nodes.extend(affected_nodes_status.affected_nodes)
else:
@@ -57,22 +65,37 @@ class ZoneOutageScenarioPlugin(AbstractScenarioPlugin):
return 1
else:
return 0
-def node_based_zone(self, scenario_config: dict[str, any], kubecli: KrknKubernetes ):
+def node_based_zone(self, scenario_config: dict[str, any], kubecli: KrknKubernetes):
zone = scenario_config["zone"]
duration = get_yaml_item_value(scenario_config, "duration", 60)
timeout = get_yaml_item_value(scenario_config, "timeout", 180)
kube_check = get_yaml_item_value(scenario_config, "kube_check", True)
label_selector = f"topology.kubernetes.io/zone={zone}"
try:
try:
# get list of nodes in zone/region
nodes = kubecli.list_killable_nodes(label_selector)
# stop nodes in parallel
pool = ThreadPool(processes=len(nodes))
pool.starmap(
self.cloud_object.node_stop_scenario,zip(repeat(1), nodes, repeat(timeout))
# set rollback callable before stopping nodes
rollback_data = {
"nodes": nodes,
"timeout": timeout,
"kube_check": kube_check,
}
encoded = base64.b64encode(
json.dumps(rollback_data).encode("utf-8")
).decode("utf-8")
self.rollback_handler.set_rollback_callable(
self.rollback_gcp_zone_outage,
RollbackContent(resource_identifier=encoded),
)
# stop nodes in parallel
pool = ThreadPool(processes=len(nodes))
pool.starmap(
self.cloud_object.node_stop_scenario,
zip(repeat(1), nodes, repeat(timeout), repeat(None)),
)
pool.close()
logging.info(
@@ -80,10 +103,11 @@ class ZoneOutageScenarioPlugin(AbstractScenarioPlugin):
)
time.sleep(duration)
# start nodes in parallel
# start nodes in parallel
pool = ThreadPool(processes=len(nodes))
pool.starmap(
self.cloud_object.node_start_scenario,zip(repeat(1), nodes, repeat(timeout))
self.cloud_object.node_start_scenario,
zip(repeat(1), nodes, repeat(timeout), repeat(None)),
)
pool.close()
except Exception as e:
@@ -94,6 +118,58 @@ class ZoneOutageScenarioPlugin(AbstractScenarioPlugin):
else:
return 0
@staticmethod
def rollback_gcp_zone_outage(
rollback_content: RollbackContent,
lib_telemetry: KrknTelemetryOpenshift,
):
"""Rollback function to restart stopped nodes after a GCP zone outage
scenario failure.
:param rollback_content: Rollback content containing encoded node
list and config.
:param lib_telemetry: Instance of KrknTelemetryOpenshift for
Kubernetes operations.
"""
try:
import json
import base64
from krkn_lib.models.k8s import AffectedNodeStatus
from krkn.scenario_plugins.node_actions.gcp_node_scenarios import (
gcp_node_scenarios,
)
decoded = base64.b64decode(
rollback_content.resource_identifier.encode("utf-8")
).decode("utf-8")
rollback_data = json.loads(decoded)
nodes = rollback_data["nodes"]
timeout = rollback_data["timeout"]
kube_check = rollback_data["kube_check"]
kubecli = lib_telemetry.get_lib_kubernetes()
affected_nodes_status = AffectedNodeStatus()
cloud_object = gcp_node_scenarios(
kubecli, kube_check, affected_nodes_status
)
logging.info(
"Rolling back GCP zone outage: starting %d stopped nodes"
% len(nodes)
)
for node in nodes:
try:
cloud_object.node_start_scenario(1, node, timeout, None)
except Exception as node_error:
logging.error(
"Failed to start node %s during rollback: %s"
% (node, node_error)
)
logging.info("GCP zone outage rollback completed.")
except Exception as e:
logging.error("Failed to rollback GCP zone outage: %s" % e)
raise
def network_based_zone(self, scenario_config: dict[str, any]):
vpc_id = scenario_config["vpc_id"]
@@ -118,12 +194,12 @@ class ZoneOutageScenarioPlugin(AbstractScenarioPlugin):
"Network association ids associated with "
"the subnet %s: %s" % (subnet_id, network_association_ids)
)
# Use provided default ACL if available, otherwise create a new one
if default_acl_id:
acl_id = default_acl_id
logging.info(
"Using provided default ACL ID %s - this ACL will not be deleted after the scenario",
"Using provided default ACL ID %s - this ACL will not be deleted after the scenario",
default_acl_id
)
# Don't add to acl_ids_created since we don't want to delete user-provided ACLs at cleanup
@@ -160,6 +236,5 @@ class ZoneOutageScenarioPlugin(AbstractScenarioPlugin):
for acl_id in acl_ids_created:
self.cloud_object.delete_network_acl(acl_id)
def get_scenario_types(self) -> list[str]:
return ["zone_outages_scenarios"]
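The rollback payload in the plugin above travels through `RollbackContent.resource_identifier` as base64-encoded JSON. A minimal sketch of that round trip, independent of krkn (the `encode_rollback`/`decode_rollback` names are illustrative, not part of the codebase):

```python
import base64
import json

def encode_rollback(nodes, timeout, kube_check):
    # Serialize the rollback parameters to JSON, then base64-encode so the
    # payload fits in a single opaque string field.
    data = {"nodes": nodes, "timeout": timeout, "kube_check": kube_check}
    return base64.b64encode(json.dumps(data).encode("utf-8")).decode("utf-8")

def decode_rollback(encoded):
    # Reverse the encoding: base64 -> JSON -> dict.
    return json.loads(base64.b64decode(encoded.encode("utf-8")).decode("utf-8"))

payload = encode_rollback(["node-1", "node-2"], 180, True)
restored = decode_rollback(payload)
```

A malformed identifier typically fails either at base64 decoding (`binascii.Error`) or at JSON parsing (`json.JSONDecodeError`), which is why the plugin's rollback wraps decoding in a try/except, logs, and re-raises.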

View File

@@ -171,7 +171,7 @@ class VirtChecker:
if new_node_name and vm.node_name != new_node_name:
vm.node_name = new_node_name
except Exception:
logging.info('Exception in get vm status')
logging.exception("Exception in get vm status")
vm_status = False
if vm.vm_name not in virt_check_tracker:
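The change above matters because `logging.exception()` logs at ERROR level and automatically appends the traceback of the exception currently being handled, while `logging.info()` records neither. A self-contained sketch capturing the difference in an in-memory stream (requires Python 3.8+ for `force=True`):

```python
import io
import logging

# Capture root-logger output in memory to inspect it.
stream = io.StringIO()
logging.basicConfig(stream=stream, level=logging.DEBUG, force=True)

try:
    raise RuntimeError("vmi not reachable")
except Exception:
    # Inside an except block, logging.exception() behaves like
    # logging.error() plus the formatted traceback.
    logging.exception("Exception in get vm status")

output = stream.getvalue()
```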

View File

@@ -1,4 +1,10 @@
from .TeeLogHandler import TeeLogHandler
from .ErrorLog import ErrorLog
from .ErrorCollectionHandler import ErrorCollectionHandler
from .functions import *
from .functions import (
populate_cluster_events,
collect_and_put_ocp_logs,
KrknKubernetes,
ScenarioTelemetry,
KrknTelemetryOpenshift
)

View File

@@ -65,8 +65,6 @@ def main(options, command: Optional[str]) -> int:
if os.path.isfile(cfg):
with open(cfg, "r") as f:
config = yaml.full_load(f)
global kubeconfig_path, wait_duration, kraken_config
kubeconfig_path = os.path.expanduser(
get_yaml_item_value(config["kraken"], "kubeconfig_path", "")
)
@@ -95,7 +93,7 @@ def main(options, command: Optional[str]) -> int:
run_signal = get_yaml_item_value(config["kraken"], "signal_state", "RUN")
resiliency_config = get_yaml_item_value(config,"resiliency",{})
# Determine execution mode (standalone, controller, or disabled)
# Determine execution mode (standalone, detailed, or disabled)
run_mode = get_yaml_item_value(resiliency_config, "resiliency_run_mode", "standalone")
valid_run_modes = {"standalone", "detailed", "disabled"}
if run_mode not in valid_run_modes:
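The run-mode check above validates a config value against a closed set after reading it with a default. A standalone sketch of the pattern; `get_item_value` only approximates krkn_lib's `get_yaml_item_value`, and since the hunk does not show what krkn does for an invalid mode, this sketch raises for illustration:

```python
def get_item_value(cfg, key, default):
    # Approximation of get_yaml_item_value: return the key's value when
    # present and non-empty, otherwise fall back to the default.
    value = cfg.get(key, default) if isinstance(cfg, dict) else default
    return value if value not in (None, "") else default

def resolve_run_mode(resiliency_config):
    run_mode = get_item_value(resiliency_config, "resiliency_run_mode", "standalone")
    valid_run_modes = {"standalone", "detailed", "disabled"}
    if run_mode not in valid_run_modes:
        # Reject unknown modes rather than guessing at intent.
        raise ValueError(f"invalid resiliency_run_mode: {run_mode!r}")
    return run_mode
```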

View File

@@ -1,4 +1,4 @@
duration: 60
duration: 10
workers: '' # leave it empty '' node cpu auto-detection
hog-type: cpu
image: quay.io/krkn-chaos/krkn-hog

View File

@@ -66,9 +66,10 @@ class TestAbstractScenarioPluginCerberusIntegration(unittest.TestCase):
@patch('krkn.scenario_plugins.abstract_scenario_plugin.cleanup_rollback_version_files')
@patch('krkn.scenario_plugins.abstract_scenario_plugin.utils.collect_and_put_ocp_logs')
@patch('krkn.scenario_plugins.abstract_scenario_plugin.signal_handler.signal_context')
@patch('krkn.scenario_plugins.abstract_scenario_plugin.os.path.exists', return_value=True)
@patch('time.sleep')
def test_cerberus_publish_called_after_successful_scenario(
self, mock_sleep, mock_signal_ctx, mock_collect_logs, mock_cleanup, mock_cerberus_publish
self, mock_sleep, mock_exists, mock_signal_ctx, mock_collect_logs, mock_cleanup, mock_cerberus_publish
):
"""Test that cerberus.publish_kraken_status is called after a successful scenario"""
mock_signal_ctx.return_value.__enter__ = Mock()
@@ -97,9 +98,10 @@ class TestAbstractScenarioPluginCerberusIntegration(unittest.TestCase):
@patch('krkn.scenario_plugins.abstract_scenario_plugin.execute_rollback_version_files')
@patch('krkn.scenario_plugins.abstract_scenario_plugin.utils.collect_and_put_ocp_logs')
@patch('krkn.scenario_plugins.abstract_scenario_plugin.signal_handler.signal_context')
@patch('krkn.scenario_plugins.abstract_scenario_plugin.os.path.exists', return_value=True)
@patch('time.sleep')
def test_cerberus_publish_called_after_failed_scenario(
self, mock_sleep, mock_signal_ctx, mock_collect_logs, mock_rollback, mock_cerberus_publish
self, mock_sleep, mock_exists, mock_signal_ctx, mock_collect_logs, mock_rollback, mock_cerberus_publish
):
"""Test that cerberus.publish_kraken_status is called even after a failed scenario"""
mock_signal_ctx.return_value.__enter__ = Mock()
@@ -122,9 +124,10 @@ class TestAbstractScenarioPluginCerberusIntegration(unittest.TestCase):
@patch('krkn.scenario_plugins.abstract_scenario_plugin.cleanup_rollback_version_files')
@patch('krkn.scenario_plugins.abstract_scenario_plugin.utils.collect_and_put_ocp_logs')
@patch('krkn.scenario_plugins.abstract_scenario_plugin.signal_handler.signal_context')
@patch('krkn.scenario_plugins.abstract_scenario_plugin.os.path.exists', return_value=True)
@patch('time.sleep')
def test_cerberus_publish_called_for_multiple_scenarios(
self, mock_sleep, mock_signal_ctx, mock_collect_logs, mock_cleanup, mock_cerberus_publish
self, mock_sleep, mock_exists, mock_signal_ctx, mock_collect_logs, mock_cleanup, mock_cerberus_publish
):
"""Test that cerberus.publish_kraken_status is called for each scenario"""
mock_signal_ctx.return_value.__enter__ = Mock()
@@ -148,10 +151,11 @@ class TestAbstractScenarioPluginCerberusIntegration(unittest.TestCase):
@patch('krkn.scenario_plugins.abstract_scenario_plugin.execute_rollback_version_files')
@patch('krkn.scenario_plugins.abstract_scenario_plugin.utils.collect_and_put_ocp_logs')
@patch('krkn.scenario_plugins.abstract_scenario_plugin.signal_handler.signal_context')
@patch('krkn.scenario_plugins.abstract_scenario_plugin.os.path.exists', return_value=True)
@patch('time.sleep')
@patch('time.time')
def test_cerberus_publish_timing(
self, mock_time, mock_sleep, mock_signal_ctx, mock_collect_logs,
self, mock_time, mock_sleep, mock_exists, mock_signal_ctx, mock_collect_logs,
mock_rollback, mock_cleanup, mock_cerberus_publish
):
"""Test that cerberus.publish_kraken_status receives correct timestamps"""
@@ -181,9 +185,10 @@ class TestAbstractScenarioPluginCerberusIntegration(unittest.TestCase):
@patch('krkn.scenario_plugins.abstract_scenario_plugin.cleanup_rollback_version_files')
@patch('krkn.scenario_plugins.abstract_scenario_plugin.utils.collect_and_put_ocp_logs')
@patch('krkn.scenario_plugins.abstract_scenario_plugin.signal_handler.signal_context')
@patch('krkn.scenario_plugins.abstract_scenario_plugin.os.path.exists', return_value=True)
@patch('time.sleep')
def test_cerberus_publish_exception_does_not_break_flow(
self, mock_sleep, mock_signal_ctx, mock_collect_logs, mock_cleanup, mock_cerberus_publish
self, mock_sleep, mock_exists, mock_signal_ctx, mock_collect_logs, mock_cleanup, mock_cerberus_publish
):
"""Test that exceptions in cerberus.publish_kraken_status don't break scenario execution"""
mock_signal_ctx.return_value.__enter__ = Mock()
@@ -210,9 +215,10 @@ class TestAbstractScenarioPluginCerberusIntegration(unittest.TestCase):
@patch('krkn.scenario_plugins.abstract_scenario_plugin.execute_rollback_version_files')
@patch('krkn.scenario_plugins.abstract_scenario_plugin.utils.collect_and_put_ocp_logs')
@patch('krkn.scenario_plugins.abstract_scenario_plugin.signal_handler.signal_context')
@patch('krkn.scenario_plugins.abstract_scenario_plugin.os.path.exists', return_value=True)
@patch('time.sleep')
def test_cerberus_publish_called_for_mixed_success_and_failure(
self, mock_sleep, mock_signal_ctx, mock_collect_logs, mock_rollback,
self, mock_sleep, mock_exists, mock_signal_ctx, mock_collect_logs, mock_rollback,
mock_cleanup, mock_cerberus_publish
):
"""Test cerberus publish is called for both successful and failed scenarios"""
@@ -250,9 +256,10 @@ class TestAbstractScenarioPluginCerberusIntegration(unittest.TestCase):
@patch('krkn.scenario_plugins.abstract_scenario_plugin.cerberus.publish_kraken_status')
@patch('krkn.scenario_plugins.abstract_scenario_plugin.utils.collect_and_put_ocp_logs')
@patch('krkn.scenario_plugins.abstract_scenario_plugin.signal_handler.signal_context')
@patch('krkn.scenario_plugins.abstract_scenario_plugin.os.path.exists', return_value=True)
@patch('time.sleep')
def test_cerberus_not_called_for_deprecated_post_scenarios(
self, mock_sleep, mock_signal_ctx, mock_collect_logs, mock_cerberus_publish
self, mock_sleep, mock_exists, mock_signal_ctx, mock_collect_logs, mock_cerberus_publish
):
"""Test that cerberus is not called for deprecated post scenarios (list format)"""
mock_signal_ctx.return_value.__enter__ = Mock()
@@ -277,9 +284,10 @@ class TestAbstractScenarioPluginCerberusIntegration(unittest.TestCase):
@patch('krkn.scenario_plugins.abstract_scenario_plugin.utils.collect_and_put_ocp_logs')
@patch('krkn.scenario_plugins.abstract_scenario_plugin.utils.populate_cluster_events')
@patch('krkn.scenario_plugins.abstract_scenario_plugin.signal_handler.signal_context')
@patch('krkn.scenario_plugins.abstract_scenario_plugin.os.path.exists', return_value=True)
@patch('time.sleep')
def test_cerberus_called_with_events_backup_enabled(
self, mock_sleep, mock_signal_ctx, mock_populate_events,
self, mock_sleep, mock_exists, mock_signal_ctx, mock_populate_events,
mock_collect_logs, mock_cleanup, mock_cerberus_publish
):
"""Test that cerberus is called even when events_backup is enabled"""
@@ -308,9 +316,10 @@ class TestAbstractScenarioPluginCerberusIntegration(unittest.TestCase):
@patch('krkn.scenario_plugins.abstract_scenario_plugin.execute_rollback_version_files')
@patch('krkn.scenario_plugins.abstract_scenario_plugin.utils.collect_and_put_ocp_logs')
@patch('krkn.scenario_plugins.abstract_scenario_plugin.signal_handler.signal_context')
@patch('krkn.scenario_plugins.abstract_scenario_plugin.os.path.exists', return_value=True)
@patch('time.sleep')
def test_cerberus_called_after_exception_in_run(
self, mock_sleep, mock_signal_ctx, mock_collect_logs,
self, mock_sleep, mock_exists, mock_signal_ctx, mock_collect_logs,
mock_rollback, mock_cerberus_publish
):
"""Test that cerberus is called even if run() raises an uncaught exception"""
@@ -345,5 +354,73 @@ class TestAbstractScenarioPluginCerberusIntegration(unittest.TestCase):
self.assertEqual(telemetries[0].exit_status, 1)
@patch('krkn.scenario_plugins.abstract_scenario_plugin.cerberus.publish_kraken_status')
@patch('krkn.scenario_plugins.abstract_scenario_plugin.os.path.exists', return_value=False)
@patch('time.sleep')
def test_missing_scenario_file_logs_error_and_marks_failed(
self, mock_sleep, mock_exists, mock_cerberus_publish
):
"""Test that a missing scenario file logs a clear error and is marked as failed without crashing"""
scenarios_list = ["scenarios/openshift/cnv.yml"]
with self.assertLogs('root', level='ERROR') as log_ctx:
failed_scenarios, telemetries = self.plugin.run_scenarios(
"test-uuid",
scenarios_list,
self.krkn_config,
self.mock_telemetry,
)
# scenario is marked failed and returned in failed list
self.assertEqual(len(failed_scenarios), 1)
self.assertEqual(failed_scenarios[0], "scenarios/openshift/cnv.yml")
# telemetry recorded with exit_status=1
self.assertEqual(len(telemetries), 1)
self.assertEqual(telemetries[0].exit_status, 1)
# error message contains the missing path
self.assertTrue(
any("scenarios/openshift/cnv.yml" in msg for msg in log_ctx.output),
f"Expected file path in error log, got: {log_ctx.output}",
)
# set_parameters_base64 and cerberus should not be called
self.mock_telemetry.set_parameters_base64.assert_not_called()
mock_cerberus_publish.assert_not_called()
@patch('krkn.scenario_plugins.abstract_scenario_plugin.cerberus.publish_kraken_status')
@patch('krkn.scenario_plugins.abstract_scenario_plugin.cleanup_rollback_version_files')
@patch('krkn.scenario_plugins.abstract_scenario_plugin.utils.collect_and_put_ocp_logs')
@patch('krkn.scenario_plugins.abstract_scenario_plugin.signal_handler.signal_context')
@patch('krkn.scenario_plugins.abstract_scenario_plugin.os.path.exists')
@patch('time.sleep')
def test_missing_scenario_file_skipped_others_continue(
self, mock_sleep, mock_exists, mock_signal_ctx, mock_collect_logs,
mock_cleanup, mock_cerberus_publish
):
"""Test that a missing file is skipped and remaining scenarios still run"""
mock_signal_ctx.return_value.__enter__ = Mock()
mock_signal_ctx.return_value.__exit__ = Mock(return_value=False)
# first file missing, second exists
mock_exists.side_effect = [False, True]
scenarios_list = ["missing.yml", "scenario2.yaml"]
with self.assertLogs('root', level='ERROR'):
failed_scenarios, telemetries = self.plugin.run_scenarios(
"test-uuid",
scenarios_list,
self.krkn_config,
self.mock_telemetry,
)
self.assertIn("missing.yml", failed_scenarios)
self.assertNotIn("scenario2.yaml", failed_scenarios)
self.assertEqual(len(telemetries), 2)
# cerberus called only for the scenario that ran
self.assertEqual(mock_cerberus_publish.call_count, 1)
if __name__ == '__main__':
unittest.main()

View File

@@ -176,7 +176,7 @@ class TestKubevirtVmOutageScenarioPlugin(unittest.TestCase):
self.k8s_client.delete_vmi.return_value = None
with patch("builtins.open", unittest.mock.mock_open(read_data=yaml.dump(self.config))):
result = self.plugin.run("test-uuid", self.scenario_file, {}, self.telemetry, self.scenario_telemetry)
result = self.plugin.run("test-uuid", self.scenario_file, self.telemetry, self.scenario_telemetry)
self.assertEqual(result, 0)
self.k8s_client.delete_vmi.assert_called_once_with("test-vm", "default")
@@ -196,7 +196,7 @@ class TestKubevirtVmOutageScenarioPlugin(unittest.TestCase):
self.k8s_client.delete_vmi.side_effect = ApiException(status=500)
with patch("builtins.open", unittest.mock.mock_open(read_data=yaml.dump(self.config))):
result = self.plugin.run("test-uuid", self.scenario_file, {}, self.telemetry, self.scenario_telemetry)
result = self.plugin.run("test-uuid", self.scenario_file, self.telemetry, self.scenario_telemetry)
self.assertEqual(result, 1)
self.k8s_client.delete_vmi.assert_called_once_with("test-vm", "default")
@@ -234,7 +234,7 @@ class TestKubevirtVmOutageScenarioPlugin(unittest.TestCase):
self.k8s_client.delete_vmi.return_value = None
with patch("builtins.open", unittest.mock.mock_open(read_data=yaml.dump(self.config))):
result = self.plugin.run("test-uuid", self.scenario_file, {}, self.telemetry, self.scenario_telemetry)
result = self.plugin.run("test-uuid", self.scenario_file, self.telemetry, self.scenario_telemetry)
self.assertEqual(result, 0)
# Verify patch_vm was called to disable auto-restart
@@ -278,7 +278,7 @@ class TestKubevirtVmOutageScenarioPlugin(unittest.TestCase):
self.k8s_client.get_vmi.return_value = None
with patch("builtins.open", unittest.mock.mock_open(read_data=yaml.dump(self.config))):
result = self.plugin.run("test-uuid", self.scenario_file, {}, self.telemetry, self.scenario_telemetry)
result = self.plugin.run("test-uuid", self.scenario_file, self.telemetry, self.scenario_telemetry)
# When validation fails, run() returns 1 due to exception handling
self.assertEqual(result, 1)

View File

@@ -0,0 +1,323 @@
#!/usr/bin/env python3
"""
Tests for fixes introduced in issues #24 to #28.
Stubs all external dependencies (krkn_lib, kubernetes, broken urllib3)
so tests run without any additional installs.
Usage (run from repo root):
python3 -m coverage run -a -m unittest tests/test_fixes_24_to_28.py -v
"""
import queue
import sys
import types
import unittest
from unittest.mock import MagicMock, patch
# ---------------------------------------------------------------------------
# Inject minimal stubs for every external dependency
# ---------------------------------------------------------------------------
def _inject(name, **attrs):
mod = types.ModuleType(name)
for k, v in attrs.items():
setattr(mod, k, v)
sys.modules.setdefault(name, mod)
return sys.modules[name]
# -- krkn_lib ----------------------------------------------------------------
_inject("krkn_lib")
_inject("krkn_lib.utils", deep_get_attribute=MagicMock(return_value=[]))
_inject("krkn_lib.utils.functions",
get_yaml_item_value=MagicMock(
side_effect=lambda cfg, key, default: (
cfg.get(key, default) if isinstance(cfg, dict) else default
)
))
_inject("krkn_lib.models.telemetry",
ScenarioTelemetry=MagicMock(), ChaosRunTelemetry=MagicMock())
class _VirtCheck:
def __init__(self, d):
for k, v in d.items():
setattr(self, k, v)
_inject("krkn_lib.models.telemetry.models", VirtCheck=_VirtCheck)
_inject("krkn_lib.models.krkn",
ChaosRunAlertSummary=MagicMock(), ChaosRunAlert=MagicMock())
_inject("krkn_lib.models.elastic.models", ElasticAlert=MagicMock())
_inject("krkn_lib.models.elastic", ElasticChaosRunTelemetry=MagicMock())
_inject("krkn_lib.models.k8s", ResiliencyReport=MagicMock())
_inject("krkn_lib.elastic.krkn_elastic", KrknElastic=MagicMock())
_inject("krkn_lib.prometheus.krkn_prometheus", KrknPrometheus=MagicMock())
_inject("krkn_lib.telemetry.ocp", KrknTelemetryOpenshift=MagicMock())
_inject("krkn_lib.telemetry.k8s", KrknTelemetryKubernetes=MagicMock())
_inject("krkn_lib.k8s", KrknKubernetes=MagicMock())
_inject("krkn_lib.ocp", KrknOpenshift=MagicMock())
# -- broken third-party ------------------------------------------------------
# urllib3.exceptions doesn't export HTTPError on this Python version
import urllib3.exceptions # noqa: E402 (real module, just patch the attr)
if not hasattr(urllib3.exceptions, "HTTPError"):
urllib3.exceptions.HTTPError = Exception
# kubernetes stub the whole chain before anything imports it
_inject("kubernetes")
_inject("kubernetes.client")
_inject("kubernetes.client.rest", ApiException=type("ApiException", (Exception,), {}))
# -- other stubs needed by krkn internals ------------------------------------
_inject("tzlocal")
_inject("tzlocal.unix", get_localzone=MagicMock(return_value="UTC"))
# kubevirt plugin (imports kubernetes.client.rest)
_KubevirtPlugin = MagicMock()
_inject(
"krkn.scenario_plugins.kubevirt_vm_outage"
".kubevirt_vm_outage_scenario_plugin",
KubevirtVmOutageScenarioPlugin=_KubevirtPlugin,
)
# -- yaml (real or stub) -----------------------------------------------------
try:
import yaml as _yaml # noqa: F401
except ImportError:
_inject("yaml")
# ---------------------------------------------------------------------------
# Now import the actual krkn modules under test
# ---------------------------------------------------------------------------
from krkn.prometheus import client # noqa: E402
from krkn.utils import VirtChecker as VirtCheckerModule # noqa: E402
from krkn.utils.VirtChecker import VirtChecker # noqa: E402
# ===========================================================================
# #1 — Typo "wating" -> "waiting"
# ===========================================================================
class TestIssue24TypoFix(unittest.TestCase):
"""#24: Log message must spell 'waiting' correctly."""
def test_no_wating_typo_in_source(self):
import pathlib
src = pathlib.Path("krkn/scenario_plugins/abstract_scenario_plugin.py").read_text()
self.assertNotIn('"wating ', src,
"Typo 'wating' still present in abstract_scenario_plugin.py")
def test_waiting_present_in_source(self):
import pathlib
src = pathlib.Path("krkn/scenario_plugins/abstract_scenario_plugin.py").read_text()
self.assertIn('"waiting ', src,
"'waiting' not found in abstract_scenario_plugin.py")
# ===========================================================================
# #2 — print() replaced by logging.debug()
# ===========================================================================
class TestIssue25NoPrintInClient(unittest.TestCase):
"""#25: client.py must not use print() for pod metric messages."""
def test_no_print_adding_pod(self):
import pathlib
src = pathlib.Path("krkn/prometheus/client.py").read_text()
self.assertNotIn("print('adding pod'", src)
self.assertNotIn('print("adding pod"', src)
def test_logging_debug_used(self):
import pathlib
src = pathlib.Path("krkn/prometheus/client.py").read_text()
self.assertIn('logging.debug("adding pod', src)
def test_metrics_does_not_write_to_stdout(self):
"""metrics() must not emit to stdout for pod telemetry entries."""
import io, json, os, tempfile
prom_cli = MagicMock()
prom_cli.process_prom_query_in_range.return_value = []
prom_cli.process_query.return_value = []
telemetry_data = {
"scenarios": [{
"affected_pods": {
"disrupted": [{"name": "pod-1", "namespace": "default"}]
},
"affected_nodes": [],
}],
"health_checks": [],
"virt_checks": [],
}
profile = tempfile.NamedTemporaryFile(
mode="w", suffix=".yaml", delete=False
)
profile.write("metrics:\n - query: up\n metricName: uptime\n")
profile.close()
elastic = MagicMock()
elastic.upload_metrics_to_elasticsearch.return_value = 0
captured = io.StringIO()
sys.stdout, orig = captured, sys.stdout
try:
client.metrics(
prom_cli, elastic, "uuid-1",
1_000_000.0, 1_000_060.0,
profile.name, "idx",
json.dumps(telemetry_data),
)
finally:
sys.stdout = orig
os.unlink(profile.name)
self.assertEqual(
captured.getvalue(), "",
f"stdout was not empty: {captured.getvalue()!r}",
)
# ===========================================================================
# #3 — Star import removed
# ===========================================================================
class TestIssue26NoStarImport(unittest.TestCase):
"""#26: utils/__init__.py must use explicit imports, not star import."""
def test_no_star_import(self):
import pathlib
src = pathlib.Path("krkn/utils/__init__.py").read_text()
self.assertNotIn("import *", src)
def test_explicit_names_present(self):
import pathlib
src = pathlib.Path("krkn/utils/__init__.py").read_text()
self.assertIn("populate_cluster_events", src)
self.assertIn("collect_and_put_ocp_logs", src)
self.assertIn("KrknKubernetes", src)
self.assertIn("ScenarioTelemetry", src)
self.assertIn("KrknTelemetryOpenshift", src)
def test_functions_accessible_from_package(self):
from krkn import utils
self.assertTrue(hasattr(utils, "populate_cluster_events"))
self.assertTrue(hasattr(utils, "collect_and_put_ocp_logs"))
self.assertTrue(hasattr(utils, "KrknKubernetes"))
self.assertTrue(hasattr(utils, "ScenarioTelemetry"))
self.assertTrue(hasattr(utils, "KrknTelemetryOpenshift"))
# ===========================================================================
# #4 — global declaration removed from main()
# ===========================================================================
class TestIssue27NoGlobalInMain(unittest.TestCase):
"""#27: main() in run_kraken.py must not declare global variables."""
def test_no_global_statement_in_main(self):
import ast, pathlib
src = pathlib.Path("run_kraken.py").read_text()
tree = ast.parse(src)
found = []
for node in ast.walk(tree):
if isinstance(node, ast.FunctionDef) and node.name == "main":
for child in ast.walk(node):
if isinstance(child, ast.Global):
found.extend(child.names)
self.assertEqual(found, [],
f"Global declarations found in main(): {found}")
# ===========================================================================
# #5 — Exception logged at ERROR level, not INFO
# ===========================================================================
class TestIssue28ExceptionLogLevel(unittest.TestCase):
"""#28: VirtChecker must log VM status exceptions at ERROR, not INFO."""
def test_no_info_for_vm_exception_in_source(self):
import pathlib
src = pathlib.Path("krkn/utils/VirtChecker.py").read_text()
self.assertNotIn(
"logging.info('Exception in get vm status')", src
)
def test_error_level_present_in_source(self):
import pathlib
src = pathlib.Path("krkn/utils/VirtChecker.py").read_text()
self.assertIn(
'logging.exception("Exception in get vm status")', src
)
def test_runtime_exception_triggers_error_log(self):
"""When get_vm_access raises, the handler must call logging.error."""
config = {}
mock_krkn = MagicMock()
with patch(
"krkn.utils.VirtChecker.get_yaml_item_value",
side_effect=lambda cfg, key, default: (
cfg.get(key, default) if isinstance(cfg, dict) else default
),
):
checker = VirtChecker(config, iterations=1, krkn_lib=mock_krkn)
checker.batch_size = 1
checker.interval = 0
checker.disconnected = False
vm = _VirtCheck({
"vm_name": "vm-1",
"ip_address": "1.2.3.4",
"namespace": "ns",
"node_name": "w1",
"new_ip_address": "",
})
error_calls, info_calls, exception_calls = [], [], []
with (
patch.object(
checker, "get_vm_access",
side_effect=RuntimeError("connection refused"),
),
patch("krkn.utils.VirtChecker.logging") as mock_log,
patch("krkn.utils.VirtChecker.time") as mock_time,
):
mock_log.error.side_effect = (
lambda msg, *a, **kw: error_calls.append(msg % a if a else msg)
)
mock_log.info.side_effect = (
lambda msg, *a, **kw: info_calls.append(msg % a if a else msg)
)
mock_log.exception.side_effect = (
lambda msg, *a, **kw: exception_calls.append(msg % a if a else msg)
)
# End loop after first sleep
mock_time.sleep.side_effect = (
lambda _: setattr(checker, "current_iterations", 999)
)
checker.current_iterations = 0
q = queue.SimpleQueue()
checker.run_virt_check([vm], q)
vm_infos = [m for m in info_calls if "Exception in get vm status" in m]
err_vm_msgs = [m for m in error_calls + exception_calls if "Exception in get vm status" in m]
self.assertEqual(
vm_infos, [],
"Exception still logged at INFO level at runtime",
)
self.assertGreater(
len(err_vm_msgs), 0,
"Exception not logged at ERROR level at runtime",
)
if __name__ == "__main__":
unittest.main()
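The `_inject` helper in the test file above registers stub modules in `sys.modules` so that later imports of unavailable dependencies succeed. The core of that technique, reduced to a minimal self-contained sketch (`fake_cloud_sdk` is a made-up module name):

```python
import sys
import types

def inject_stub(name, **attrs):
    # Create an empty module object and attach the requested attributes.
    mod = types.ModuleType(name)
    for key, value in attrs.items():
        setattr(mod, key, value)
    # setdefault: never clobber a module that is genuinely importable.
    sys.modules.setdefault(name, mod)
    return sys.modules[name]

# Stub a dependency that is not installed, then import it normally.
inject_stub("fake_cloud_sdk", connect=lambda: "connected")
import fake_cloud_sdk
```

Because Python's import machinery consults `sys.modules` first, the stub must be registered before any module under test executes its own `import` statements, which is why the test file performs all injection at the top, ahead of the krkn imports.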

View File

@@ -597,7 +597,7 @@ class TestFinalizeAndSave(unittest.TestCase):
prom_cli=self.mock_prom,
total_start_time=self.start,
total_end_time=self.end,
run_mode="controller",
run_mode="detailed",
)
mock_print.assert_called()

View File

@@ -10,7 +10,7 @@ Assisted By: Claude Code
"""
import unittest
from unittest.mock import MagicMock
from unittest.mock import MagicMock, patch
from krkn_lib.k8s import KrknKubernetes
from krkn_lib.telemetry.ocp import KrknTelemetryOpenshift
@@ -35,6 +35,22 @@ class TestTimeActionsScenarioPlugin(unittest.TestCase):
self.assertEqual(result, ["time_scenarios"])
self.assertEqual(len(result), 1)
@patch("krkn.scenario_plugins.time_actions.time_actions_scenario_plugin.logging")
@patch("builtins.open", side_effect=RuntimeError("disk quota exceeded"))
def test_exception_variable_bound_in_except_handler(self, mock_open, mock_logging):
"""run() must bind exception variable so logging shows actual error, not NameError"""
result = self.plugin.run(
run_uuid="test-uuid",
scenario="fake_scenario.yaml",
lib_telemetry=MagicMock(),
scenario_telemetry=MagicMock(),
)
self.assertEqual(result, 1)
logged_msg = mock_logging.error.call_args[0][0]
self.assertIn("disk quota exceeded", logged_msg)
self.assertNotIn("NameError", logged_msg)
if __name__ == "__main__":
unittest.main()

View File

@@ -4,18 +4,26 @@
Test suite for ZoneOutageScenarioPlugin class
Usage:
python -m coverage run -a -m unittest tests/test_zone_outage_scenario_plugin.py -v
python -m coverage run -a -m unittest \
tests/test_zone_outage_scenario_plugin.py -v
Assisted By: Claude Code
"""
import base64
import json
import tempfile
import unittest
from unittest.mock import MagicMock
import uuid
from pathlib import Path
from unittest.mock import MagicMock, patch
from krkn_lib.k8s import KrknKubernetes
from krkn_lib.telemetry.ocp import KrknTelemetryOpenshift
import yaml
from krkn.scenario_plugins.zone_outage.zone_outage_scenario_plugin import ZoneOutageScenarioPlugin
from krkn.rollback.config import RollbackContent
from krkn.scenario_plugins.zone_outage.zone_outage_scenario_plugin import (
ZoneOutageScenarioPlugin,
)
class TestZoneOutageScenarioPlugin(unittest.TestCase):
@@ -36,5 +44,217 @@ class TestZoneOutageScenarioPlugin(unittest.TestCase):
self.assertEqual(len(result), 1)
class TestRollbackGcpZoneOutage(unittest.TestCase):
"""Tests for the GCP zone outage rollback functionality"""
@patch(
"krkn.scenario_plugins.node_actions."
"gcp_node_scenarios.gcp_node_scenarios"
)
def test_rollback_gcp_zone_outage_success(self, mock_gcp_class):
"""
Test successful rollback starts all stopped nodes
"""
rollback_data = {
"nodes": ["node-1", "node-2", "node-3"],
"timeout": 180,
"kube_check": True,
}
encoded = base64.b64encode(
json.dumps(rollback_data).encode("utf-8")
).decode("utf-8")
rollback_content = RollbackContent(
resource_identifier=encoded,
)
mock_lib_telemetry = MagicMock()
mock_kubecli = MagicMock()
mock_lib_telemetry.get_lib_kubernetes.return_value = mock_kubecli
mock_cloud_instance = MagicMock()
mock_gcp_class.return_value = mock_cloud_instance
ZoneOutageScenarioPlugin.rollback_gcp_zone_outage(
rollback_content, mock_lib_telemetry
)
self.assertEqual(
mock_cloud_instance.node_start_scenario.call_count, 3
)
mock_cloud_instance.node_start_scenario.assert_any_call(
1, "node-1", 180, None
)
mock_cloud_instance.node_start_scenario.assert_any_call(
1, "node-2", 180, None
)
mock_cloud_instance.node_start_scenario.assert_any_call(
1, "node-3", 180, None
)

    @patch(
        "krkn.scenario_plugins.node_actions."
        "gcp_node_scenarios.gcp_node_scenarios"
    )
    def test_rollback_gcp_zone_outage_partial_failure(self, mock_gcp_class):
        """
        Test rollback continues when one node fails to start
        """
        rollback_data = {
            "nodes": ["node-1", "node-2"],
            "timeout": 180,
            "kube_check": True,
        }
        encoded = base64.b64encode(
            json.dumps(rollback_data).encode("utf-8")
        ).decode("utf-8")
        rollback_content = RollbackContent(
            resource_identifier=encoded,
        )
        mock_lib_telemetry = MagicMock()
        mock_kubecli = MagicMock()
        mock_lib_telemetry.get_lib_kubernetes.return_value = mock_kubecli
        mock_cloud_instance = MagicMock()
        mock_gcp_class.return_value = mock_cloud_instance
        mock_cloud_instance.node_start_scenario.side_effect = [
            Exception("GCP API error"),
            None,
        ]

        ZoneOutageScenarioPlugin.rollback_gcp_zone_outage(
            rollback_content, mock_lib_telemetry
        )

        self.assertEqual(
            mock_cloud_instance.node_start_scenario.call_count, 2
        )
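The partial-failure test hinges on `MagicMock`'s list-valued `side_effect`: each call consumes the next item of the list, and an item that is an exception instance is raised rather than returned. A minimal standalone sketch of that behavior (illustration only, no plugin code involved):

```python
# Standalone sketch: a list-valued side_effect drives the two calls in
# the partial-failure test -- the first raises, the second returns None.
from unittest.mock import MagicMock

mock_start = MagicMock(side_effect=[Exception("GCP API error"), None])

results = []
for node in ["node-1", "node-2"]:
    try:
        results.append(mock_start(node))
    except Exception as exc:
        # Per-node error handling: record the failure and keep going,
        # mirroring what the rollback is expected to do.
        results.append(f"failed: {exc}")

assert mock_start.call_count == 2
assert results == ["failed: GCP API error", None]
```

Because the loop swallows the first exception, the mock still records both calls, which is exactly what the `call_count == 2` assertion above verifies.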

    def test_rollback_gcp_zone_outage_invalid_data(self):
        """
        Test rollback raises exception for invalid base64 data
        """
        rollback_content = RollbackContent(
            resource_identifier="invalid_base64_data",
        )
        mock_lib_telemetry = MagicMock()
        with self.assertRaises(Exception):
            ZoneOutageScenarioPlugin.rollback_gcp_zone_outage(
                rollback_content, mock_lib_telemetry
            )
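The rollback tests build their `RollbackContent` payload by JSON-serializing the node data and base64-encoding the result. A minimal sketch of that round-trip; the helper names here are illustrative, not taken from the plugin:

```python
import base64
import json


def encode_rollback_data(data: dict) -> str:
    # JSON-serialize, then base64-encode, yielding a single opaque
    # string suitable for RollbackContent.resource_identifier.
    return base64.b64encode(json.dumps(data).encode("utf-8")).decode("utf-8")


def decode_rollback_data(encoded: str) -> dict:
    # Reverse operation; malformed input raises (binascii.Error or
    # json.JSONDecodeError), which is why the invalid-data test above
    # expects rollback to raise.
    return json.loads(base64.b64decode(encoded).decode("utf-8"))


payload = {"nodes": ["node-1", "node-2"], "timeout": 180, "kube_check": True}
assert decode_rollback_data(encode_rollback_data(payload)) == payload
```

The round-trip is lossless for JSON-compatible dicts, so the rollback handler can recover exactly the node list and timeout that the run phase recorded.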


class TestZoneOutageRun(unittest.TestCase):
    """Tests for the run method of ZoneOutageScenarioPlugin"""

    def setUp(self):
        self.temp_dir = tempfile.TemporaryDirectory()
        self.tmp_path = Path(self.temp_dir.name)

    def tearDown(self):
        self.temp_dir.cleanup()

    def _create_scenario_file(self, config=None):
        """Helper to create a temporary scenario YAML file"""
        default_config = {
            "zone_outage": {
                "cloud_type": "gcp",
                "zone": "us-central1-a",
                "duration": 1,
                "timeout": 10,
                "kube_check": True,
            }
        }
        if config:
            default_config["zone_outage"].update(config)
        scenario_file = self.tmp_path / "test_scenario.yaml"
        with open(scenario_file, "w") as f:
            yaml.dump(default_config, f)
        return str(scenario_file)
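The helper serializes the scenario with `yaml.dump`; the plugin presumably reads it back with a YAML loader. A small standalone round-trip of the same config shape (requires PyYAML, which this test module already imports; no plugin code involved):

```python
import tempfile
from pathlib import Path

import yaml  # PyYAML, third-party

config = {
    "zone_outage": {
        "cloud_type": "gcp",
        "zone": "us-central1-a",
        "duration": 1,
        "timeout": 10,
        "kube_check": True,
    }
}

with tempfile.TemporaryDirectory() as tmp:
    # Write the scenario the same way _create_scenario_file does...
    scenario_file = Path(tmp) / "test_scenario.yaml"
    scenario_file.write_text(yaml.dump(config))
    # ...and load it back with the safe loader.
    loaded = yaml.safe_load(scenario_file.read_text())

assert loaded == config
```

Scalars and nested dicts survive the round-trip unchanged, so overrides applied via `update()` before dumping are what the plugin sees when it parses the file.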

    def _create_mocks(self):
        """Helper to create mock objects for testing"""
        mock_lib_telemetry = MagicMock()
        mock_lib_kubernetes = MagicMock()
        mock_lib_telemetry.get_lib_kubernetes.return_value = (
            mock_lib_kubernetes
        )
        mock_scenario_telemetry = MagicMock()
        return mock_lib_telemetry, mock_lib_kubernetes, mock_scenario_telemetry

    @patch("time.sleep")
    @patch(
        "krkn.scenario_plugins.zone_outage."
        "zone_outage_scenario_plugin.gcp_node_scenarios"
    )
    def test_run_gcp_success(self, mock_gcp_class, mock_sleep):
        """Test successful GCP zone outage scenario execution"""
        scenario_file = self._create_scenario_file()
        mock_lib_telemetry, mock_lib_kubernetes, mock_scenario_telemetry = (
            self._create_mocks()
        )
        mock_lib_kubernetes.list_killable_nodes.return_value = ["node-1"]
        mock_cloud = MagicMock()
        mock_gcp_class.return_value = mock_cloud

        plugin = ZoneOutageScenarioPlugin()
        result = plugin.run(
            run_uuid=str(uuid.uuid4()),
            scenario=scenario_file,
            lib_telemetry=mock_lib_telemetry,
            scenario_telemetry=mock_scenario_telemetry,
        )

        self.assertEqual(result, 0)
        mock_lib_kubernetes.list_killable_nodes.assert_called_once()
        mock_cloud.node_stop_scenario.assert_called()
        mock_cloud.node_start_scenario.assert_called()

    def test_run_unsupported_cloud_type(self):
        """Test run returns 1 for unsupported cloud type"""
        scenario_file = self._create_scenario_file(
            {"cloud_type": "unsupported"}
        )
        mock_lib_telemetry, mock_lib_kubernetes, mock_scenario_telemetry = (
            self._create_mocks()
        )

        plugin = ZoneOutageScenarioPlugin()
        result = plugin.run(
            run_uuid=str(uuid.uuid4()),
            scenario=scenario_file,
            lib_telemetry=mock_lib_telemetry,
            scenario_telemetry=mock_scenario_telemetry,
        )

        self.assertEqual(result, 1)

    def test_run_gcp_exception(self):
        """Test run handles exceptions gracefully"""
        scenario_file = self._create_scenario_file()
        mock_lib_telemetry, mock_lib_kubernetes, mock_scenario_telemetry = (
            self._create_mocks()
        )
        mock_lib_telemetry.get_lib_kubernetes.side_effect = Exception(
            "Connection error"
        )

        plugin = ZoneOutageScenarioPlugin()
        result = plugin.run(
            run_uuid=str(uuid.uuid4()),
            scenario=scenario_file,
            lib_telemetry=mock_lib_telemetry,
            scenario_telemetry=mock_scenario_telemetry,
        )

        self.assertEqual(result, 1)


if __name__ == "__main__":
    unittest.main()