andersonid dd51071592 docs: update all documentation with PatternFly UI Revolution changes

- Update README.md with v2.0.0 PatternFly UI Revolution features
- Add Smart Recommendations Engine and VPA CRD Integration sections
- Update application branding to 'ORU Scanner'
- Add Quay.io migration and GitHub Actions information
- Update DOCUMENTATION.md with current status and script cleanup info
- Update AIAgents-Support.md with complete Phase 2 completion status
- Add PatternFly UI, VPA CRD, and infrastructure improvements
- Update deployment status for OCP 4.15, 4.18, and 4.19 clusters
- Reflect script cleanup (19 obsolete scripts removed)
- Update roadmap to show Phase 2 as completed

2025-10-03 07:36:49 -03:00

.github/workflows

cleanup: remove obsolete scripts and update GitHub Actions for Quay.io

2025-10-03 07:28:05 -03:00

app

fix: format Resource Utilization to show only 1 decimal place

2025-10-03 07:18:45 -03:00

k8s

fix: format Resource Utilization to show only 1 decimal place

2025-10-03 07:18:45 -03:00

scripts

cleanup: remove obsolete scripts and update GitHub Actions for Quay.io

2025-10-03 07:28:05 -03:00

.env.example

Initial commit: OpenShift Resource Governance Tool

2025-09-25 14:26:24 -03:00

.gitignore

Revert: Put AIAgents-Support.md back in .gitignore as it's for AI agent context only

2025-09-30 16:41:56 -03:00

Dockerfile

Add system namespace filtering

2025-09-25 17:39:33 -03:00

Dockerfile.simple

Add: complete deploy scripts with ImagePullSecret for cluster-admin

2025-09-25 15:24:31 -03:00

DOCUMENTATION.md

docs: update all documentation with PatternFly UI Revolution changes

2025-10-03 07:36:49 -03:00

Makefile

Add: complete deploy scripts with ImagePullSecret for cluster-admin

2025-09-25 15:24:31 -03:00

openshift-git-deploy.yaml

Update to use Docker Hub registry

2025-09-25 14:46:09 -03:00

README.md

docs: update all documentation with PatternFly UI Revolution changes

2025-10-03 07:36:49 -03:00

requirements.txt

Add: complete deploy scripts with ImagePullSecret for cluster-admin

2025-09-25 15:24:31 -03:00

setup.sh

Translate all Portuguese text to English

2025-09-25 21:05:41 -03:00

README.md

ORU Scanner - OpenShift Resource Usage Scanner

A comprehensive tool for analyzing user workloads and resource usage in OpenShift clusters that goes beyond what Metrics Server and VPA offer, providing validations, reports and consolidated recommendations.

🚀 Features

Automatic Collection: Collects requests/limits from all pods/containers in the cluster
Red Hat Validations: Validates capacity management best practices with specific request/limit values
Smart Resource Analysis: Identifies workloads without requests/limits and provides detailed analysis
Detailed Problem Analysis: Modal-based detailed view showing pod and container resource issues
Smart Recommendations Engine: PatternFly-based gallery with individual workload cards and bulk selection
VPA CRD Integration: Real Kubernetes API integration for Vertical Pod Autoscaler management
Historical Analysis: Workload-based historical resource usage analysis with real numerical data (1h, 6h, 24h, 7d)
Prometheus Integration: Collects real consumption metrics from OpenShift monitoring with OpenShift-specific queries
Cluster Overcommit Analysis: Real-time cluster capacity vs requests analysis with detailed tooltips and modals
PromQL Query Display: Shows raw Prometheus queries used for data collection, allowing validation in OpenShift console
Export Reports: Generates reports in JSON and CSV formats
Modern Web UI: PatternFly design system with professional interface and responsive layout
Cluster Agnostic: Works on any OpenShift 4.x cluster without cluster-specific configuration

📋 Requirements

OpenShift 4.x
Prometheus (native in OCP)
VPA (optional, for recommendations)
Python 3.11+
Podman (preferred)
OpenShift CLI (oc)

🛠️ Installation

🚀 Quick Deploy (Recommended)

# 1. Clone the repository
git clone https://github.com/andersonid/openshift-resource-governance.git
cd openshift-resource-governance

# 2. Login to OpenShift
oc login <cluster-url>

# 3. Complete deploy (creates everything automatically)
./scripts/deploy-complete.sh

📋 Manual Deploy (Development)

# Build and push image
./scripts/build-and-push.sh

# Deploy to OpenShift
oc apply -f k8s/

# Wait for deployment
oc rollout status deployment/resource-governance -n resource-governance

🗑️ Undeploy

# Completely remove application
./scripts/undeploy-complete.sh

🌐 Application Access

After deployment, access the application through the created route:

# Get route URL
oc get route -n resource-governance

# Access via browser (URL will be automatically generated)
# Example: https://oru.apps.your-cluster.com

🔧 Configuration

ConfigMap

The application is configured through the ConfigMap resource-governance-config:

data:
  CPU_LIMIT_RATIO: "3.0"                    # Default limit:request ratio for CPU
  MEMORY_LIMIT_RATIO: "3.0"                 # Default limit:request ratio for memory
  MIN_CPU_REQUEST: "10m"                    # Minimum CPU request
  MIN_MEMORY_REQUEST: "32Mi"                # Minimum memory request
  CRITICAL_NAMESPACES: |                    # Critical namespaces for VPA
    openshift-monitoring
    openshift-ingress
    openshift-apiserver
  PROMETHEUS_URL: "http://prometheus-k8s.openshift-monitoring.svc.cluster.local:9091"

Environment Variables

KUBECONFIG: Path to kubeconfig (used in development)
PROMETHEUS_URL: Prometheus URL
CPU_LIMIT_RATIO: CPU limit:request ratio
MEMORY_LIMIT_RATIO: Memory limit:request ratio
MIN_CPU_REQUEST: Minimum CPU request
MIN_MEMORY_REQUEST: Minimum memory request
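The variables above could be loaded along these lines. This is a minimal sketch, not the application's actual loader: the Settings class and load_settings function are hypothetical names, and the fallback values are simply the ones shown in the ConfigMap example.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Hypothetical settings holder; fields mirror the documented variables."""
    prometheus_url: str
    cpu_limit_ratio: float
    memory_limit_ratio: float
    min_cpu_request: str
    min_memory_request: str


def load_settings(env=None) -> Settings:
    """Read settings from a mapping (os.environ by default).

    Fallbacks are taken from the ConfigMap example, as an assumption.
    """
    env = os.environ if env is None else env
    return Settings(
        prometheus_url=env.get(
            "PROMETHEUS_URL",
            "http://prometheus-k8s.openshift-monitoring.svc.cluster.local:9091",
        ),
        cpu_limit_ratio=float(env.get("CPU_LIMIT_RATIO", "3.0")),
        memory_limit_ratio=float(env.get("MEMORY_LIMIT_RATIO", "3.0")),
        min_cpu_request=env.get("MIN_CPU_REQUEST", "10m"),
        min_memory_request=env.get("MIN_MEMORY_REQUEST", "32Mi"),
    )
```

Passing an explicit mapping, for example load_settings({"CPU_LIMIT_RATIO": "2.5"}), makes the loader easy to exercise without touching the real environment.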

📊 Usage

API Endpoints

Cluster Status

GET /api/v1/cluster/status

Namespace Status

GET /api/v1/namespace/{namespace}/status

Validations

GET /api/v1/validations?namespace=default&severity=error

Historical Analysis

GET /api/v1/namespace/{namespace}/workload/{workload}/historical-analysis?time_range=24h

Workload Metrics with PromQL Queries

GET /api/v1/workloads/{namespace}/{workload}/metrics?time_range=24h
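To validate a displayed PromQL query outside the UI, the same time ranges can be mapped onto the Prometheus HTTP API's query_range endpoint. The sketch below only builds the URL; query_range_url and RANGE_SECONDS are hypothetical names, the 5m step is an assumed default, and authentication (normally required by OpenShift monitoring) is left out.

```python
from urllib.parse import urlencode

# Seconds covered by each time_range value the API accepts (1h, 6h, 24h, 7d).
RANGE_SECONDS = {"1h": 3600, "6h": 6 * 3600, "24h": 24 * 3600, "7d": 7 * 24 * 3600}


def query_range_url(prometheus_url, query, time_range, end, step="5m"):
    """Build a Prometheus /api/v1/query_range URL.

    `end` is a Unix timestamp in seconds; `start` is derived from time_range.
    """
    if time_range not in RANGE_SECONDS:
        raise ValueError(f"unsupported time_range: {time_range!r}")
    params = {
        "query": query,
        "start": end - RANGE_SECONDS[time_range],
        "end": end,
        "step": step,
    }
    return f"{prometheus_url.rstrip('/')}/api/v1/query_range?{urlencode(params)}"
```

The query string itself is whatever the application displays; urlencode takes care of escaping braces, quotes and brackets in the PromQL expression.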

Export Report

POST /api/v1/export
Content-Type: application/json

{
  "format": "csv",
  "namespaces": ["default", "kube-system"],
  "includeVPA": true,
  "includeAnalysis": true
}

Usage Examples

1. Check Cluster Status

curl https://your-route-url/api/v1/cluster/status

2. Export CSV Report

curl -X POST https://your-route-url/api/v1/export \
  -H "Content-Type: application/json" \
  -d '{"format": "csv", "includeAnalysis": true}'

3. View Critical Validations

curl "https://your-route-url/api/v1/validations?severity=critical"
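The same calls can be prepared from Python with only the standard library. This is a sketch under the assumption that the endpoints and JSON field names are exactly those documented above; validations_url and export_request are hypothetical helper names, and nothing is sent until the request is passed to urllib.request.urlopen.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def validations_url(base_url, namespace=None, severity=None):
    """Build the validations URL, adding only the filters that were given."""
    params = {k: v for k, v in (("namespace", namespace), ("severity", severity)) if v}
    query = f"?{urlencode(params)}" if params else ""
    return f"{base_url.rstrip('/')}/api/v1/validations{query}"


def export_request(base_url, fmt="csv", namespaces=None, include_analysis=True):
    """Prepare (but do not send) the POST request for the export endpoint."""
    body = {"format": fmt, "includeAnalysis": include_analysis}
    if namespaces:
        body["namespaces"] = list(namespaces)
    return Request(
        f"{base_url.rstrip('/')}/api/v1/export",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Keeping request construction separate from sending makes it straightforward to inspect the URL and payload before hitting a real cluster.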

🔍 Implemented Validations

1. Required Requests

Problem: Pods without defined requests
Severity: Error
Recommendation: Define CPU and memory requests

2. Recommended Limits

Problem: Pods without defined limits
Severity: Warning
Recommendation: Define limits to avoid excessive consumption

3. Limit:Request Ratio

Problem: Limit:request ratio too high or too low
Severity: Warning/Error
Recommendation: Adjust to the configured ratio (3:1 by default, see CPU_LIMIT_RATIO and MEMORY_LIMIT_RATIO)
Details: Shows specific request and limit values (e.g., "Request: 100m, Limit: 500m")
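A ratio check of this kind can be sketched as follows. The function names and the tolerance band are assumptions made for illustration; the thresholds the application actually uses to choose between Warning and Error are not documented here.

```python
def parse_cpu_millicores(quantity: str) -> float:
    """Convert a Kubernetes CPU quantity ("100m", "0.5", "2") to millicores."""
    quantity = quantity.strip()
    if quantity.endswith("m"):
        return float(quantity[:-1])
    return float(quantity) * 1000.0


def check_cpu_ratio(request: str, limit: str, target: float = 3.0, tolerance: float = 0.5):
    """Return (ratio, verdict) where verdict is "low", "ok" or "high".

    `tolerance` is a hypothetical band around the target ratio.
    """
    req = parse_cpu_millicores(request)
    if req <= 0:
        raise ValueError("request must be greater than zero")
    ratio = parse_cpu_millicores(limit) / req
    if ratio > target + tolerance:
        return ratio, "high"
    if ratio < target - tolerance:
        return ratio, "low"
    return ratio, "ok"
```

With the example from the Details line, check_cpu_ratio("100m", "500m") yields a ratio of 5.0, which falls above the default 3:1 target.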

4. Minimum Values

Problem: Requests too low
Severity: Warning
Recommendation: Increase to minimum values
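For memory, the comparison needs quantities such as 32Mi converted to bytes first. A minimal sketch, assuming only the common Kubernetes suffixes are in use; parse_memory_bytes and is_below_minimum are hypothetical names, and the 32Mi default comes from MIN_MEMORY_REQUEST in the ConfigMap example.

```python
# Binary (power-of-two) and decimal suffixes used by Kubernetes quantities.
_BINARY = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}
_DECIMAL = {"k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}


def parse_memory_bytes(quantity: str) -> int:
    """Convert a memory quantity ("32Mi", "1G", "1048576") to bytes."""
    quantity = quantity.strip()
    for suffix, factor in _BINARY.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    for suffix, factor in _DECIMAL.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(float(quantity))  # plain number of bytes


def is_below_minimum(request: str, minimum: str = "32Mi") -> bool:
    """True when the memory request is smaller than the configured minimum."""
    return parse_memory_bytes(request) < parse_memory_bytes(minimum)
```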

5. Overcommit

Problem: Requests exceed cluster capacity
Severity: Critical
Recommendation: Reduce requests or add nodes
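The overcommit check reduces to comparing summed requests with allocatable capacity. The sketch below works on plain numbers (for example CPU millicores); overcommit_ratio and is_overcommitted are hypothetical names, not the application's functions.

```python
def overcommit_ratio(requests, capacity):
    """Total requested resources divided by cluster capacity (same unit)."""
    if capacity <= 0:
        raise ValueError("capacity must be greater than zero")
    return sum(requests) / capacity


def is_overcommitted(requests, capacity):
    """True when the summed requests exceed the cluster capacity."""
    return overcommit_ratio(requests, capacity) > 1.0
```

For instance, requests of 500, 1500 and 2500 millicores against 4000 millicores of capacity give a ratio of 1.125, which would be reported as overcommitted.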

6. Insufficient Historical Data

Problem: Workloads with limited historical data for analysis
Severity: Warning
Recommendation: Wait for more data points or enable VPA for new workloads

7. Seasonal Pattern Detection

Problem: Workloads with unpredictable usage patterns
Severity: Info
Recommendation: Consider VPA for dynamic resource adjustments

📈 Reports

JSON Format

{
  "timestamp": "2024-01-15T10:30:00Z",
  "total_pods": 150,
  "total_namespaces": 25,
  "total_nodes": 3,
  "validations": [...],
  "vpa_recommendations": [...],
  "summary": {
    "total_validations": 45,
    "critical_issues": 5,
    "warnings": 25,
    "errors": 15
  }
}
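The summary block can be derived from the validations list by counting severities. This is a sketch that assumes each validation is a dict with a lowercase "severity" key ("critical", "error" or "warning"); summarize is a hypothetical name.

```python
from collections import Counter


def summarize(validations):
    """Build the report's summary section from a list of validation dicts."""
    counts = Counter(v["severity"] for v in validations)
    return {
        "total_validations": len(validations),
        "critical_issues": counts["critical"],
        "warnings": counts["warning"],
        "errors": counts["error"],
    }
```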

CSV Format

Pod Name,Namespace,Container Name,Validation Type,Severity,Message,Recommendation
pod-1,default,nginx,missing_requests,error,Container without defined requests,Define CPU and memory requests
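A file with this layout can be produced with the standard csv module. The column names come from the header row above; to_csv is a hypothetical helper and the row dicts are assumed to use the column names as keys.

```python
import csv
import io

COLUMNS = [
    "Pod Name", "Namespace", "Container Name",
    "Validation Type", "Severity", "Message", "Recommendation",
]


def to_csv(rows):
    """Serialize validation rows (dicts keyed by COLUMNS) to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Using csv.DictWriter rather than string joins means messages containing commas or quotes are escaped correctly.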

🔐 Security

RBAC

The application uses a dedicated ServiceAccount with minimal permissions:

Pods: get, list, watch, patch, update
Namespaces: get, list, watch
Nodes: get, list, watch
VPA: get, list, watch
Deployments/ReplicaSets: get, list, watch, patch, update

Security Context

Runs as non-root user (OpenShift assigns UID automatically)
Uses SecurityContext with runAsNonRoot: true
Defines resource requests and limits for its own containers
Cluster-agnostic security context

🐛 Troubleshooting

Check Logs

oc logs -f deployment/resource-governance -n resource-governance

Check Pod Status

oc get pods -n resource-governance
oc describe pod <pod-name> -n resource-governance

Check RBAC

oc auth can-i get pods --as=system:serviceaccount:resource-governance:resource-governance-sa

Test Connectivity

# Health check
curl https://your-route-url/health

# API test
curl https://your-route-url/api/v1/cluster/status

🚀 Development

Run Locally

# Install dependencies
pip install -r requirements.txt

# Run application
python -m uvicorn app.main:app --reload --host 0.0.0.0 --port 8080

Run with Podman (Recommended)

# Build and push to Quay.io
./scripts/build-and-push.sh

# Deploy to OpenShift
./scripts/deploy-complete.sh

Available Scripts

# Essential scripts (only 4 remaining after cleanup)
./setup.sh                    # Initial environment setup
./scripts/build-and-push.sh   # Build and push to Quay.io
./scripts/deploy-complete.sh  # Complete OpenShift deployment
./scripts/undeploy-complete.sh # Complete application removal

Tests

# Test import
python -c "import app.main; print('OK')"

# Test API
curl http://localhost:8080/health

🆕 Recent Updates

Latest Version (v2.0.0) - PatternFly UI Revolution

🎨 Complete UI Overhaul:

✅ PatternFly Design System: Modern, enterprise-grade UI components
✅ Smart Recommendations Gallery: Individual workload cards with bulk selection
✅ VPA CRD Integration: Real Kubernetes API for Vertical Pod Autoscaler management
✅ Application Branding: "ORU Scanner" - OpenShift Resource Usage Scanner
✅ Resource Utilization Formatting: Human-readable percentages (1 decimal place)
✅ Quay.io Registry: Migrated from Docker Hub to Quay.io for better reliability

🔧 Infrastructure Improvements:

✅ GitHub Actions: Automated build and push to Quay.io
✅ Script Cleanup: Removed 19 obsolete scripts, kept only essential ones
✅ Codebase Organization: Clean, maintainable code structure
✅ Documentation: Updated all documentation files

🚀 Deployment Ready:

✅ Zero Downtime: Rolling updates with proper health checks
✅ Cluster Agnostic: Works on any OpenShift 4.x cluster
✅ Production Tested: Deployed on OCP 4.15, 4.18, and 4.19

📝 Roadmap

🎯 PRAGMATIC ROADMAP - Resource Governance Focus

Core Mission: List projects without requests/limits + provide smart recommendations based on historical analysis + VPA integration

Phase 0: UI/UX Simplification (COMPLETED ✅)

0.1 Interface Simplification

Group similar validations in a single card
Show only the essentials in the main view
Technical details in a modal or expandable section
Color coding: 🔴 Critical, 🟡 Warning, 🔵 Info
Specific icons: ⚡ CPU, 💾 Memory, 📊 Ratio
Collapsible cards to reduce visual clutter

0.2 Improve Visual Hierarchy

Pragmatic dashboard with single view
Direct actions: "Analyze" and "Fix" buttons
Problem Summary table showing namespace issues
Modal-based analysis for detailed views
Professional interface without browser alerts

0.3 Advanced Features

Modal-based analysis for detailed problem inspection
Detailed pod and container analysis with recommendations
Namespace comparison through Problem Summary table

Phase 1: Enhanced Validation & Categorization (COMPLETED ✅)

1.1 Smart Resource Detection

Enhanced Validation Engine
- Better categorization of resource issues (missing requests, missing limits, wrong ratios)
- Severity scoring based on impact and risk
- Detailed analysis of pod and container resource configurations
Workload Analysis System
- Problem Identification: Namespaces with resource configuration issues
- Detailed Analysis: Pod-by-pod breakdown with container details
- Issue Categorization: Missing requests, missing limits, wrong ratios
- Recommendations: Clear guidance on how to fix each issue

1.2 Historical Analysis Integration

Smart Historical Analysis
- Use historical data to suggest realistic requests/limits
- Calculate P95/P99 percentiles for recommendations
- Identify seasonal patterns and trends
- Flag workloads with insufficient historical data
- Real numerical consumption data with cluster percentages
- OpenShift-specific Prometheus queries for better accuracy
- Workload selector with time ranges (1h, 6h, 24h, 7d)
- Simulated data fallback for demonstration
- PromQL query display for validation in OpenShift console
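The percentile step in the list above can be illustrated with a nearest-rank calculation over usage samples. This is one common definition of a percentile, chosen here for simplicity; it is not necessarily the method the application uses, and percentile is a hypothetical helper.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of
    the samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]
```

A P95 taken over a 7-day window could then serve as a candidate request value, with P99 as a reference point for the limit. With very few samples the result is dominated by single spikes, which is one reason to flag workloads with insufficient history.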

1.3 Cluster Overcommit Analysis

Real-time Overcommit Monitoring
- CPU and Memory capacity vs requests analysis
- Detailed tooltips with capacity, requests, and available resources
- Modal-based detailed breakdown of overcommit calculations
- Resource utilization tracking
- Professional UI with info icons and modal interactions

Phase 2: Smart Recommendations Engine (COMPLETED ✅)

2.1 Recommendation Dashboard

Dedicated Recommendations Section
- Replaced generic "VPA Recommendations" with "Smart Recommendations"
- PatternFly Service Card gallery with individual workload cards
- Bulk selection functionality for batch operations
- Priority-based visual indicators and scoring

2.2 Recommendation Types

Resource Configuration Recommendations
- "Add CPU requests: 200m (based on 7-day P95 usage)"
- "Increase memory limits: 512Mi (current usage peaks at 400Mi)"
- "Fix CPU ratio: 3:1 instead of 5:1 (current: 500m limit, 100m request)"
VPA Activation Recommendations
- "Activate VPA for new workload 'example' (insufficient historical data)"
- "Enable VPA for outlier workload 'high-cpu-app' (unpredictable usage patterns)"

2.3 Priority Scoring System

Impact-Based Prioritization
- Critical: Missing limits on high-resource workloads
- High: Missing requests on production workloads
- Medium: Suboptimal ratios on established workloads
- Low: New workloads needing VPA activation
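One way to express these four levels in code is a small classifier plus a sort key. This is a sketch: the issue identifiers other than missing_requests, the keyword flags, and the fallback to "low" for uncovered combinations are all assumptions.

```python
PRIORITY_ORDER = {"critical": 0, "high": 1, "medium": 2, "low": 3}


def priority_for(issue, *, high_resource=False, production=False):
    """Map an issue to a priority level following the list above.

    Combinations the list does not cover fall back to "low" (an assumption).
    """
    if issue == "missing_limits" and high_resource:
        return "critical"
    if issue == "missing_requests" and production:
        return "high"
    if issue == "wrong_ratio":
        return "medium"
    return "low"


def sort_by_priority(recommendations):
    """Order recommendation dicts so the most urgent come first."""
    return sorted(recommendations, key=lambda r: PRIORITY_ORDER[r["priority"]])
```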

2.4 VPA CRD Integration

Real Kubernetes API Integration
- Direct VPA CRD management using Kubernetes CustomObjectsApi
- VPA creation, listing, and deletion functionality
- Real-time VPA status and recommendations
- YAML generation and application capabilities
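A VPA object for a workload can be built as a plain dictionary before it is applied or rendered as YAML. The sketch below is illustrative only: build_vpa_manifest, the "-vpa" name suffix and the default update mode of "Off" (recommendation only) are assumptions, not the application's behavior.

```python
VALID_UPDATE_MODES = {"Off", "Initial", "Recreate", "Auto"}


def build_vpa_manifest(name, namespace, target_kind="Deployment", update_mode="Off"):
    """Return a VerticalPodAutoscaler manifest targeting the given workload."""
    if update_mode not in VALID_UPDATE_MODES:
        raise ValueError(f"unsupported update mode: {update_mode!r}")
    return {
        "apiVersion": "autoscaling.k8s.io/v1",
        "kind": "VerticalPodAutoscaler",
        "metadata": {"name": f"{name}-vpa", "namespace": namespace},  # suffix is hypothetical
        "spec": {
            "targetRef": {"apiVersion": "apps/v1", "kind": target_kind, "name": name},
            "updatePolicy": {"updateMode": update_mode},
        },
    }

# With the official kubernetes Python client, a manifest like this could be
# submitted via CustomObjectsApi().create_namespaced_custom_object(
#     group="autoscaling.k8s.io", version="v1", namespace=namespace,
#     plural="verticalpodautoscalers", body=manifest)
```

Starting in "Off" mode lets the recommender produce suggestions without evicting pods, which fits a review-first workflow.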

Phase 3: VPA Integration & Automation (MEDIUM TERM - 3-4 weeks)

3.1 VPA Detection & Management

VPA Status Detection
- Detect existing VPAs in cluster
- Show VPA health and status
- Display current VPA recommendations
- Compare VPA suggestions with current settings

3.2 Smart VPA Activation

Automatic VPA Suggestions
- Suggest VPA activation for new workloads (< 7 days)
- Recommend VPA for outlier workloads
- Provide VPA YAML configurations
- Show estimated benefits of VPA activation

3.3 VPA Recommendation Integration

VPA Data Integration
- Fetch VPA recommendations from cluster
- Compare VPA suggestions with historical analysis
- Show confidence levels for recommendations
- Display VPA update modes and policies

Phase 4: Action Planning & Implementation (LONG TERM - 4-6 weeks)

4.1 Action Plan Generation

Step-by-Step Action Plans
- Generate specific kubectl/oc commands
- Show before/after resource configurations
- Estimate implementation time and effort
- Provide rollback procedures
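As an illustration of what a generated command might look like, the sketch below renders an oc set resources invocation from a recommendation. Since this phase is still planned, everything here is hypothetical: the function name, its arguments and the choice of oc set resources as the mechanism.

```python
import shlex


def set_resources_command(kind, name, namespace, requests=None, limits=None):
    """Render an `oc set resources` command line for one workload."""
    def fmt(resources):
        # Sorted for stable output, e.g. "cpu=200m,memory=256Mi".
        return ",".join(f"{key}={value}" for key, value in sorted(resources.items()))

    args = ["oc", "set", "resources", f"{kind}/{name}", "-n", namespace]
    if requests:
        args.append(f"--requests={fmt(requests)}")
    if limits:
        args.append(f"--limits={fmt(limits)}")
    return shlex.join(args)  # quotes any argument that needs it
```

Rendering the command as text rather than executing it keeps the operator in control and gives a natural place to show the before/after values side by side.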

4.2 Implementation Tracking

Progress Monitoring
- Track which recommendations have been implemented
- Show improvement metrics after changes
- Alert on new issues or regressions
- Generate implementation reports

4.3 Advanced Analytics

Cost Optimization Insights
- Show potential cost savings from recommendations
- Identify over-provisioned resources
- Suggest right-sizing opportunities
- Display resource utilization trends

Phase 5: Enterprise Features (FUTURE - 6+ weeks)

5.1 Advanced Governance

Policy Enforcement
- Custom resource policies per namespace
- Automated compliance checking
- Policy violation alerts
- Governance reporting

5.2 Multi-Cluster Support

Cross-Cluster Analysis
- Compare resource usage across clusters
- Centralized recommendation management
- Cross-cluster best practices
- Unified reporting

🎯 IMMEDIATE NEXT STEPS (This Week)

Priority 1: Enhanced Validation Engine

Improve Resource Detection
- Better categorization of missing requests/limits
- Add workload age detection
- Implement severity scoring
Smart Categorization
- New workloads (< 7 days) → VPA candidates
- Established workloads (> 7 days) → Historical analysis
- Outlier workloads → Special attention needed
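The categorization above can be sketched as a function of workload age. The names and return values are hypothetical, outlier detection is assumed to happen elsewhere and arrive as a flag, and a workload that is exactly 7 days old is treated as established (the list leaves that boundary open).

```python
from datetime import datetime, timedelta, timezone


def categorize_workload(created_at, now=None, is_outlier=False, threshold_days=7):
    """Classify a workload as "outlier", "vpa_candidate" or "historical_analysis".

    `created_at` and `now` must be timezone-aware datetimes.
    """
    now = now or datetime.now(timezone.utc)
    if is_outlier:
        return "outlier"
    if now - created_at < timedelta(days=threshold_days):
        return "vpa_candidate"  # new workload: too little history to analyze
    return "historical_analysis"
```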

Priority 2: Recommendation Dashboard

Create Recommendations Section
- Replace generic VPA section
- Show actionable insights
- Display priority levels
Historical Analysis Integration
- Use Prometheus data for recommendations
- Calculate realistic resource suggestions
- Show confidence levels

Priority 3: VPA Integration

VPA Detection
- Find existing VPAs in cluster
- Show VPA status and health
- Display current recommendations
Smart VPA Suggestions
- Identify VPA candidates
- Generate VPA configurations
- Show estimated benefits

🤝 Contributing

1. Fork the project
2. Create a branch for your feature (git checkout -b feature/AmazingFeature)
3. Commit your changes (git commit -m 'Add some AmazingFeature')
4. Push to the branch (git push origin feature/AmazingFeature)
5. Open a Pull Request

📄 License

This project is licensed under the MIT License. See the LICENSE file for details.

📞 Support

For support and questions:

Open an issue on GitHub
Consult OpenShift documentation
Check application logs