
Gitea High Availability Best Practices

Overview

This document outlines the requirements, best practices, and implementation details for deploying Gitea in a highly available (HA) configuration on Google Kubernetes Engine (GKE).

What is High Availability for Gitea?

High Availability ensures that your Gitea instance remains operational and accessible even when individual components fail. A truly HA Gitea deployment requires redundancy and fault tolerance across all critical components:

  • Application Layer: Multiple Gitea replicas
  • Data Layer: Persistent storage with high availability
  • Database Layer: PostgreSQL with replication and failover
  • Cache Layer: Redis/Valkey with clustering
  • Load Balancing: Distribution of traffic across replicas

Requirements for True HA Gitea Deployment

1. Multiple Gitea Replicas

  • Deploy at least 3 Gitea instances for redundancy
  • Use Kubernetes Deployments with multiple replicas
  • Configure pod anti-affinity to spread pods across nodes/zones
  • Implement health checks (readiness and liveness probes)
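
The requirements above can be sketched as a Deployment manifest. This is a minimal illustration, not our production manifest; the names, image tag, and probe endpoint are assumptions (Gitea exposes a health endpoint at /api/healthz):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gitea                      # illustrative name
spec:
  replicas: 3                      # at least 3 instances for redundancy
  selector:
    matchLabels:
      app: gitea
  template:
    metadata:
      labels:
        app: gitea
    spec:
      affinity:
        podAntiAffinity:
          # Prefer spreading replicas across zones; use the
          # requiredDuringScheduling... variant for a hard guarantee.
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: gitea
                topologyKey: topology.kubernetes.io/zone
      containers:
        - name: gitea
          image: gitea/gitea:1.22  # illustrative tag
          ports:
            - containerPort: 3000
          readinessProbe:          # gate traffic until Gitea responds
            httpGet:
              path: /api/healthz
              port: 3000
          livenessProbe:           # restart a wedged pod
            httpGet:
              path: /api/healthz
              port: 3000
            initialDelaySeconds: 60
```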

2. Shared Persistent Storage

Gitea requires shared storage accessible by all replicas for:

  • Git repositories
  • LFS (Large File Storage) objects
  • Avatars and attachments
  • Custom assets

Storage Options:

  • ReadWriteMany (RWX) volumes: Required for multiple pods to access simultaneously
  • Object storage: S3-compatible storage (GCS, AWS S3, MinIO)
  • Network file systems: NFS, GlusterFS, or cloud-provided solutions
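
For the RWX requirement, the claim all Gitea replicas share looks roughly like this (the claim name and StorageClass are placeholders; the class must be backed by an RWX-capable provisioner such as Filestore or the GCS FUSE driver):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: gitea-shared-data          # placeholder name
spec:
  accessModes:
    - ReadWriteMany                # every Gitea replica mounts the same volume
  storageClassName: rwx-class      # placeholder: an RWX-capable StorageClass
  resources:
    requests:
      storage: 100Gi               # illustrative size
```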

3. External Database (PostgreSQL)

Why External Database?

  • Performance: Dedicated resources for database operations
  • Reliability: Built-in backup, replication, and failover mechanisms
  • Scalability: Independent scaling from application layer
  • Management: Automated maintenance, patching, and monitoring
  • Resource Isolation: Prevents database load from affecting Gitea pods

HA Requirements:

  • Primary-replica replication
  • Automatic failover capability
  • Regular automated backups
  • Point-in-time recovery (PITR)
  • Connection pooling
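
With the Gitea Helm chart, pointing at an external HA PostgreSQL is roughly the following values fragment. Host, credentials, and database name are placeholders; verify the key names against the values reference of the chart version in use:

```yaml
# values.yaml fragment (sketch)
postgresql:
  enabled: false               # disable the bundled single-node PostgreSQL
postgresql-ha:
  enabled: false               # disable the bundled HA PostgreSQL as well
gitea:
  config:
    database:
      DB_TYPE: postgres
      HOST: 10.0.0.5:5432      # placeholder: external PostgreSQL endpoint
      NAME: gitea
      USER: gitea
      PASSWD: change-me        # in practice, source this from a Secret
```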

4. External Cache Layer (Redis/Valkey)

Why External Cache?

  • Performance: Reduced latency for session and cache operations
  • Consistency: Shared cache state across all Gitea replicas
  • Reliability: Cluster mode with replication and automatic failover
  • Resource Efficiency: Prevents memory pressure on Gitea pods
  • Monitoring: Dedicated observability for cache performance

HA Requirements:

  • Cluster mode with multiple nodes
  • Replication for data redundancy
  • Automatic failover and resharding
  • Persistence configuration (AOF/RDB) for data durability
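
Wiring Gitea's cache, session, and queue layers to an external Redis/Valkey cluster can be sketched with the following app.ini settings (expressed here as Helm chart values). The endpoint is a placeholder; Gitea's redis+cluster:// scheme targets cluster-mode deployments:

```yaml
# values.yaml fragment (sketch)
redis-cluster:
  enabled: false                                 # disable the bundled Redis cluster
gitea:
  config:
    cache:
      ADAPTER: redis
      HOST: "redis+cluster://10.0.0.10:6379/0"   # placeholder cluster endpoint
    session:
      PROVIDER: redis
      PROVIDER_CONFIG: "redis+cluster://10.0.0.10:6379/0"
    queue:
      TYPE: redis
      CONN_STR: "redis+cluster://10.0.0.10:6379/0"
```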

5. Load Balancing

  • Kubernetes Service with proper session affinity configuration
  • Istio VirtualService for advanced traffic routing and management
  • Service mesh capabilities for observability and security
  • Health check integration with load balancer
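
An Istio VirtualService routing traffic to the Gitea Service looks roughly like this; the hostname and gateway name are placeholders and assume an Istio Gateway already exists:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: gitea
spec:
  hosts:
    - git.example.com      # placeholder hostname
  gateways:
    - gitea-gateway        # assumes an existing Istio Gateway
  http:
    - route:
        - destination:
            host: gitea    # the Kubernetes Service in front of the Gitea pods
            port:
              number: 3000
```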

6. Infrastructure Considerations

  • Multi-zone deployment for regional fault tolerance
  • Appropriate resource requests and limits
  • Pod Disruption Budgets (PDB) to maintain availability during updates
  • Network policies for security
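
A Pod Disruption Budget matching the three-replica baseline above might look like this sketch:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: gitea-pdb
spec:
  minAvailable: 2        # with 3 replicas, keep at least 2 up during voluntary disruptions
  selector:
    matchLabels:
      app: gitea
```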

Current Implementation Architecture

Deployment Overview

Our Gitea HA deployment on GKE takes a hybrid approach: Kubernetes-native features for the application layer, GCP-managed services for everything stateful, trading a small amount of vendor coupling for better performance, reliability, and operational efficiency.

Component Architecture

1. Gitea Application Layer (GKE)

Deployment Configuration:

  • Deployed as Kubernetes Deployment with multiple replicas
  • Configured with pod anti-affinity for distribution across nodes
  • Health checks configured for automatic recovery
  • Horizontal Pod Autoscaling (HPA) enabled for automatic scaling based on resource metrics
  • Service exposed through Istio VirtualService for advanced traffic management and routing
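
The HPA configuration mentioned above can be sketched as follows; the target name, ceiling, and utilization threshold are illustrative assumptions:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: gitea
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: gitea            # placeholder: the Gitea Deployment
  minReplicas: 3           # never drop below the HA baseline
  maxReplicas: 10          # illustrative ceiling
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out above 70% average CPU
```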

2. Persistent Storage (GCS with Fuse Driver)

Implementation:

  • Storage Backend: Google Cloud Storage (GCS) bucket
  • Mount Method: Cloud Storage FUSE CSI driver, exposed as a Persistent Volume (PV)
  • Access Mode: ReadWriteMany (RWX) for multi-pod access
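
Statically provisioning the bucket through the GKE Cloud Storage FUSE CSI driver looks roughly like this; the bucket and object names are placeholders, and the capacity is nominal since GCS itself is unbounded:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: gitea-gcs-pv
spec:
  capacity:
    storage: 100Gi                   # nominal; GCS does not enforce this
  accessModes:
    - ReadWriteMany
  storageClassName: ""               # static provisioning
  mountOptions:
    - implicit-dirs                  # surface GCS "directories" to the filesystem
  csi:
    driver: gcsfuse.csi.storage.gke.io
    volumeHandle: my-gitea-bucket    # placeholder: the GCS bucket name
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: gitea-gcs-pvc
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""
  volumeName: gitea-gcs-pv           # bind to the static PV above
  resources:
    requests:
      storage: 100Gi
```

Note that pods consuming this volume also need the gke-gcsfuse/volumes: "true" annotation so GKE injects the FUSE sidecar.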

Benefits:

  • Highly durable object storage (99.999999999% durability)
  • Unlimited scalability without pre-provisioning
  • Regional/multi-regional replication built-in
  • Cost-effective for large repositories
  • No storage capacity management required

3. Database Layer (Cloud SQL PostgreSQL HA)

Why Cloud SQL Instead of GKE-Hosted PostgreSQL:

The Gitea official Helm chart explicitly recommends using external managed database services. Here's why we chose Cloud SQL:

Performance & High Availability:

  • Dedicated compute, memory, and optimized disk I/O without resource contention with GKE workloads
  • Automatic failover to standby replica (typically <60 seconds) with synchronous replication for zero data loss
  • Regional redundancy with automatic zone placement and built-in backup with point-in-time recovery
  • Connection pooling and query optimization built-in

Operational Efficiency & Cluster Optimization:

  • Automated security patches, storage scaling, and monitoring with Cloud Monitoring integration
  • Eliminates need for PostgreSQL operators or StatefulSets, reducing GKE cluster complexity
  • Prevents database from consuming cluster resources, allowing GKE to focus on stateless application workloads
  • Better cost optimization through independent scaling controls and reduced operational overhead
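
One common way to reach Cloud SQL from the Gitea pods is a Cloud SQL Auth Proxy sidecar; the sketch below assumes private-IP connectivity, and the image tag and instance connection name are placeholders:

```yaml
# Sidecar added to the Gitea pod spec (sketch)
containers:
  - name: cloud-sql-proxy
    image: gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.14.0  # illustrative tag
    args:
      - "--private-ip"                        # connect over the VPC, not public IP
      - "my-project:europe-west1:gitea-pg"    # placeholder instance connection name
    securityContext:
      runAsNonRoot: true
```

Gitea then points its database HOST at 127.0.0.1:5432. Connecting directly to the instance's private IP also works and avoids the sidecar.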

4. Cache Layer (Valkey Cluster Mode - GCP Managed)

Why GCP Managed Valkey Instead of GKE-Hosted Redis/Valkey:

Similar to PostgreSQL, the Gitea Helm chart recommends external cache services. Our choice of GCP-managed Valkey (Memorystore) provides:

Performance & High Availability:

  • Sub-millisecond latency with dedicated memory and network resources, eliminating memory pressure on GKE nodes
  • Cluster mode with automatic sharding, multiple replicas per shard, and automatic failover within seconds
  • Data persistence with AOF and RDB snapshots, plus cross-zone replication for regional resilience

Operational Efficiency & Cluster Optimization:

  • Automated scaling (memory and throughput), backups, security patching, and performance insights
  • Eliminates need for Redis/Valkey operators or StatefulSets, simplifying GKE resource planning
  • Prevents cache memory usage from impacting Gitea pods and eliminates risk of cache eviction due to node memory pressure
  • Independent scaling of cache layer reduces overall GKE cluster size requirements

Implementation Benefits Summary

Why This Architecture?

Alignment with Best Practices:

  • Follows Gitea official Helm chart recommendations for external services
  • Implements industry-standard HA patterns
  • Leverages managed services where appropriate

Reliability:

  • Multiple layers of redundancy across all components
  • Automatic failover for database and cache layers
  • Resilient storage with built-in replication
  • No single point of failure

Performance:

  • Dedicated resources for each layer (compute, database, cache)
  • Optimized I/O paths for each service type
  • Reduced latency through managed service optimization
  • No resource contention within GKE cluster

Operational Efficiency:

  • Reduced operational burden through managed services
  • Simplified GKE cluster management
  • Automated maintenance and patching
  • Better observability with native GCP monitoring

Scalability:

  • Independent scaling of application, database, and cache layers
  • Unlimited storage capacity with GCS
  • Elastic compute resources on GKE
  • Predictable performance under load

Cost Optimization:

  • Pay-for-what-you-use with managed services
  • No over-provisioning of GKE cluster resources
  • Efficient resource utilization across layers
  • Reduced operational costs (less manual management)

Recommendation Rationale

Following Gitea Helm Chart Guidance

The official Gitea Helm chart documentation explicitly states:

"For production deployments, it is highly recommended to use external PostgreSQL and Redis/Valkey services rather than the built-in ones. This ensures better performance, reliability, and easier maintenance."

Our implementation strictly adheres to this guidance by:

  1. External PostgreSQL: Using Cloud SQL PostgreSQL HA instead of in-cluster PostgreSQL
  2. External Cache: Using GCP Managed Valkey in cluster mode instead of in-cluster Redis
  3. Persistent Storage: Using GCS with Fuse driver for shared, durable storage
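
Taken together, the three points above reduce to a short values fragment for the Gitea Helm chart: disable every bundled stateful dependency and mount the pre-created GCS-backed claim. Key names should be verified against the chart version in use, and the claim name is a placeholder:

```yaml
# values.yaml summary (sketch)
postgresql:
  enabled: false            # 1. external Cloud SQL PostgreSQL HA instead
postgresql-ha:
  enabled: false
redis-cluster:
  enabled: false            # 2. external GCP-managed Valkey cluster instead
persistence:
  enabled: true
  claimName: gitea-gcs-pvc  # 3. placeholder: PVC backed by the GCS FUSE PV
```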

GKE Cluster Focus

By offloading database and cache to managed services, our GKE cluster can:

  • Focus exclusively on running Gitea application pods
  • Maintain consistent performance without database/cache resource contention
  • Scale independently based on application traffic
  • Remain lighter and more cost-effective
  • Be easier to manage and upgrade

This separation of concerns is a cloud-native best practice that improves overall system reliability and operational efficiency.


Monitoring and Maintenance

Health Checks

  • Gitea pod readiness and liveness probes
  • Cloud SQL connection monitoring
  • Valkey cluster health monitoring
  • GCS bucket access verification

Backup Strategy

  • Cloud SQL automated backups (daily + PITR)
  • GCS bucket versioning and retention policies
  • Regular disaster recovery testing

Scaling Considerations

  • Gitea pod HPA (Horizontal Pod Autoscaler) automatically scales replicas based on CPU/memory metrics
  • Istio VirtualService ensures seamless traffic distribution during scaling events
  • Cloud SQL vertical scaling for database performance
  • Valkey cluster scaling for cache capacity
  • GCS automatically scales with usage

Conclusion

Our Gitea HA deployment implements a production-ready, highly available architecture that follows official recommendations and cloud-native best practices. By leveraging GCP managed services for PostgreSQL and Valkey, we achieve superior reliability, performance, and operational efficiency while keeping the GKE cluster focused on its core responsibility: running Gitea application workloads.

This architecture provides:

  • True high availability with no single point of failure
  • Optimal performance through dedicated resources
  • Simplified operations through managed services
  • Cost-effective scaling at each layer
  • Production-grade reliability and disaster recovery