Solving EKS Fargate Logging: A Loki-Based Approach for Serverless Kubernetes
In early 2020, AWS EKS Fargate was gaining adoption for serverless Kubernetes workloads, but it came with a significant limitation: traditional logging approaches didn’t work. Without access to host-level log files and before AWS Firelens support was available, getting logs out of Fargate pods was challenging. This post details the lightweight solution I built using Grafana Loki.
The EKS Fargate Logging Challenge
Traditional Kubernetes Logging Approaches
In standard Kubernetes deployments, you typically have several logging options:
- Node-level logging agents (like Fluentd/Fluent Bit) reading from /var/log
- Sidecar containers with shared volumes
- Direct application logging to external systems
Fargate’s Limitations
EKS Fargate introduced constraints that broke these patterns:
- No Host Access: Fargate pods run in isolated environments without access to host-level log directories
- No Persistent Storage: Limited volume mounting options
- No Node Agents: Can’t run DaemonSets for log collection
- Immutable Infrastructure: Pods are ephemeral with no persistent logging infrastructure
The Missing Piece
As of February 2020, AWS Firelens (the now-standard solution) wasn’t supported on EKS Fargate. The AWS containers roadmap showed it was planned, but teams needed logging solutions immediately.
Available Workarounds and Their Problems
Sidecar Approach
# Traditional sidecar logging
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
      - name: app
        image: myapp:latest
        volumeMounts:
        - name: logs
          mountPath: /var/log/app
      - name: log-shipper
        image: fluent/fluent-bit:latest
        volumeMounts:
        - name: logs
          mountPath: /var/log/app
Problems:
- Resource Overhead: Every pod needs an additional container
- Maintenance Burden: Log shipper updates across all applications
- Configuration Complexity: Per-application log parsing rules
- Cost Impact: 2x container count increases Fargate costs significantly
Application-Level Logging
# Direct logging from application
import logging
import requests
# Send logs directly to external service
def send_log(message):
    requests.post("https://logs.company.com/api/logs", json={"message": message})
Problems:
- Intrusive Changes: Requires modifying all applications
- Dependency Risk: Applications become tightly coupled with logging infrastructure
- Development Overhead: Every team needs logging expertise
- Failure Handling: Applications must handle logging service outages
Solution: Namespace-Level Log Aggregation
Instead of per-pod logging, I developed a namespace-level approach using a single log aggregation pod per namespace.
Architecture Overview
┌─────────────────┐      ┌──────────────────┐      ┌─────────────────┐
│   App Pod 1     │      │                  │      │                 │
│  (stdout/err)   │─────▶│  Log Aggregator  │─────▶│  Grafana Loki   │
└─────────────────┘      │       Pod        │      │    (Remote)     │
┌─────────────────┐      │                  │      └─────────────────┘
│   App Pod 2     │─────▶│  - Watches pods  │
│  (stdout/err)   │      │  - Collects logs │
└─────────────────┘      │  - Ships to Loki │
┌─────────────────┐      └──────────────────┘
│   App Pod N     │─────▶
│  (stdout/err)   │
└─────────────────┘
Key Design Principles
- Minimal Intrusion: No changes to existing applications
- Namespace Isolation: One log aggregator per namespace
- Standard Outputs: Leverage Kubernetes’ built-in log collection
- Cost Efficient: Single additional pod vs sidecar per pod
- Easy Maintenance: Centralized log shipping configuration
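The whole design hinges on Loki's push API: the aggregator only has to translate pod output into the payload shape that /loki/api/v1/push expects, namely streams of [timestamp_ns, line] pairs grouped by a label set. A minimal sketch of a single push (the endpoint URL and labels are placeholders matching the examples later in this post):

import time
import requests

# Minimal sketch of one Loki push request (illustrative endpoint and labels).
payload = {
    "streams": [
        {
            "stream": {"namespace": "production", "pod": "user-service-abc123"},
            "values": [
                [str(time.time_ns()), "GET /healthz 200"],
            ],
        }
    ]
}

resp = requests.post(
    "https://loki.company.com/loki/api/v1/push",  # placeholder Loki endpoint
    json=payload,
    headers={"Content-Type": "application/json"},
    timeout=10,
)
resp.raise_for_status()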
Implementation Details
Log Aggregator Deployment
# k8s/log-aggregator-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: log-aggregator
  namespace: production  # Deploy per namespace
spec:
  replicas: 1
  selector:
    matchLabels:
      app: log-aggregator
  template:
    metadata:
      labels:
        app: log-aggregator
    spec:
      serviceAccountName: log-aggregator
      containers:
      - name: aggregator
        image: lucidprogrammer/fargate-loki-client:latest
        env:
        - name: LOKI_URL
          value: "https://loki.company.com/loki/api/v1/push"
        - name: NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: CLUSTER_NAME
          value: "production-eks"
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 200m
            memory: 256Mi
Service Account Configuration
# k8s/rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: log-aggregator
  namespace: production
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: log-reader
  namespace: production
rules:
- apiGroups: [""]
  resources: ["pods", "pods/log"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: log-aggregator-binding
  namespace: production
subjects:
- kind: ServiceAccount
  name: log-aggregator
  namespace: production
roleRef:
  kind: Role
  name: log-reader
  apiGroup: rbac.authorization.k8s.io
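These permissions are easy to sanity-check from inside the cluster before deploying the aggregator. A small optional sketch using the Kubernetes Python client's SelfSubjectAccessReview API (run in a pod under the log-aggregator service account; the namespace name is a placeholder):

from kubernetes import client, config

# Verify the bound service account may watch pods and read pod logs
# in its own namespace. Assumes in-cluster execution under the
# log-aggregator service account.
config.load_incluster_config()
authz = client.AuthorizationV1Api()

checks = [
    dict(resource="pods", verb="watch"),
    dict(resource="pods", subresource="log", verb="get"),
]
for attrs in checks:
    review = client.V1SelfSubjectAccessReview(
        spec=client.V1SelfSubjectAccessReviewSpec(
            resource_attributes=client.V1ResourceAttributes(namespace="production", **attrs)
        )
    )
    result = authz.create_self_subject_access_review(review)
    print(attrs, "allowed:", result.status.allowed)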
Log Collection Logic
The core aggregator implementation:
#!/usr/bin/env python3
"""
EKS Fargate Loki Log Aggregator
Collects logs from all pods in a namespace and ships to Loki
"""
import os
import time
import json
import requests
import logging
from datetime import datetime
from kubernetes import client, config, watch
from threading import Thread
import queue
# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
class FargateLokiClient:
    def __init__(self):
        # Load Kubernetes config (in-cluster)
        config.load_incluster_config()
        self.v1 = client.CoreV1Api()

        # Configuration from environment
        self.loki_url = os.getenv('LOKI_URL', 'http://loki:3100/loki/api/v1/push')
        self.namespace = os.getenv('NAMESPACE', 'default')
        self.cluster_name = os.getenv('CLUSTER_NAME', 'unknown')
        self.batch_size = int(os.getenv('BATCH_SIZE', '100'))
        self.batch_timeout = int(os.getenv('BATCH_TIMEOUT', '5'))

        # Log batching
        self.log_queue = queue.Queue()
        self.batch_thread = Thread(target=self._batch_processor, daemon=True)
        self.batch_thread.start()

        logger.info(f"Started Fargate Loki client for namespace: {self.namespace}")
    def start_log_collection(self):
        """Start watching pods and collecting logs"""
        logger.info("Starting log collection...")
        streamed_pods = set()

        # Get initial pod list
        pods = self.v1.list_namespaced_pod(namespace=self.namespace)
        for pod in pods.items:
            if pod.status.phase == 'Running':
                streamed_pods.add(pod.metadata.name)
                self._start_pod_log_stream(pod)

        # Watch for new pods (they appear as Pending first, then transition to Running)
        w = watch.Watch()
        for event in w.stream(self.v1.list_namespaced_pod, namespace=self.namespace):
            pod = event['object']
            event_type = event['type']

            if (event_type in ('ADDED', 'MODIFIED')
                    and pod.status.phase == 'Running'
                    and pod.metadata.name not in streamed_pods):
                logger.info(f"New pod detected: {pod.metadata.name}")
                streamed_pods.add(pod.metadata.name)
                self._start_pod_log_stream(pod)
    def _start_pod_log_stream(self, pod):
        """Start log streaming for a specific pod"""
        pod_name = pod.metadata.name

        # Skip our own logs to avoid recursion
        if pod_name.startswith('log-aggregator'):
            return

        logger.info(f"Starting log stream for pod: {pod_name}")

        # Start a thread for each container in the pod
        for container in pod.spec.containers:
            thread = Thread(
                target=self._stream_container_logs,
                args=(pod_name, container.name),
                daemon=True
            )
            thread.start()
    def _stream_container_logs(self, pod_name, container_name):
        """Stream logs from a specific container"""
        try:
            # Stream logs with follow=True for real-time collection
            log_stream = self.v1.read_namespaced_pod_log(
                name=pod_name,
                namespace=self.namespace,
                container=container_name,
                follow=True,
                _preload_content=False
            )

            for line in log_stream:
                if line:
                    log_entry = {
                        'timestamp': datetime.utcnow().isoformat() + 'Z',
                        'pod': pod_name,
                        'container': container_name,
                        'namespace': self.namespace,
                        'cluster': self.cluster_name,
                        'message': line.decode('utf-8').strip()
                    }
                    self.log_queue.put(log_entry)

        except Exception as e:
            logger.error(f"Error streaming logs for {pod_name}/{container_name}: {e}")
    def _batch_processor(self):
        """Process logs in batches and send to Loki"""
        batch = []
        last_send_time = time.time()

        while True:
            try:
                # Get a log entry with a timeout so timed flushes still happen
                try:
                    log_entry = self.log_queue.get(timeout=1)
                    batch.append(log_entry)
                except queue.Empty:
                    pass

                # Send the batch if the size or time threshold is reached
                current_time = time.time()
                if (len(batch) >= self.batch_size or
                        (batch and (current_time - last_send_time) >= self.batch_timeout)):
                    self._send_to_loki(batch)
                    batch = []
                    last_send_time = current_time

            except Exception as e:
                logger.error(f"Error in batch processor: {e}")
    def _send_to_loki(self, log_entries):
        """Send log entries to Loki"""
        if not log_entries:
            return

        # Group entries into Loki streams by their label set
        streams = {}
        for entry in log_entries:
            labels = (
                ('namespace', entry['namespace']),
                ('pod', entry['pod']),
                ('container', entry['container']),
                ('cluster', entry['cluster'])
            )
            # Loki expects each value as [timestamp_ns, log_line]
            timestamp_ns = str(int(time.time() * 1_000_000_000))
            streams.setdefault(labels, []).append([timestamp_ns, entry['message']])

        # Build the final push payload
        loki_payload = {"streams": []}
        for labels, values in streams.items():
            loki_payload["streams"].append({
                "stream": dict(labels),
                "values": values
            })

        try:
            response = requests.post(
                self.loki_url,
                json=loki_payload,
                headers={'Content-Type': 'application/json'},
                timeout=10
            )
            response.raise_for_status()
            logger.info(f"Sent {len(log_entries)} log entries to Loki")
        except requests.RequestException as e:
            logger.error(f"Failed to send logs to Loki: {e}")
if __name__ == "__main__":
    aggregator = FargateLokiClient()
    aggregator.start_log_collection()
Development and Testing with Telepresence
For development, I used Telepresence to iterate quickly:
# Deploy a dummy pod for Telepresence connection
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: log-exporter
  namespace: dev1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: log-exporter
  template:
    metadata:
      labels:
        app: log-exporter
    spec:
      containers:
      - name: placeholder
        image: datawire/telepresence-k8s:0.103
        command: ["/bin/sleep", "3600"]
EOF

# Connect via Telepresence for development
telepresence --namespace dev1 --deployment log-exporter --run-shell

# Now develop locally with cluster access
pip install -r requirements.txt
python fargate_loki_client.py
This approach allowed rapid iteration while maintaining access to the Kubernetes API and network.
Production Deployment Experience
Deployment across Multiple Namespaces
# Deploy to production namespaces
for namespace in production staging dev1 dev2; do
  kubectl create namespace $namespace --dry-run=client -o yaml | kubectl apply -f -

  # Deploy log aggregator per namespace
  helm install log-aggregator ./charts/fargate-loki-client \
    --namespace $namespace \
    --set loki.url="https://loki.company.com/loki/api/v1/push" \
    --set cluster.name="production-eks-us-east-1"
done
Resource Utilization
Per-namespace log aggregator resource usage:
- CPU: 50-100m average, 200m peak during log bursts
- Memory: 64-128Mi average, 256Mi peak for batch processing
- Network: 1-5 Mbps depending on log volume
Cost Comparison (February 2020 Fargate pricing):
- Sidecar Approach: +100% container cost (2x pods)
- Namespace Aggregator: +5-10% cost (1 additional pod per namespace)
- Break-even Point: 10+ pods per namespace (rough arithmetic sketched below)
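As a back-of-the-envelope check on that break-even claim, here is a sketch in which every rate and pod size is an illustrative assumption rather than a quoted AWS price:

# Break-even sketch: N pods with log-shipper sidecars vs one aggregator per namespace.
# Every number below is an illustrative assumption, not a quoted AWS price.
VCPU_HOUR, GB_HOUR, HOURS = 0.04048, 0.004445, 730   # assumed Fargate rates, ~1 month

def monthly(vcpu_delta, gb_delta):
    return (vcpu_delta * VCPU_HOUR + gb_delta * GB_HOUR) * HOURS

# Assume a sidecar bumps a pod's billed memory by ~0.25 GB on average
# (Fargate rounds pod sizes, so not every pod moves up a full tier).
per_sidecar = monthly(0.0, 0.25)
# One dedicated aggregator pod per namespace (0.25 vCPU / 0.5 GB assumed).
aggregator = monthly(0.25, 0.5)

print(f"per sidecar ≈ ${per_sidecar:.2f}/mo, aggregator ≈ ${aggregator:.2f}/mo, "
      f"break-even ≈ {aggregator / per_sidecar:.0f} pods")

With these assumptions the aggregator pays for itself at roughly ten pods per namespace, in line with what we observed; different sidecar sizes shift the exact crossover, but the flat-versus-linear shape of the comparison stays the same.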
Observability Improvements
With logs flowing into Loki, we gained:
Application Debugging:
# Query logs by pod
{cluster="production-eks", namespace="api", pod="user-service-abc123"}
# Find errors across namespace
{cluster="production-eks", namespace="api"} |= "ERROR"
# Monitor deployment rollouts
{cluster="production-eks", namespace="api"} | json | deployment_version != ""
Operational Insights:
# Log volume by container
sum by (container) (rate({cluster="production-eks"}[5m]))
# Error rates by service
sum by (pod) (rate({cluster="production-eks"} |= "ERROR" [5m])) /
sum by (pod) (rate({cluster="production-eks"}[5m]))
Performance and Scale Analysis
Scaling Characteristics
Log Volume Handled:
- Small Namespace (1-5 pods): 100-500 log lines/minute
- Medium Namespace (10-20 pods): 1K-5K log lines/minute
- Large Namespace (50+ pods): 10K+ log lines/minute
Resource Scaling:
- Memory usage scales with batch size and log velocity (see the sizing sketch below)
- CPU usage correlates with log parsing and HTTP requests
- Network bandwidth depends on log verbosity and retention
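A rough way to reason about aggregator sizing, using assumed numbers rather than measurements:

# Back-of-the-envelope memory bound for the in-flight batch (all sizes assumed).
avg_line_bytes = 300      # assumed average log line, including label/JSON overhead
batch_size = 100          # BATCH_SIZE default in the aggregator
queue_depth = 1_000       # lines buffered during a brief Loki slowdown (assumed)

batch_kib = batch_size * avg_line_bytes / 1024
queue_kib = queue_depth * avg_line_bytes / 1024
print(f"~{batch_kib:.0f} KiB per batch, ~{queue_kib:.0f} KiB queued under backpressure")

Even with generous assumptions, the in-flight data stays in the tens to hundreds of KiB, which is why the 128-256Mi memory envelope held up in practice.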
Failure Modes and Resilience
Loki Outage Handling:
def _send_to_loki(self, log_entries):
    try:
        # Send to Loki (payload built from log_entries as shown in the full implementation)
        response = requests.post(self.loki_url, json=payload, timeout=10)
        response.raise_for_status()
    except requests.RequestException as e:
        # Fallback: write entries to stdout so cluster-level logging can pick them up
        for entry in log_entries:
            print(json.dumps(entry))
        logger.error(f"Loki unavailable, logged to stdout: {e}")
Pod Restart Recovery:
def start_log_collection(self):
    # On restart, re-list running pods and re-attach to their log streams.
    # By default the stream replays existing history; the sketch below shows
    # how since_seconds can bound that replay.
    pods = self.v1.list_namespaced_pod(namespace=self.namespace)
    for pod in pods.items:
        if pod.status.phase == 'Running':
            self._start_pod_log_stream(pod)
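One optional refinement, not part of the original client: read_namespaced_pod_log accepts a since_seconds parameter, which bounds how much history gets replayed when the aggregator re-attaches after a restart. A sketch of the streaming method with that parameter added:

def _stream_container_logs(self, pod_name, container_name, since_seconds=60):
    """Variant of the streaming method that bounds replay after a restart."""
    log_stream = self.v1.read_namespaced_pod_log(
        name=pod_name,
        namespace=self.namespace,
        container=container_name,
        follow=True,
        since_seconds=since_seconds,   # replay at most the last minute of logs
        _preload_content=False
    )
    for line in log_stream:
        if line:
            self.log_queue.put({
                'timestamp': datetime.utcnow().isoformat() + 'Z',
                'pod': pod_name,
                'container': container_name,
                'namespace': self.namespace,
                'cluster': self.cluster_name,
                'message': line.decode('utf-8').strip()
            })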
Migration to AWS Firelens
When AWS Firelens became available on EKS Fargate in late 2020, migration was straightforward:
Before (Custom Aggregator)
# Separate log aggregator deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: log-aggregator
spec:
  # ... custom aggregator config
After (Firelens)
# Native Fargate logging configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  template:
    metadata:
      annotations:
        fluentbit.io/parser: json
    spec:
      containers:
      - name: app
        image: myapp:latest
        # Logs automatically forwarded to configured destination
The migration validated that the namespace-level approach was the right architectural choice - it required minimal changes to applications and provided a clean upgrade path.
Lessons Learned
1. Serverless Constraints Drive Innovation
EKS Fargate’s limitations forced creative solutions that were often more elegant than traditional approaches.
2. Namespace-Level Aggregation Scales Well
The pattern of one aggregator per namespace provided the right balance of isolation and efficiency.
3. Batching is Critical for Performance
Real-time log streaming without batching would have overwhelmed both the aggregator and Loki with small HTTP requests.
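To put rough numbers on that, using the medium-namespace volume from the table above:

# Unbatched vs batched push rate for a busy namespace (illustrative volume).
lines_per_minute = 5_000
unbatched_posts_per_sec = lines_per_minute / 60        # ~83 POSTs/sec to Loki
batched_posts_per_sec = lines_per_minute / 60 / 100    # <1 POST/sec with batches of 100
print(f"{unbatched_posts_per_sec:.1f} vs {batched_posts_per_sec:.2f} requests/sec")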
4. RBAC Scoping Matters
Namespace-scoped service accounts provided appropriate security boundaries for log collection.
5. Development Tools Enable Rapid Iteration
Telepresence was invaluable for developing Kubernetes-native applications locally.
Modern Alternatives and Evolution
As of 2024, the logging landscape has evolved significantly:
AWS Native Solutions:
- AWS Firelens (now standard for Fargate)
- CloudWatch Container Insights
- AWS Distro for OpenTelemetry
Cloud-Native Options:
- Fluent Operator for Kubernetes
- Vector for high-performance log processing
- Grafana Agent for unified observability
Service Mesh Integration:
- Istio access logs
- Linkerd tap for real-time observability
- Envoy proxy statistics
Conclusion
Building this EKS Fargate logging solution taught valuable lessons about working within platform constraints and developing pragmatic solutions for emerging technologies. Key takeaways:
- Early adoption often requires custom solutions before native support arrives
- Architectural patterns that respect platform boundaries age better than workarounds
- Observability gaps can significantly impact debugging and operations
- Cost optimization through shared infrastructure pays dividends at scale
While AWS Firelens eventually provided the official solution, this custom approach:
- Served production workloads for 12+ months
- Enabled early Fargate adoption when logging was a blocker
- Provided migration path to native solutions
- Demonstrated effective constraint-driven engineering
The complete implementation is available at github.com/lucidprogrammer/eks-fargate-loki-client.
Working with serverless Kubernetes or need custom observability solutions? I’m available for consulting on cloud-native logging and monitoring architectures through Upwork.