Overview

This article starts from a Reddit discussion: Pros and Cons of Daemonset agents vs sidecars.

The original discussion asks whether agents for logs, metrics, security, and similar capabilities in Kubernetes should be deployed as DaemonSets or as sidecar containers inside each application Pod.

Even for regular agents, this choice involves clear architectural trade-offs. In a service mesh, the trade-off becomes sharper. Log or metrics agents are usually out-of-band collectors; when such an agent fails, the main impact is often lost or delayed observability data. A service mesh proxy, by contrast, usually sits directly on the business traffic path. If the proxy fails, requests themselves may fail.

So the service mesh question is not simply:

Is sidecar better, or is DaemonSet better?

A more accurate question is:

Which capabilities must stay close to the workload?
Which capabilities can move down to the node?
Which capabilities need an independent L7 waypoint or gateway?

DaemonSet vs Sidecar for Regular Agents

In regular agent scenarios, the difference between DaemonSet and sidecar can be understood like this:

Model	Advantages	Disadvantages
DaemonSet agent	One agent per Node, lower resource cost, simpler centralized upgrades, no need to restart application Pods	Usually has higher privileges; one agent failure may affect multiple workloads on the same Node
Sidecar agent	One agent per Pod, resources and configuration can be controlled at Pod granularity, isolation boundary is finer	Every Pod carries an agent, resource overhead is higher, and upgrading the sidecar often requires rolling application Pods

If the agent is node-level and generic and collects data horizontally across workloads, for example node logs, container runtime metrics, or node security scan results, a DaemonSet is usually the more natural fit.

If the agent strongly depends on the context, configuration, lifecycle, or isolation boundary of a single business Pod, such as reading local files next to the business process, injecting process-level configuration, or applying fine-grained policies per business instance, sidecar is usually more suitable.

This judgment framework can also be applied to service mesh, but with one important difference: a service mesh proxy is no longer just an "out-of-band observer." It is a "traffic intermediary."

Why Service Mesh Is More Sensitive

The core idea of service mesh is to extract cross-cutting capabilities from service-to-service communication, such as:

mTLS.
Service identity.
Access control.
Traffic governance.
Retries, timeouts, and circuit breaking.
Observability.
L7 routing and policy.

In a mesh, these capabilities are executed by the data plane proxy rather than by the application. How the proxy is deployed therefore determines where these capabilities are enforced, what the failure radius is, how much resource cost is introduced, and how upgrades are performed.

If a regular logging agent fails, business requests may still continue; only logs may be lost or delayed. If a mesh proxy fails, business requests may fail directly. This is why, when discussing sidecar vs host agent in service mesh, we cannot only ask whether it saves resources. We also need to ask:

Does traffic pass through a shared component?
How many workloads are affected if the shared component fails?
Where are service identity and policy enforced?
Does the L7 capability need to stay close to the business instance?
Does upgrading the proxy require restarting application Pods?

Characteristics of Sidecar Mesh

Traditional service mesh usually uses the sidecar proxy model. In Istio sidecar mode or Linkerd, for example, every business Pod has a proxy container:

Pod
  |-- app container
  `-- mesh proxy sidecar

Inbound and outbound traffic from the application container is redirected to the sidecar, which handles mTLS, traffic governance, access control, telemetry, and other capabilities.

The advantage of this model is that the boundary is clear. Each workload has its own proxy, and the proxy shares lifecycle, network namespace, and part of the context with the business Pod. Both L4 and L7 capabilities can be enforced near the workload boundary, and the policy semantics are intuitive.

Its cost is also obvious:

Every Pod gets an additional proxy, so the number of proxies in the cluster grows linearly with the number of Pods.
CPU, memory, and connection state are duplicated per Pod.
Upgrading the proxy usually means rolling the application Pod.
Injection, initialization, iptables or CNI redirection, and related steps increase operational complexity.
Application startup and shutdown also need to consider sidecar lifecycle and traffic draining.

Therefore, sidecar mesh is well suited for scenarios that need strong isolation, complete L7 capabilities, and independent governance per workload. But in large clusters, its resource and upgrade costs become increasingly visible.

Characteristics of Host Agent or Node-level Proxy

If the proxy is moved out of each Pod and changed to one shared proxy per Node, the model becomes close to a DaemonSet agent:

Node
  |-- mesh node proxy
  |-- Pod A
  |-- Pod B
  `-- Pod C

The benefits are straightforward:

The number of proxies drops from one per Pod to one per Node.
Resource utilization is higher.
Proxy upgrades do not necessarily require restarting application Pods.
Application Pods do not need sidecar injection, so onboarding is lighter.
Node-local invocation becomes easier. If two frequently communicating service instances are scheduled on the same Node, the node-level proxy can prefer local endpoints, reducing cross-node network hops and latency.

The last point is easy to miss. Sidecar mode can certainly also support locality-aware load balancing, but a node-level proxy naturally sits at the Node boundary and can more easily know which local endpoints exist on the current Node. If the scheduler also uses pod affinity, topology spread, and similar policies to place frequently interacting service instances on the same set of nodes, the data plane can prefer same-node calls. This reduces cross-node forwarding, lowers latency, and may also reduce NIC load, overlay encapsulation overhead, and cross-zone traffic cost.

But this is not a capability automatically granted by using a host agent. At least three conditions must hold at the same time: the scheduler can place strongly related instances close to each other; the proxy or routing layer can understand which Node each endpoint belongs to; and the policy allows "prefer local, then fall back to remote when needed." If the policy becomes "local endpoints only," then when the local Node has no healthy endpoint, availability may suffer.
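The "prefer local, then fall back to remote" policy can be sketched in a few lines. This is a minimal illustration, not how any particular mesh implements locality-aware routing: the `Endpoint` type, the `pick_endpoint` function, and the `local_only` flag are hypothetical names introduced here.

```python
import random
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Endpoint:
    """Hypothetical endpoint record; a real proxy would get this from its control plane."""
    address: str
    node: str
    healthy: bool


def pick_endpoint(
    endpoints: Sequence[Endpoint],
    local_node: str,
    local_only: bool = False,
    rng: Optional[random.Random] = None,
) -> Optional[Endpoint]:
    """Prefer a healthy endpoint on local_node; otherwise fall back to a remote one.

    With local_only=True there is no fallback, so the function returns None
    when the local Node has no healthy endpoint.
    """
    if rng is None:
        rng = random.Random()
    healthy = [e for e in endpoints if e.healthy]
    local = [e for e in healthy if e.node == local_node]
    if local:
        return rng.choice(local)
    if local_only or not healthy:
        return None
    return rng.choice(healthy)


if __name__ == "__main__":
    endpoints = [
        Endpoint("10.0.0.1", "node-a", healthy=False),
        Endpoint("10.0.1.1", "node-b", healthy=True),
    ]
    # The only local endpoint is unhealthy, so the call falls back to node-b.
    print(pick_endpoint(endpoints, "node-a"))
    # With a strict local-only policy the same call finds nothing to route to.
    print(pick_endpoint(endpoints, "node-a", local_only=True))
```

The second call shows the availability risk described above: a strict local-only policy leaves the caller with no endpoint even though a healthy remote instance exists.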

The problems are equally clear:

A node-level proxy is a shared component; if it fails, multiple workloads on the same Node may be affected.
Identity, policy, and traffic for different workloads must be strictly isolated inside the shared proxy.
If all L7 capabilities are placed in the node proxy, complexity and security boundaries become hard to control.
The traffic path and troubleshooting model may be less intuitive than in sidecar mode.

So a node-level proxy is not a simple replacement for sidecar. It is more suitable for generic, foundational, lower-level capabilities, such as L4 connection handling, mTLS tunnels, basic telemetry, and basic authorization. The closer a capability gets to business-level L7 semantics, the more carefully its scope needs to be defined.
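The trade-off between proxy count and failure radius is easy to see with some arithmetic. The sketch below assumes a uniform number of Pods per Node, and the cluster size is a made-up example; the `Cluster` type and both functions are hypothetical helpers, not part of any mesh.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    nodes: int
    pods_per_node: int  # assumed uniform across Nodes for simplicity


def proxy_count(cluster: Cluster, model: str) -> int:
    """Number of data plane proxies the cluster runs under each model."""
    if model == "sidecar":
        return cluster.nodes * cluster.pods_per_node  # one proxy per Pod
    if model == "node":
        return cluster.nodes  # one shared proxy per Node
    raise ValueError(f"unknown model: {model}")


def pods_affected_by_one_proxy_failure(cluster: Cluster, model: str) -> int:
    """Number of Pods whose traffic depends on a single proxy instance."""
    if model == "sidecar":
        return 1  # the failure stays inside one Pod
    if model == "node":
        return cluster.pods_per_node  # every Pod on that Node shares the proxy
    raise ValueError(f"unknown model: {model}")


if __name__ == "__main__":
    cluster = Cluster(nodes=100, pods_per_node=30)  # illustrative numbers only
    for model in ("sidecar", "node"):
        print(
            model,
            proxy_count(cluster, model),
            pods_affected_by_one_proxy_failure(cluster, model),
        )
```

For this example cluster, the sidecar model runs 3000 proxies with a failure radius of one Pod, while the node-level model runs 100 proxies with a failure radius of 30 Pods. The saving and the larger blast radius come from the same factor, the Pod density per Node.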

The Layered Idea Behind Ambient Mesh

Istio ambient mode can be seen as a systematic answer to this trade-off. According to Istio's official documentation, ambient mode splits the data plane into two layers:

ztunnel:
  Node-level L4 proxy that handles mTLS, L4 authorization, and basic telemetry

waypoint proxy:
  Optional L7 proxy that handles HTTP routing, L7 policy, L7 telemetry, and similar capabilities

In other words, ambient mode does not put every capability into a single node-level proxy. It introduces layering:

Foundational L4 security and encryption capabilities are moved down to the ztunnel on each Node.
Capabilities that require L7 semantics are enabled through waypoint proxies.
Application Pods do not need sidecar injection, and they do not need to be restarted just to join the mesh.

This model tries to preserve the core capabilities of sidecar mesh while reducing the cost of running a full proxy in every Pod.

But ambient is not a free lunch. It replaces "one sidecar per Pod" with a layered governance model: a node-level ztunnel plus optional waypoints. Users need to understand:

Which capabilities only require L4.
Which policies depend on L7.
Whether waypoint should be deployed per namespace, per service, or at a finer granularity.
When failures occur, whether the problem is in ztunnel, waypoint, CNI redirection, or the business application itself.
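The first two questions can be approached by classifying what each policy matches on. The sketch below is an assumption-laden illustration: the attribute names are labels invented for this example, not Istio API fields, and the split simply follows the rule above that ztunnel handles L4 while anything needing L7 semantics goes to a waypoint.

```python
from typing import Iterable

# Illustrative attribute labels (hypothetical, not Istio API field names).
L4_ATTRIBUTES = frozenset(
    {"source_identity", "source_namespace", "source_ip", "destination_port"}
)
L7_ATTRIBUTES = frozenset(
    {"http_method", "http_path", "http_header", "request_host"}
)


def enforcement_layer(policy_attributes: Iterable[str]) -> str:
    """Return which layer would have to enforce a policy with these attributes.

    A policy that touches any L7 attribute needs a waypoint; a policy that
    only uses L4 attributes can stay in the node-level ztunnel.
    """
    attrs = set(policy_attributes)
    unknown = attrs - L4_ATTRIBUTES - L7_ATTRIBUTES
    if unknown:
        raise ValueError(f"unclassified attributes: {sorted(unknown)}")
    return "waypoint" if attrs & L7_ATTRIBUTES else "ztunnel"


if __name__ == "__main__":
    print(enforcement_layer(["source_identity", "destination_port"]))
    print(enforcement_layer(["source_identity", "http_path"]))
```

Running an inventory of existing policies through a check like this gives a rough idea of how many services actually need a waypoint, which in turn informs the granularity question.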

Comparing the Three Models

The deployment models of service mesh proxies can be summarized into three categories:

Model	Proxy Location	Suitable Capabilities	Main Advantages	Main Costs
Sidecar mesh	One proxy per Pod	Full L4 + L7 capabilities	Clear workload boundary, strong isolation, intuitive semantics	Many proxies, high resource cost, upgrades often require rolling application Pods
Node-level proxy	One shared proxy per Node	L4, basic security, basic telemetry	Lower resource cost, lighter onboarding, more independent proxy upgrades, and easier node-local invocation	Larger failure radius, high isolation requirements inside the shared proxy
Ambient / waypoint	L4 at node level, L7 through on-demand waypoints	L4 enabled by default, L7 enabled on demand	Separates foundational and advanced capabilities, reducing the number of sidecars	More complex model; users must understand the boundary between ztunnel and waypoint

None of these models is absolutely better than the others. They represent different engineering trade-offs.

How to Choose

If the business needs complete L7 capabilities and strongly values an independent governance boundary for every workload, sidecar mesh remains the more intuitive and mature choice.

If the main requirement is default encryption, basic service identity, basic authorization, and basic telemetry, rather than complex L7 governance for every request, then a node-level proxy or ambient mode becomes more attractive.

If the cluster is very large and the number of Pods is high, the CPU, memory, and upgrade costs introduced by sidecars become hard to ignore. In that case, it is worth considering moving foundational capabilities down into a shared data plane and enabling L7 capabilities only when needed.

If the system contains service combinations that call each other very frequently, "scheduling affinity plus node-local endpoint preference" can also be used as an optimization. First, place these service instances as close as possible on the same Node or within the same topology domain. Then let the node-level proxy prefer local healthy instances. This optimization is suitable for low-latency, high-throughput paths with relatively stable call relationships, but it should not break failover.

If organizational boundaries, tenant isolation, or security auditing are strict, resource cost should not be the only consideration. The blast radius and policy isolation difficulty introduced by shared proxies must be evaluated during architecture design.

A practical decision rule is:

Capabilities that must stay close to the workload:
  prefer sidecar

Generic L4 security and connection capabilities:
  prefer node-level proxy

L7 capabilities needed only by some services:
  prefer on-demand waypoint / gateway
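The rule above can be written as a small function. This is only a sketch of the article's heuristic: the `Capability` type and its two boolean fields are hypothetical simplifications, and a real decision would also weigh blast radius, tenant isolation, and operational maturity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capability:
    """Hypothetical description of one mesh capability a team wants to enable."""
    name: str
    must_stay_close_to_workload: bool
    needs_l7: bool


def placement(capability: Capability) -> str:
    """Apply the decision rule in order: workload proximity, then L7, then L4."""
    if capability.must_stay_close_to_workload:
        return "sidecar"
    if capability.needs_l7:
        return "on-demand waypoint / gateway"
    return "node-level proxy"


if __name__ == "__main__":
    examples = [
        Capability("per-instance policy", must_stay_close_to_workload=True, needs_l7=True),
        Capability("mTLS tunnel", must_stay_close_to_workload=False, needs_l7=False),
        Capability("HTTP header routing", must_stay_close_to_workload=False, needs_l7=True),
    ]
    for cap in examples:
        print(f"{cap.name}: {placement(cap)}")
```

The ordering of the checks encodes the priority: a requirement to stay close to the workload wins over everything else, and only capabilities without L7 needs default to the shared node-level proxy.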

Summary

The service mesh discussion is indeed similar to the DaemonSet agent vs sidecar discussion, but the conclusions from one cannot be copied mechanically to the other.

Regular agents are usually out-of-band components. Service mesh proxies sit on the traffic path. When an out-of-band component optimizes for resource efficiency, sharing can be more aggressive. Once a traffic intermediary becomes shared, identity isolation, policy boundaries, failure radius, and observability must be handled carefully.

Therefore, sidecar, host agent, and ambient are not three slogans that simply replace one another. They are three different ways to place capabilities:

sidecar:
  capabilities stay close to the workload, isolation is strong, cost is high

host agent / node proxy:
  capabilities move down to the node, resources are saved, node-local invocation is easier, but the failure radius is larger

ambient / waypoint:
  L4 is shared by default, L7 is enabled on demand, layering balances cost and capability

In the long run, the evolution of service mesh is probably not about "getting rid of sidecars entirely." It is more likely about placing different layers of capability in different positions: foundational capabilities become shared, advanced capabilities become on-demand, and strong-isolation scenarios still keep a proxy close to the workload.

References

Reddit: Pros and Cons of Daemonset agents vs sidecars
Istio: Sidecar or ambient?
Istio: Ambient Mesh Overview
Istio: Configure waypoint proxies
Istio: Traffic Distribution
Linkerd: Architecture

Service Mesh: Sidecar or Host Agent?