格物致知

Preface

In an earlier article, we analyzed the Flannel networking model. Flannel does solve the problem of network connectivity between containers inside a Kubernetes cluster. However, it cannot solve another common problem: direct connectivity between containers inside the cluster and virtual machines or physical machines outside the cluster.

More precisely, services outside the cluster cannot directly ping container IPs inside the cluster. This means that in service discovery and registration scenarios such as Dubbo-based microservices, a consumer outside the Kubernetes cluster cannot directly reach a provider inside the cluster at the network layer.

A natural question follows: why is Flannel powerless in this scenario?

The reason is that container IPs in a Kubernetes cluster are independently generated by flanneld. They are not part of the VPC CIDR range. As a result, servers outside the cluster do not have the corresponding route entries needed to forward packets to those container IPs.

The solution almost suggests itself: if Pod IPs are allocated from the VPC CIDR, external servers can route to them directly.
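The routing argument can be checked with a few lines of Go. This is a minimal sketch, assuming a VPC CIDR of 10.0.0.0/16 and an overlay-style Pod address of 10.244.1.5; both values and the helper name routableFromVPC are illustrative and do not come from any real cluster.

```go
package main

import (
	"fmt"
	"net/netip"
)

// routableFromVPC reports whether ip lies inside vpcCIDR, i.e. whether a host
// that only has routes for the VPC range could reach it without extra entries.
func routableFromVPC(vpcCIDR, ip string) (bool, error) {
	prefix, err := netip.ParsePrefix(vpcCIDR)
	if err != nil {
		return false, err
	}
	addr, err := netip.ParseAddr(ip)
	if err != nil {
		return false, err
	}
	return prefix.Contains(addr), nil
}

func main() {
	const vpcCIDR = "10.0.0.0/16" // assumed VPC range
	// The first address stands for an overlay-allocated Pod IP,
	// the second for a Pod IP taken from the VPC range.
	for _, ip := range []string{"10.244.1.5", "10.0.97.30"} {
		ok, err := routableFromVPC(vpcCIDR, ip)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s inside %s: %v\n", ip, vpcCIDR, ok)
	}
}
```

Only the second address falls inside the VPC range, which is why an external server can route to it without additional configuration.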

That is exactly the idea behind VPC-CNI: allocate IP addresses from the VPC CIDR to containers. In this model, workloads inside and outside the cluster can communicate directly without a separate overlay network boundary. Another benefit is performance: because the Flannel VXLAN encapsulation and decapsulation path is removed, network performance can improve significantly.

When migrating business systems into Kubernetes, especially systems built on RPC and service registries, preserving direct connectivity between in-cluster and out-of-cluster services is often important. In that scenario, VPC-CNI is usually a strong choice.

Principle

The main implementation idea is:

When a worker node starts, it attaches multiple ENIs (Elastic Network Interfaces).
Each ENI has one primary IP and multiple secondary IPs.
ipamd (Local IP Address Manager) runs on every worker node and adds all secondary IPs from all ENIs into a local IP address pool.
When CNI receives a Pod creation request, it asks ipamd for an IP address through gRPC and then configures the Pod network stack. Conversely, when CNI receives a Pod deletion request, it notifies ipamd to release the IP and also tears down the Pod network stack.
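The steps above can be sketched as a tiny local pool. This is a simplified illustration, not the real ipamd data structure: the types eni and ipPool, the Pod key format, and all addresses are assumptions made for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// eni is a simplified stand-in for an attached Elastic Network Interface.
type eni struct {
	id           string
	primaryIP    string
	secondaryIPs []string
}

// ipPool is a hypothetical local address pool keyed by Pod.
type ipPool struct {
	free     []string          // secondary IPs not yet handed out
	assigned map[string]string // Pod key -> IP
}

var errPoolEmpty = errors.New("no free IP address in the local pool")

// newIPPool collects the secondary IPs of every ENI; primary IPs are left out.
func newIPPool(enis []eni) *ipPool {
	p := &ipPool{assigned: make(map[string]string)}
	for _, e := range enis {
		p.free = append(p.free, e.secondaryIPs...)
	}
	return p
}

// assign hands out a free IP for podKey, or returns the one it already holds.
func (p *ipPool) assign(podKey string) (string, error) {
	if ip, ok := p.assigned[podKey]; ok {
		return ip, nil
	}
	if len(p.free) == 0 {
		return "", errPoolEmpty
	}
	ip := p.free[0]
	p.free = p.free[1:]
	p.assigned[podKey] = ip
	return ip, nil
}

// release returns the IP held by podKey to the pool.
func (p *ipPool) release(podKey string) bool {
	ip, ok := p.assigned[podKey]
	if !ok {
		return false
	}
	delete(p.assigned, podKey)
	p.free = append(p.free, ip)
	return true
}

func main() {
	p := newIPPool([]eni{
		{id: "eni-a", primaryIP: "10.0.97.10", secondaryIPs: []string{"10.0.97.30", "10.0.97.31"}},
		{id: "eni-b", primaryIP: "10.0.98.10", secondaryIPs: []string{"10.0.98.40"}},
	})
	ip, err := p.assign("default/web-0")
	fmt.Println(ip, err)
	fmt.Println(p.release("default/web-0"))
}
```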

Figure: VPC-CNI allocates Pod IPs from ENI secondary addresses

CNI

VPC-CNI follows the Kubernetes CNI network model and mainly implements cmdAdd and cmdDel, which handle Pod network creation and teardown respectively.

cmdAdd

Code path: cmd/routed-eni-cni-plugin/cni.go

func cmdAdd(args *skel.CmdArgs) error {
	return add(args, typeswrapper.New(), grpcwrapper.New(), rpcwrapper.New(), driver.New())
}

func add(args *skel.CmdArgs, cniTypes typeswrapper.CNITYPES, grpcClient grpcwrapper.GRPC,
	rpcClient rpcwrapper.RPC, driverClient driver.NetworkAPIs) error {

	conf, log, err := LoadNetConf(args.StdinData)
    ...
	// Parse Kubernetes arguments.
    var k8sArgs K8sArgs
	if err := cniTypes.LoadArgs(args.Args, &k8sArgs); err != nil {
		log.Errorf("Failed to load k8s config from arg: %v", err)
		return errors.Wrap(err, "add cmd: failed to load k8s config from arg")
	}
    ...
	// Send a gRPC request to the ipamd server.
	conn, err := grpcClient.Dial(ipamdAddress, grpc.WithInsecure())
	...
	c := rpcClient.NewCNIBackendClient(conn)
    
    // Call ipamd's AddNetwork API to obtain an IP address.
	r, err := c.AddNetwork(context.Background(),
		&pb.AddNetworkRequest{
			ClientVersion:              version,
			K8S_POD_NAME:               string(k8sArgs.K8S_POD_NAME),
			K8S_POD_NAMESPACE:          string(k8sArgs.K8S_POD_NAMESPACE),
			K8S_POD_INFRA_CONTAINER_ID: string(k8sArgs.K8S_POD_INFRA_CONTAINER_ID),
			Netns:                      args.Netns,
			ContainerID:                args.ContainerID,
			NetworkName:                conf.Name,
			IfName:                     args.IfName,
		})
    ...
	addr := &net.IPNet{
		IP:   net.ParseIP(r.IPv4Addr),
		Mask: net.IPv4Mask(255, 255, 255, 255),
	}
    ...
	// After obtaining the IP, call the driver module to configure the Pod network namespace.
	err = driverClient.SetupNS(hostVethName, args.IfName, args.Netns, addr, int(r.DeviceNumber), r.VPCcidrs, r.UseExternalSNAT, mtu, log)
	if err != nil {
		...
	}
    ...
	ips := []*current.IPConfig{
		{
			Version: "4",
			Address: *addr,
		},
	}

	result := &current.Result{
		IPs: ips,
	}

	return cniTypes.PrintResult(result, conf.CNIVersion)
}

In short, CNI requests an IP address from ipamd through gRPC. After receiving the IP, it calls the driver module to set up the Pod networking environment.
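One detail in the excerpt is worth isolating: the address returned by ipamd is wrapped in a /32 network before it is handed to the driver. The sketch below shows that step on its own using only the standard library. The helper name hostAddr is hypothetical, and the nil check is an addition for the example; the excerpt does not show whether the plugin validates the string.

```go
package main

import (
	"fmt"
	"net"
)

// hostAddr wraps an IPv4 string in a /32 network, the shape the Pod address
// takes before it is configured on the Pod-side interface.
func hostAddr(ipv4 string) (*net.IPNet, error) {
	ip := net.ParseIP(ipv4)
	if ip == nil || ip.To4() == nil {
		return nil, fmt.Errorf("not an IPv4 address: %q", ipv4)
	}
	return &net.IPNet{IP: ip.To4(), Mask: net.CIDRMask(32, 32)}, nil
}

func main() {
	addr, err := hostAddr("10.0.97.30")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(addr.String())

	// net.IPv4Mask(255, 255, 255, 255) and net.CIDRMask(32, 32) describe the same mask.
	fmt.Println(net.IPv4Mask(255, 255, 255, 255).String() == net.CIDRMask(32, 32).String())
}
```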

cmdDel

cmdDel releases the Pod IP and cleans up the Pod networking environment.

func cmdDel(args *skel.CmdArgs) error {
	return del(args, typeswrapper.New(), grpcwrapper.New(), rpcwrapper.New(), driver.New())
}

func del(args *skel.CmdArgs, cniTypes typeswrapper.CNITYPES, grpcClient grpcwrapper.GRPC, rpcClient rpcwrapper.RPC,
	driverClient driver.NetworkAPIs) error {

	conf, log, err := LoadNetConf(args.StdinData)
    ...
	var k8sArgs K8sArgs
	if err := cniTypes.LoadArgs(args.Args, &k8sArgs); err != nil {
		log.Errorf("Failed to load k8s config from args: %v", err)
		return errors.Wrap(err, "del cmd: failed to load k8s config from args")
	}
	// Send a gRPC request to notify ipamd to release the IP.
	conn, err := grpcClient.Dial(ipamdAddress, grpc.WithInsecure())
	...
	c := rpcClient.NewCNIBackendClient(conn)

	r, err := c.DelNetwork(context.Background(), &pb.DelNetworkRequest{
		ClientVersion:              version,
		K8S_POD_NAME:               string(k8sArgs.K8S_POD_NAME),
		K8S_POD_NAMESPACE:          string(k8sArgs.K8S_POD_NAMESPACE),
		K8S_POD_INFRA_CONTAINER_ID: string(k8sArgs.K8S_POD_INFRA_CONTAINER_ID),
		NetworkName:                conf.Name,
		ContainerID:                args.ContainerID,
		IfName:                     args.IfName,
		Reason:                     "PodDeleted",
	})
	...
	deletedPodIP := net.ParseIP(r.IPv4Addr)
	if deletedPodIP != nil {
		addr := &net.IPNet{
			IP:   deletedPodIP,
			Mask: net.IPv4Mask(255, 255, 255, 255),
		}
		...
		// Call the driver's TeardownNS API to clean up the Pod network stack.
		err = driverClient.TeardownNS(addr, int(r.DeviceNumber), log)
		...
	}
	return nil
}

driver

The driver module provides the tools for creating and tearing down the Pod network stack. Its main functions are SetupNS and TeardownNS.

Code path: cmd/routed-eni-cni-plugin/driver/driver.go

Code flow:

Figure: Driver flow for setting up the Pod network namespace

SetupNS

The main responsibility of this function is to configure the Pod network stack, including preparing the Pod network environment and configuring policy routing.

In the AWS CNI networking model, each ENI on the node has a corresponding route table for forwarding traffic from Pods. Policy routing is used so that traffic to Pods prefers the main route table, while traffic from Pods uses the route table associated with the corresponding ENI. Therefore, setting up the Pod network environment also includes configuring policy routing.

func (os *linuxNetwork) SetupNS(hostVethName string, contVethName string, netnsPath string, addr *net.IPNet, deviceNumber int, vpcCIDRs []string, useExternalSNAT bool, mtu int, log logger.Logger) error {
	log.Debugf("SetupNS: hostVethName=%s, contVethName=%s, netnsPath=%s, deviceNumber=%d, mtu=%d", hostVethName, contVethName, netnsPath, deviceNumber, mtu)
	return setupNS(hostVethName, contVethName, netnsPath, addr, deviceNumber, vpcCIDRs, useExternalSNAT, os.netLink, os.ns, mtu, log, os.procSys)
}


func setupNS(hostVethName string, contVethName string, netnsPath string, addr *net.IPNet, deviceNumber int, vpcCIDRs []string, useExternalSNAT bool,
	netLink netlinkwrapper.NetLink, ns nswrapper.NS, mtu int, log logger.Logger, procSys procsyswrapper.ProcSys) error {

    // Call setupVeth to set up the Pod network environment.
	hostVeth, err := setupVeth(hostVethName, contVethName, netnsPath, addr, netLink, ns, mtu, procSys, log)
    ...
	addrHostAddr := &net.IPNet{
		IP:   addr.IP,
		Mask: net.CIDRMask(32, 32)}

    // Add a route to the Pod in the node's main route table: ip route add $ip dev veth-1.
	route := netlink.Route{
		LinkIndex: hostVeth.Attrs().Index,
		Scope:     netlink.SCOPE_LINK,
		Dst:       addrHostAddr}
   
    // The netlink interface wraps Linux commands such as ip link, ip route, and ip rule.
	if err := netLink.RouteReplace(&route); err != nil {
		return errors.Wrapf(err, "setupNS: unable to add or replace route entry for %s", route.Dst.IP.String())
	}
    
    // Add the to-Pod policy routing rule: 512: from all to 10.0.97.30 lookup main.
	err = addContainerRule(netLink, true, addr, mainRouteTable)
       ...
    
    // Use ENI deviceNumber to determine whether this is the primary ENI; 0 means primary ENI.
    // If the ENI is not the primary ENI, add the policy route for traffic leaving the Pod:
    // 1536: from 10.0.97.30 lookup eni-1.
	if deviceNumber > 0 {
		tableNumber := deviceNumber + 1
		err = addContainerRule(netLink, false, addr, tableNumber)
        ...
	}
	return nil
}

The final effect looks like this:

ip rule list
0:    from all lookup local 
512:  from all to 10.0.97.30 lookup main <---------- to Pod's traffic
1025: not from all to 10.0.0.0/16 lookup main 
1536: from 10.0.97.30 lookup eni-1 <-------------- from Pod's traffic
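To see how these rules steer traffic, the following sketch evaluates them in priority order for a given source and destination. It is a simplified model under stated assumptions: the local table (priority 0) is left out, the first matching rule is treated as final (a real kernel continues to the next rule if the selected table has no matching route), and the type and function names are made up for the example.

```go
package main

import (
	"fmt"
	"net/netip"
	"sort"
)

// rule models one policy routing entry. A nil src or dst means "all".
type rule struct {
	priority int
	src, dst *netip.Prefix
	not      bool // models the "not" keyword, which inverts the selector
	table    string
}

func prefix(s string) *netip.Prefix {
	p := netip.MustParsePrefix(s)
	return &p
}

// exampleRules reproduces the listing above, minus the local table.
func exampleRules() []rule {
	return []rule{
		{priority: 1536, src: prefix("10.0.97.30/32"), table: "eni-1"},
		{priority: 512, dst: prefix("10.0.97.30/32"), table: "main"},
		{priority: 1025, dst: prefix("10.0.0.0/16"), not: true, table: "main"},
	}
}

func (r rule) matches(src, dst netip.Addr) bool {
	hit := (r.src == nil || r.src.Contains(src)) && (r.dst == nil || r.dst.Contains(dst))
	return hit != r.not
}

// lookupTable returns the table of the matching rule with the lowest priority
// number, or "" if no rule matches. It panics on malformed addresses.
func lookupTable(rules []rule, src, dst string) string {
	s, d := netip.MustParseAddr(src), netip.MustParseAddr(dst)
	sorted := append([]rule(nil), rules...)
	sort.SliceStable(sorted, func(i, j int) bool { return sorted[i].priority < sorted[j].priority })
	for _, r := range sorted {
		if r.matches(s, d) {
			return r.table
		}
	}
	return ""
}

func main() {
	rules := exampleRules()
	fmt.Println("to Pod:            ", lookupTable(rules, "10.0.5.8", "10.0.97.30"))
	fmt.Println("Pod to VPC address:", lookupTable(rules, "10.0.97.30", "10.0.5.8"))
	fmt.Println("Pod to outside VPC:", lookupTable(rules, "10.0.97.30", "8.8.8.8"))
}
```

In this model, traffic to the Pod and traffic from the Pod to destinations outside 10.0.0.0/16 resolve to main, while traffic from the Pod to another VPC address resolves to eni-1.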

createVethPairContext

The createVethPairContext struct contains the parameters needed to create a veth pair. Its run method is the concrete implementation behind setupVeth. It creates the veth pair, brings both ends up, configures the Pod gateway, installs routes, and so on.

func newCreateVethPairContext(contVethName string, hostVethName string, addr *net.IPNet, mtu int) *createVethPairContext {
	return &createVethPairContext{
		contVethName: contVethName,
		hostVethName: hostVethName,
		addr:         addr,
		netLink:      netlinkwrapper.NewNetLink(),
		ip:           ipwrapper.NewIP(),
		mtu:          mtu,
	}
}

func (createVethContext *createVethPairContext) run(hostNS ns.NetNS) error {
	veth := &netlink.Veth{
		LinkAttrs: netlink.LinkAttrs{
			Name:  createVethContext.contVethName,
			Flags: net.FlagUp,
			MTU:   createVethContext.mtu,
		},
		PeerName: createVethContext.hostVethName,
	}
    
    // Run ip link add to create the veth pair for the Pod.
	if err := createVethContext.netLink.LinkAdd(veth); err != nil {
		return err
	}

	hostVeth, err := createVethContext.netLink.LinkByName(createVethContext.hostVethName)
	...
    // Run ip link set $link up to bring up the host-side veth.
	if err = createVethContext.netLink.LinkSetUp(hostVeth); err != nil {
		return errors.Wrapf(err, "setup NS network: failed to set link %q up", createVethContext.hostVethName)
	}

	contVeth, err := createVethContext.netLink.LinkByName(createVethContext.contVethName)
	if err != nil {
		return errors.Wrapf(err, "setup NS network: failed to find link %q", createVethContext.contVethName)
	}

	// Bring up the Pod-side veth.
	if err = createVethContext.netLink.LinkSetUp(contVeth); err != nil {
		return errors.Wrapf(err, "setup NS network: failed to set link %q up", createVethContext.contVethName)
	}

    // Add a link-scoped route to the gateway 169.254.1.1 through the Pod-side veth.
	if err = createVethContext.netLink.RouteReplace(&netlink.Route{
		LinkIndex: contVeth.Attrs().Index,
		Scope:     netlink.SCOPE_LINK,
		Dst:       gwNet}); err != nil {
		return errors.Wrap(err, "setup NS network: failed to add default gateway")
	}

    // Add default route. The effect is: default via 169.254.1.1 dev eth0.
	if err = createVethContext.ip.AddDefaultRoute(gwNet.IP, contVeth); err != nil {
		return errors.Wrap(err, "setup NS network: failed to add default route")
	}
    
    // Add an IP address to eth0: ip addr add $ip dev eth0.
	if err = createVethContext.netLink.AddrAdd(contVeth, &netlink.Addr{IPNet: createVethContext.addr}); err != nil {
		return errors.Wrapf(err, "setup NS network: failed to add IP addr to %q", createVethContext.contVethName)
	}

	// Add a static ARP entry for the default gateway.
	neigh := &netlink.Neigh{
		LinkIndex:    contVeth.Attrs().Index,
		State:        netlink.NUD_PERMANENT,
		IP:           gwNet.IP,
		HardwareAddr: hostVeth.Attrs().HardwareAddr,
	}

	if err = createVethContext.netLink.NeighAdd(neigh); err != nil {
		return errors.Wrap(err, "setup NS network: failed to add static ARP")
	}
    
    // Move the host-side end of the veth pair into the host network namespace.
	if err = createVethContext.netLink.LinkSetNsFd(hostVeth, int(hostNS.Fd())); err != nil {
		return errors.Wrap(err, "setup NS network: failed to move veth to host netns")
	}
	return nil
}
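For readers more familiar with iproute2 than with netlink, the run method can be read as a sequence of ip commands executed from inside the Pod network namespace. The sketch below only builds those command strings; it does not execute anything. The struct, its field names, and the sample values (interface names, MAC address, MTU, namespace identifier) are assumptions for illustration.

```go
package main

import "fmt"

// vethSetup holds illustrative parameters; the field names are hypothetical.
type vethSetup struct {
	contVeth string // Pod-side interface name
	hostVeth string // host-side interface name
	podCIDR  string // Pod address with its /32 mask
	gateway  string // link-local gateway address used inside the Pod
	hostMAC  string // MAC address of the host-side veth
	hostNS   string // name or PID identifying the host network namespace
	mtu      int
}

// equivalentCommands lists iproute2 commands that roughly correspond to the
// netlink calls in run, in the same order.
func equivalentCommands(s vethSetup) []string {
	return []string{
		fmt.Sprintf("ip link add %s mtu %d type veth peer name %s", s.contVeth, s.mtu, s.hostVeth),
		fmt.Sprintf("ip link set %s up", s.hostVeth),
		fmt.Sprintf("ip link set %s up", s.contVeth),
		fmt.Sprintf("ip route replace %s dev %s scope link", s.gateway, s.contVeth),
		fmt.Sprintf("ip route add default via %s dev %s", s.gateway, s.contVeth),
		fmt.Sprintf("ip addr add %s dev %s", s.podCIDR, s.contVeth),
		fmt.Sprintf("ip neigh add %s lladdr %s dev %s nud permanent", s.gateway, s.hostMAC, s.contVeth),
		fmt.Sprintf("ip link set %s netns %s", s.hostVeth, s.hostNS),
	}
}

func main() {
	cmds := equivalentCommands(vethSetup{
		contVeth: "eth0",
		hostVeth: "eni0a1b2c3d4e5",
		podCIDR:  "10.0.97.30/32",
		gateway:  "169.254.1.1",
		hostMAC:  "aa:bb:cc:dd:ee:ff",
		hostNS:   "1",
		mtu:      9001,
	})
	for _, c := range cmds {
		fmt.Println(c)
	}
}
```

The static neighbor entry is what makes the 169.254.1.1 gateway work: nothing on the host owns that address, so the Pod is told up front that it resolves to the MAC of the host-side veth.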

TeardownNS

TeardownNS cleans up the Pod network environment.

func (os *linuxNetwork) TeardownNS(addr *net.IPNet, deviceNumber int, log logger.Logger) error {
	log.Debugf("TeardownNS: addr %s, deviceNumber %d", addr.String(), deviceNumber)
	return tearDownNS(addr, deviceNumber, os.netLink, log)
}

func tearDownNS(addr *net.IPNet, deviceNumber int, netLink netlinkwrapper.NetLink, log logger.Logger) error {
   ...
	// Delete the to-Pod policy routing rule by running ip rule del.
	toContainerRule := netLink.NewRule()
	toContainerRule.Dst = addr
	toContainerRule.Priority = toContainerRulePriority
	err := netLink.RuleDel(toContainerRule)
     ...
    // If the ENI is not the primary ENI, also delete the from-Pod policy routing rule.
	if deviceNumber > 0 {
		err := deleteRuleListBySrc(*addr)
      ...
	}
	addrHostAddr := &net.IPNet{
		IP:   addr.IP,
		Mask: net.CIDRMask(32, 32)}
         ...
	return nil
}

IPAMD

IPAMD is the local IP address pool manager. It runs as a DaemonSet on every worker node and maintains all available IP addresses on that node. The next question is: where does the data in the IP pool come from?

In AWS EC2, instance metadata stores information about the instance, including all ENIs attached to the EC2 instance and all IP addresses on those ENIs. It also exposes metadata APIs such as:

curl http://169.254.169.254/latest/meta-data/network/interfaces/macs/
curl http://169.254.169.254/latest/meta-data/network/interfaces/macs/0a:da:9d:51:47:28/local-ipv4s

During initialization, ipamd reads ENI and IP information through this interface and stores it in dataStore. This process is implemented in nodeInit.
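Both endpoints return plain newline-separated text, so turning a response into a Go slice takes very little code. The sketch below parses canned response bodies rather than calling the metadata service. The sample bodies are made up for illustration, the helper names are hypothetical, and the trailing "/" on each MAC entry is assumed from the directory-style listing the macs/ endpoint returns.

```go
package main

import (
	"fmt"
	"strings"
)

const metadataBase = "http://169.254.169.254/latest/meta-data/network/interfaces/macs/"

// parseListing splits a newline-separated metadata response into entries,
// dropping blank lines and the trailing "/" of directory-style entries.
func parseListing(body string) []string {
	var entries []string
	for _, line := range strings.Split(body, "\n") {
		entry := strings.TrimSuffix(strings.TrimSpace(line), "/")
		if entry != "" {
			entries = append(entries, entry)
		}
	}
	return entries
}

// localIPv4sURL builds the per-ENI URL that lists the IPv4 addresses of the
// interface with the given MAC address.
func localIPv4sURL(mac string) string {
	return metadataBase + mac + "/local-ipv4s"
}

func main() {
	// Sample bodies, not real metadata output.
	macsBody := "0a:da:9d:51:47:28/\n0a:1b:2c:3d:4e:5f/\n"
	ipsBody := "10.0.97.10\n10.0.97.30\n10.0.97.31"

	for _, mac := range parseListing(macsBody) {
		fmt.Println(mac, "->", localIPv4sURL(mac))
	}
	fmt.Println(parseListing(ipsBody))
}
```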

nodeInit

func (c *IPAMContext) nodeInit() error {
        ...
        // Request the EC2 metadata API to obtain all ENI data.
	metadataResult, err := c.awsClient.DescribeAllENIs()
	...
	enis := c.filterUnmanagedENIs(metadataResult.ENIMetadata)
         ....
		// Add ENI information.
		retry := 0
		for {
			retry++
			if err = c.setupENI(eni.ENIID, eni, isTrunkENI, isEFAENI); err == nil {
				log.Infof("ENI %s set up.", eni.ENIID)
				break
			}
                 ...
	return nil
}
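The retry loop around setupENI is partly elided in the excerpt, so the exit condition is not visible. As a general illustration of the pattern, here is a minimal bounded-retry helper. The attempt limit, the wait interval, and the names retry and failNTimes are assumptions for this sketch and are not taken from ipamd.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// retry calls op until it succeeds or maxAttempts is reached, sleeping for
// wait between attempts. It returns the number of attempts made.
func retry(maxAttempts int, wait time.Duration, op func() error) (int, error) {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = op(); err == nil {
			return attempt, nil
		}
		if attempt < maxAttempts {
			time.Sleep(wait)
		}
	}
	return maxAttempts, fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
}

// failNTimes returns an operation that fails n times and then succeeds.
func failNTimes(n int) func() error {
	calls := 0
	return func() error {
		calls++
		if calls <= n {
			return errors.New("ENI not ready yet")
		}
		return nil
	}
}

func main() {
	attempts, err := retry(5, 10*time.Millisecond, failNTimes(2))
	fmt.Println(attempts, err)
}
```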

setupENI

The main job of setupENI is to initialize dataStore, including:

Adding the ENI to dataStore.
Configuring the network device for the ENI (skipped for the primary ENI).
Adding all secondary IPs from the ENI to dataStore.

func (c *IPAMContext) setupENI(eni string, eniMetadata awsutils.ENIMetadata, isTrunkENI, isEFAENI bool) error {
	primaryENI := c.awsClient.GetPrimaryENI()
    
	err := c.dataStore.AddENI(eni, eniMetadata.DeviceNumber, eni == primaryENI, isTrunkENI, isEFAENI)
	...
	c.primaryIP[eni] = eniMetadata.PrimaryIPv4Address()

	if eni != primaryENI {
		err = c.networkClient.SetupENINetwork(c.primaryIP[eni], eniMetadata.MAC, eniMetadata.DeviceNumber, eniMetadata.SubnetIPv4CIDR)
        ...
	}
    ...
	c.addENIsecondaryIPsToDataStore(eniMetadata.IPv4Addresses, eni)
	c.addENIprefixesToDataStore(eniMetadata.IPv4Prefixes, eni)

	return nil
}

dataStore

dataStore is a local store built from plain Go structs. It tracks the ENIs attached to the local node and every IP address bound to them. Each IP record carries an ipamKey: while the IP is allocated, the key is the tuple (network name, CNI_CONTAINERID, CNI_IFNAME); while the IP is free, ipamKey is empty.
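The role of ipamKey can be illustrated with a comparable struct whose zero value means "not allocated". This is a sketch of the idea only: the lowercase type and field names are chosen for the example and are not copied from the plugin.

```go
package main

import "fmt"

// ipamKey identifies the sandbox that owns an address.
type ipamKey struct {
	networkName string
	containerID string
	ifName      string
}

// isZero reports whether the key is empty, which marks the address as free.
func (k ipamKey) isZero() bool { return k == ipamKey{} }

// addressInfo is a simplified address record.
type addressInfo struct {
	key     ipamKey
	address string
}

func (a *addressInfo) assigned() bool { return !a.key.isZero() }

// findByKey returns the record owned by k, or nil if there is none.
func findByKey(addrs []*addressInfo, k ipamKey) *addressInfo {
	if k.isZero() {
		return nil
	}
	for _, a := range addrs {
		if a.key == k {
			return a
		}
	}
	return nil
}

func main() {
	key := ipamKey{networkName: "aws-cni", containerID: "abc123", ifName: "eth0"}
	addrs := []*addressInfo{
		{address: "10.0.97.30"},
		{address: "10.0.97.31", key: key},
	}
	for _, a := range addrs {
		fmt.Println(a.address, "assigned:", a.assigned())
	}
	fmt.Println(findByKey(addrs, key).address)
}
```

Because the key is a plain struct of strings, Go's built-in equality is enough for both the "is it free" check and the lookup on release.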

Code path: pkg/ipamd/datastore/data_store.go

type DataStore struct {
	total                    int 
	assigned                 int  
	allocatedPrefix          int
	eniPool                  ENIPool 
	lock                     sync.Mutex
	log                      logger.Logger
	CheckpointMigrationPhase int 
	backingStore             Checkpointer
	cri                      cri.APIs
	isPDEnabled              bool
}

type ENI struct {
	ID         string
	createTime time.Time
	IsPrimary bool
	IsTrunk bool
	IsEFA bool
	DeviceNumber int
	AvailableIPv4Cidrs map[string]*CidrInfo
}

type AddressInfo struct {
	IPAMKey        IPAMKey
	Address        string
	UnassignedTime time.Time
}

type CidrInfo struct {
	Cidr net.IPNet    // e.g. 192.168.1.1/32 for a single secondary IP, or a /28 block for a delegated prefix
	IPv4Addresses map[string]*AddressInfo
	IsPrefix bool
}

type ENIPool map[string]*ENI   // [eniid]eni

dataStore has two important methods: AssignPodIPv4Address and UnassignPodIPv4Address. In essence, the IP allocation and release requests that CNI sends end up in these two methods.

AssignPodIPv4Address

// Allocate an IP address to a Pod.
func (ds *DataStore) AssignPodIPv4Address(ipamKey IPAMKey) (ipv4address string, deviceNumber int, err error) {
   // Lock dataStore operations.
	ds.lock.Lock()
	defer ds.lock.Unlock()
      ...
      // Iterate through dataStore's eniPool to find an available IP.
      for _, eni := range ds.eniPool {
		for _, availableCidr := range eni.AvailableIPv4Cidrs {
			var addr *AddressInfo
			var strPrivateIPv4 string
			var err error

			if (ds.isPDEnabled && availableCidr.IsPrefix) || (!ds.isPDEnabled && !availableCidr.IsPrefix) {
				strPrivateIPv4, err = ds.getFreeIPv4AddrfromCidr(availableCidr)
				if err != nil {
					ds.log.Debugf("Unable to get IP address from CIDR: %v", err)
					// Check in next CIDR.
					continue
				}
				...
			}

			addr = availableCidr.IPv4Addresses[strPrivateIPv4]
		        ...
			availableCidr.IPv4Addresses[strPrivateIPv4] = addr
            // For an allocated IP, set its ipamKey.
			ds.assignPodIPv4AddressUnsafe(ipamKey, eni, addr)
                         ...
			return addr.Address, eni.DeviceNumber, nil
		}
	}
    ...
}
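Two parts of this function lend themselves to a small experiment. First, the CIDR selection condition is logically the same as checking that isPDEnabled equals IsPrefix. Second, picking a free address amounts to walking a CIDR until an unassigned address is found. The sketch below shows both. The helper names are hypothetical, and the walk is an assumption about what a function like getFreeIPv4AddrfromCidr might do, not a copy of it.

```go
package main

import (
	"errors"
	"fmt"
	"net/netip"
)

// usesCIDR is equivalent to
// (pdEnabled && isPrefix) || (!pdEnabled && !isPrefix).
func usesCIDR(pdEnabled, isPrefix bool) bool {
	return pdEnabled == isPrefix
}

// firstFreeAddr walks cidr in address order and returns the first address
// that is not marked in assigned.
func firstFreeAddr(cidr string, assigned map[string]bool) (string, error) {
	p, err := netip.ParsePrefix(cidr)
	if err != nil {
		return "", err
	}
	p = p.Masked() // start from the first address of the block
	for a := p.Addr(); p.Contains(a); a = a.Next() {
		if !assigned[a.String()] {
			return a.String(), nil
		}
	}
	return "", errors.New("no free address in " + cidr)
}

func main() {
	// A /32 holds exactly one address; a /28 holds sixteen.
	fmt.Println(firstFreeAddr("10.0.97.30/32", nil))
	fmt.Println(firstFreeAddr("10.0.97.16/28", map[string]bool{
		"10.0.97.16": true,
		"10.0.97.17": true,
	}))
}
```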

UnassignPodIPv4Address

// Release an IP address.
func (ds *DataStore) UnassignPodIPv4Address(ipamKey IPAMKey) (e *ENI, ip string, deviceNumber int, err error) {

    ...
    // Use ipamKey to find the corresponding Pod IP address in eniPool.
	eni, availableCidr, addr := ds.eniPool.FindAddressForSandbox(ipamKey)
    ...
    // Call unassignPodIPv4AddressUnsafe to mark the IP as unallocated by clearing its ipamKey.
	ds.unassignPodIPv4AddressUnsafe(addr)
	...
    // Set the IP release time to now.
	addr.UnassignedTime = time.Now()
    ...
	return eni, addr.Address, eni.DeviceNumber, nil
}
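The excerpt records the release time but does not show how it is used afterwards. One plausible use is a cool-down window, so that an address just released by one Pod is not handed to another immediately. The sketch below illustrates that idea only. The 30-second period, the type, and the function names are assumptions for the example and are not confirmed by the code shown above.

```go
package main

import (
	"fmt"
	"time"
)

// coolingPeriod is an assumed value chosen for illustration.
const coolingPeriod = 30 * time.Second

// addressInfo is a simplified address record.
type addressInfo struct {
	address        string
	assigned       bool
	unassignedTime time.Time // zero if the address was never released
}

// canReuse reports whether the address may be handed out at time now:
// it must be free and, if it was released before, past the cool-down window.
func canReuse(a addressInfo, now time.Time) bool {
	if a.assigned {
		return false
	}
	return a.unassignedTime.IsZero() || now.Sub(a.unassignedTime) > coolingPeriod
}

// fixedNow returns a fixed instant so that the demo is deterministic.
func fixedNow() time.Time {
	return time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)
}

func main() {
	released := fixedNow()
	a := addressInfo{address: "10.0.97.30", unassignedTime: released}

	fmt.Println(canReuse(a, released.Add(10*time.Second)))
	fmt.Println(canReuse(a, released.Add(45*time.Second)))
}
```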
