key: cord-0101898-8n2ao23k authors: Fang, Kevin title: Ruta: Dis-aggregated routing system over multi-cloud date: 2021-12-16 journal: nan DOI: nan sha: b662661f31f4f70086cd5bac2dee9b76906bfa1e doc_id: 101898 cord_uid: 8n2ao23k

Over the years, the evolution of SDN has created multiple overlay technologies, which makes end-to-end traffic engineering services inefficient and hard to deploy. Ruta is designed as a unified encapsulation with Segment Routing, crypto and NAT-traversal capabilities over UDP. Ruta can be deployed globally as a cloud-native SDN platform over multiple clouds and integrated with each application at the transport layer. It provides nearly zero loss and latency that is almost always below 200 ms to reach anywhere in the world over the internet.

The evolution of Software Defined Networking and cloud VPCs has created multiple overlay technologies over the last decade, and the different implementations separate the network into multiple domains. Meanwhile, applications require simplicity for end-to-end traffic engineering and security policy enforcement, and domain-specific SDN designs cause considerable overhead on both the control plane and the data plane.

Paper Organization: We introduce the challenges of the overlay technologies used today and the motivation for the Ruta project in Section 1. We present the control plane architecture in Section 2 and the data plane architecture in Section 3, describe the prototype implementation and demonstrate a multi-cloud deployment in Section 4, and conclude in Section 5.
draft, Jan 07-11, 2022, Shanghai, China. © 2021 Association for Computing Machinery.

A packet travelling from client to cloud requires multiple rounds of encapsulation/decapsulation and encryption/decryption, which introduce significant latency and inter-working complexity on each border network device. A packet sent by a wireless client needs to be location-agnostic, so vendors like Cisco [4] or Juniper [8] implement overlays and group-based policy for wireless converged campus networks. Each vendor has its own control-plane protocol (LISP [5], BGP-EVPN) and data plane (vxlan-gbp [12], vxlan-gpe-gbp and lisp-gpe-gbp [7]). Campus network fabric policy design follows a user-centric approach, so the group-based policy tag only contains the user or group identity.

When the packet arrives at the campus border, the SDWAN router decapsulates it and applies deep packet inspection (DPI) for application-aware routing or security policy enforcement. The packet is then encapsulated and encrypted in IPSec or DTLS with a transport VPN tag and sent to a remote on-prem datacenter or public cloud. In the SDWAN domain, policy design follows a transport-centric approach. Most SDWAN implementations are based on point-to-point tunnels, so multi-hop traffic engineering may require repeated encryption and decryption with repeated policy lookups.

Some service providers use MPLS-VPN, MPLS Segment Routing or SRv6 as the overlay. MPLS requires a dedicated underlay and SRv6 requires IPv6 links. In addition, the SRv6 forwarding mechanism does not change the source address, which may cause problems when unicast reverse path forwarding (uRPF) is enabled. As a result, most of these overlays cannot be transported directly over the internet. Segment routing over IP (RFC8663) can carry MPLS-SR over IPv4 UDP, but its forwarding mechanism does not support NAT traversal.
When the packet arrives at the on-prem datacenter, network policy design follows an application-centric approach: group-based tags such as the Endpoint Group (EPG) in the Cisco ACI architecture [3] identify application server groups. The border gateway has to update the source group ID, look up the destination group ID, and set the related VNID and group ID.

The public cloud is more complex. Virtual Private Clouds (VPCs) are organized by region and availability zone, and inter-VPC communication requires complex routing and security group policy built on VPC-Peering or Transit-VPC services. Being cloud-agnostic requires more generic overlay network support, but different cloud providers use different overlay terminology, which introduces a significant workload for multi-cloud deployment.

Inspired by the cloud-native approach, we design Ruta on a service-mesh architecture: we dis-aggregate the routing service into a distributed K-V control plane and a micro-service based data plane, and provide a unified encapsulation that meets each network domain's requirements. This could significantly reduce inter-working complexity and provide a flexible programming interface for applications.

arXiv:2112.08686v2 [cs.NI] 9 Jan 2022

However, designing a dis-aggregated routing platform is not simply a matter of using a distributed software architecture to build the controller and data plane. It requires careful trade-offs at the architecture level, because a system failure may easily cause network partitioning and loss of availability. Based on the CAP theorem [1], the design principles are as follows.

Consistency is a must-have feature in traditional networks: destination-based forwarding requires all nodes to have consistent forwarding tables, and the symptoms of inconsistent forwarding tables are well known as "micro loops" or "black holes". Meanwhile, traditional routing protocols are designed to serve destination-based forwarding.
No matter which routing protocol is used, it must maintain a routing information base (RIB) and keep it available so that the data plane can compute the forwarding information base (FIB). Most routing protocols soften partition tolerance as a trade-off. For instance, OSPF areas and BGP autonomous systems are designed to hide topology and isolate failures. At the same time, preserving availability may cost partition tolerance, and many routing protocols need to deal with "split-brain" situations. Leader arbitration and split-horizon are designed to reduce the impact of network partitions, but they have side effects on availability.

In the SDN era the control plane is centralized, many network devices need sequential consistency, and DHCP-like address assignment needs strict consistency. Controller implementation and placement under network partition become a major challenge. For instance, the recent Facebook outage [14] suggests that controller availability was the root cause.

In summary, the consistency requirement of destination-based forwarding is a trade-off made a few decades ago, when forwarding ASICs and network processors had limited forwarding capabilities and networks had limited bandwidth to carry extra information for source routing. Can we soften the consistency requirement by introducing source routing or segment routing (SR) and decoupling device configuration? Can we soften the availability requirement by running distributed path computation on each forwarding adjacency when operating in headless mode?

These are the design principles of the Ruta system. We implement an eventual-consistency model that decouples prefix announcement from path computation by using a location descriptor, and we use a local cache for distributed path computation while the control plane is offline. We select ETCD as our control plane, simplify all routing updates, and decouple the policy-related configuration into K-V pairs.
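The headless-mode idea above (keep computing paths from a local cache while the control plane is unreachable) can be illustrated with a minimal Python sketch. It is not taken from the Ruta prototype, which is written in Go. The class name `LinkStateCache`, the `fetch` callback, the `max_age_s` limit and the injectable `clock` are all hypothetical names chosen for illustration.

```python
import time


class LinkStateCache:
    """Serve the last known link state when the control plane is offline.

    `fetch` is any callable returning the current link state and raising
    OSError (for example ConnectionError) when the K-V store is unreachable.
    All names here are illustrative assumptions, not Ruta APIs.
    """

    def __init__(self, fetch, max_age_s=1200.0, clock=time.monotonic):
        self._fetch = fetch
        self._max_age_s = max_age_s
        self._clock = clock
        self._snapshot = None
        self._stamp = None

    def get(self):
        """Return (link_state, headless); headless is True for cached data."""
        try:
            fresh = self._fetch()
        except OSError:
            # Control plane unreachable: fall back to the cached snapshot,
            # unless there is none or it is older than max_age_s.
            if self._snapshot is None:
                raise
            if self._clock() - self._stamp > self._max_age_s:
                raise
            return self._snapshot, True
        self._snapshot = fresh
        self._stamp = self._clock()
        return fresh, False
```

A linecard built this way would keep forwarding on slightly stale link state during a control-plane outage, which trades consistency for availability in the way the design principles describe.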
Ruta relaxes the consistency requirement by using distributed path calculation and segment routing to decouple overlay route information from link state. A linecard can use a partial link-state database for path calculation without requiring global consistency. In some large-scale deployments, users can deploy multiple Redis-based service nodes as regional link-state databases that provide a cache service for linecards.

Segment Routing (SR) with loop-free alternates provides better resilience under network partition, but it only supports IPv6 or MPLS. RFC8663 can carry MPLS-based SR over UDP, but it lacks programmability and NAT-traversal support. Meanwhile, as a cloud-native routing system, Ruta needs a pure user-space data plane to bypass kernel overhead and simplify programmability for application developers. SRv6 may require kernel support to add the IPv6 extension header, which is not friendly to application developers.

We therefore develop a new data-plane protocol, segment routing over UDP (SRoU). The SRoU header can be integrated with QUIC to provide secure and reliable transport services. Many public clouds and internet services still use IPv4, and the SRv6 uRPF issue has to be dealt with, so we encode the original source address in SRoU and develop STUN services to resolve the IPv4 NAT-traversal problem.

The Ruta control plane makes heavy use of ETCD features such as leases, distributed locks and the transparent proxy. Each service node can act as a proxy through which other nodes connect to ETCD. All communication with ETCD requires TLS, and RBAC can be added to enhance system security. Ruta defines five roles of service node, plus an optional LSDB role, as shown in Figure 1.
• ETCD: This node runs the ETCD process to provide distributed K-V store services as the Ruta control plane. All Ruta node information, routing tables, policies and link state are represented as K-V pairs.
• Fabric: This node type is used as a middlebox to relay SRoU packets.
This node can be implemented as a DPDK-based appliance, or offload packet processing to a DPU or a pure P4-based switch to provide high-bandwidth forwarding. It must enable link probes to other Fabric nodes and report link state to the K-V store.
• Linecard: This node fetches the routing table and topology from ETCD, executes a flexible algorithm to find an available path, and encodes the SRoU SID list in each packet. Linecard nodes come in various types, providing basic transport service or integrating tightly with applications: a physical SDWAN routing box, a mobile VPN client, a sidecar proxy, a DPU, or even a QUIC socket library integrated with an application.
• STUN: This node acts as a STUN (Session Traversal Utilities for NAT, RFC5389) server to help IPv4-based Fabric and Linecard nodes retrieve their public address and port.
• Analytic: This node watches statistics from ETCD, analyzes network failures and responds proactively with policy changes based on an AIOps approach.
• LSDB (optional): This node serves as a regional link-state database and cache to mitigate ETCD performance issues in very large-scale, global deployments. It contains a Redis DB for link-state information. Fabric and Linecard nodes can use geographic information to hunt for the nearest LSDB node.

All service nodes must register with ETCD using the following K-V pair:
• Key:= "/node//" The first path component is defined in section 2.1. The second is a system-wide unique value, like a traditional Router-ID: users can continue to use the router-id in this field or use a hostname string instead.
• Value: contains Site-ID, Location and SystemLabel.
Site-ID: This field is used for site-level policy enforcement.
Location: This field stores the node's latitude and longitude. Linecard nodes can apply flexible geo-aware routing or build a random graph to reduce computational complexity in large-scale networks.
SystemLabel: This is a 24-bit field for Segment ID compression and MPLS interworking.
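As an illustration of how the Location field could be used to hunt for the nearest LSDB node, the following Python sketch picks the closest candidate by great-circle distance. The haversine formula is a standard choice rather than something the paper specifies, and the function names, node names and coordinates are assumptions made for this example.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in km between two (latitude, longitude) pairs in degrees."""
    lat1, lon1 = math.radians(a[0]), math.radians(a[1])
    lat2, lon2 = math.radians(b[0]), math.radians(b[1])
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def nearest_node(own_location, candidates):
    """Return the name of the candidate closest to own_location.

    candidates maps a node name to its (latitude, longitude), i.e. the
    Location value each node registered.
    """
    return min(candidates, key=lambda name: haversine_km(own_location, candidates[name]))


# Hypothetical LSDB nodes with approximate city coordinates.
lsdb_nodes = {
    "lsdb-frankfurt": (50.11, 8.68),
    "lsdb-guangzhou": (23.13, 113.26),
}
closest = nearest_node((31.23, 121.47), lsdb_nodes)  # a node located near Shanghai
```

Geographic distance is only a proxy for network distance, so a real deployment would likely confirm the choice with an active probe.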
Label assignment is implemented with the distributed lock mechanism of ETCD: each node acquires the lock, assigns itself the smallest unused number as its SystemLabel, and registers it back to ETCD in the SystemLabel field.

Each service node has multiple interfaces, and Ruta uses a Service Locator (SLoC) to describe them. After node registration, a service node must send a Service Location Route for service discovery.
• Key:= "/service//"
• Value: The SLoC data structure is used in this field.
A service node can use a prefix-based fetch of /service/ for service hunting. For instance, a newly on-boarded device that needs STUN service may fetch the prefix /service/STUN from ETCD to find the STUN server's public IP address and port.

When a Fabric node finishes the on-boarding process, it must fetch /service/lsdb to discover the link-state database node. It must then use a prefix-based fetch of /service/Fabric to discover the other Fabric nodes and start full-mesh link-state probing. In some large-scale deployments, a Fabric node can configure a white-list to reduce full-mesh probing. Probe results are written to the LSDB Redis DB. If the system does not contain an LSDB node, they must be sent to ETCD. The link-state K-V pair is shown below:
• Key:= "/stats/linkstate/< ->"
• Value: The Ruta link-state probe uses the algorithm of the Two-Way Active Measurement Protocol (RFC5357) [9]. It provides two-way delay, jitter and loss measurement, and also reports link utilization and up/down status.

Inspired by the Locator/Identifier Separation Protocol (LISP), a Ruta service route does not contain any explicit next hop but simply maps a uniform resource identifier (URI) to a SLoC. The Ruta URI service route framework is designed not only for packet routing but also for various new services (e.g. multi-cloud RPC, edge computing nodes). For multi-cloud packet routing services, Ruta can carry an EVPN route as the resource identifier.
• Key:="/route////*"
The usage of Route Target (RT) and Route Distinguisher (RD) is the same as in BGP-EVPN. The RT mechanism can be implemented by watching the "/route///" prefix. The EVPN route URIs are listed below:

Type          Key
EVPN Type2    /route/2///MAC/IP
EVPN Type5    /route/5///IPPrefix/Mask

Unlike traditional routing protocols, Ruta is used for overlay transport. Its path computation logic is more like Google Maps navigation: it can support various algorithms to meet different SLAs. It also does not require a consistent link-state database, which is very useful during network failures. By default, a linecard builds the destination SLoC list from EVPN routes and only uses active link-state probes towards each destination service node. If a route has an SLA violation, the linecard can use a fabric-node list or randomly selected fabric nodes to fetch the related link state and execute local path computation. The path computation engine generates a SLoC list for each EVPN route, and this list is encoded in the SRoU header.

Service node keepalive uses the ETCD lease function. When a node fails to refresh its lease, ETCD automatically withdraws the related information. To account for controller availability we use two different lease times. The first lease time (60 to 120 seconds) is used for service node keepalive. The second lease time (600 to 1200 seconds) is used for link state and service routes.
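The local path computation described above can be sketched in Python as a delay-weighted shortest-path search over whatever partial link state a linecard has fetched, bounded by a maximum number of segments. This is an illustrative sketch, not the prototype's algorithm: the function name `compute_path`, the input layout and the node names are assumptions, and the per-link delays are made-up values chosen so that the relayed path sums to the 240 ms figure the paper reports for the Guangzhou relay.

```python
import heapq


def compute_path(links, src, dst, max_segments=4):
    """Lowest-delay path from src to dst using at most max_segments hops.

    links maps (from_node, to_node) to a measured delay in ms. The map may
    be partial, since no globally consistent link-state view is assumed.
    Returns (total_delay_ms, [hop, ..., dst]) or None if dst is unreachable
    within the segment limit. The hop list stands in for the SLoC list.
    """
    adjacency = {}
    for (a, b), delay in links.items():
        adjacency.setdefault(a, []).append((b, delay))

    # Dijkstra over (node, hops_used) states so the hop limit is respected.
    heap = [(0.0, 0, src, ())]
    best = {(src, 0): 0.0}
    while heap:
        delay, hops, node, path = heapq.heappop(heap)
        if node == dst:
            return delay, list(path)
        if delay > best.get((node, hops), float("inf")):
            continue  # stale heap entry
        if hops == max_segments:
            continue  # segment budget exhausted
        for nxt, link_delay in adjacency.get(node, []):
            new_delay = delay + link_delay
            state = (nxt, hops + 1)
            if new_delay < best.get(state, float("inf")):
                best[state] = new_delay
                heapq.heappush(heap, (new_delay, hops + 1, nxt, path + (nxt,)))
    return None


# Hypothetical probe results in milliseconds.
probes = {
    ("frankfurt", "chengdu"): 385,
    ("frankfurt", "guangzhou"): 200,
    ("guangzhou", "chengdu"): 40,
}
relayed = compute_path(probes, "frankfurt", "chengdu")
direct_only = compute_path(probes, "frankfurt", "chengdu", max_segments=1)
```

Swapping the delay for loss or utilization, or combining them, would give different SLA-specific algorithms without changing the search itself.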
Endpoint identity: Each endpoint may have its own identity or group policy tags, which can be updated with the following K-V pair:
Key:="/identity/userid/device-id"
Value:= "group policy tags"
Group-based policy: Each node may use the following K-V pair for group-based micro-segmentation:
Key:= "/control/group/ / "
Value:= "Action"|"SLoC list"
Route control: The network operator can push control policy to the entire system using:
Key:= "/control/RT/2/ / / / "
Key:= "/control/RT/5/ / / / "
Value:= "Action"|"SLoC list"

The Ruta data plane builds on the programmable SRH concept of SRv6 but moves the SRH after the UDP header. This new encapsulation is called SR over UDP (SRoU). It adds an explicit FlowID field for micro-segmentation, and the source IP address and port are added for NAT traversal. The SRoU header is defined as below:
[SRoU header layout figure; the header ends with optional TLVs]
• Optional TLV: 0x0 is used for padding, 0x1 for SR Integrity, 0x2 for PathTelemetry.

The SRoU OAM message format is defined as below:
OAM type:=0x0 is used for link state; subtype 0x0 is a Linkstate Request and subtype 0x1 is a Linkstate Response. The OAM payload contains , ,, and .
OAM type:=0x1 is reserved for trace route.
OAM type:=0x2 is used for STUN service.

As in SRv6 Network Programming [2], Ruta defines a virtual SLoC for network programming. For IPv6 it is the same as RFC8986. IPv4-based SLoCs in Ruta follow the following definition:

We implement End.DT2U and End.DT4 in our prototype to provide VPN services.

We implement the Ruta prototype in native Go with nearly 9000 lines of code. Cross-compilation supports multiple platforms (x86/ARM-based Linux, and MIPS- and ARM-based OpenWRT). We also patch quic-go to support the QUIC transport protocol over Ruta. We deploy Ruta with 2 linecard nodes as leaf ToR switches and 2 fabric nodes as spines. The topology is shown below, and demo code is available in [11].
The ETCD cluster can be deployed in multiple regions for better resilience. Each Ruta node can run in etcd proxy mode to help other nodes register with the ETCD cluster over in-band communication. For instance, the Spine_Fabric node can be the etcd proxy for the Linecard nodes in this deployment scenario. A linecard may have multiple uplinks, which can be represented and encoded as different colors and UDP ports in the SLoC. The following chart shows the information retrieved from ETCD.

The linecard learns MAC and IP address information from hosts and announces EVPN Type-2 routes in ETCD. Type-5 routes are learned from local configuration or from route redistribution by other protocols. FlexAlgo can be added in each linecard. In the general case, H1->H2 communication can be encoded by Linecard1 with only one segment. The total SRoU header length is 24 bytes, equal to the SRv6 SRH, but it runs on an IPv4 underlay, which is more efficient than SRv6. Compared with VXLAN, the SRoU FlowID field can be used as an application-aware tag or group-based policy header and can easily support traffic engineering, whereas VXLAN needs multiple NSH headers instead. For instance, when the network is congested, the linecard can add more SLoCs to the segment list for traffic engineering. Spine-A will invoke the SRoU stack to relay the packet to Linecard_B.

Real-time collaboration (RTC) tools such as Cisco Webex, Zoom and Microsoft Teams have been widely used during COVID-19, but the performance of direct internet connections does not meet the applications' requirements. Internet routing is based on the BGP AS-PATH, and congestion is frequently observed between ASes. Can we build a pinhole between service providers without changing any BGP routes?
A multi-homed Ruta fabric node with SRoU encapsulation can easily relay packets between service providers. Public cloud providers are themselves usually multi-homed to traditional service providers and internet exchanges, so deploying virtual machines on public clouds as Ruta fabric nodes can significantly improve RTC performance. Multi-cloud deployment is particularly useful: for instance, from the Alicloud Frankfurt region to the Alicloud Chengdu region, a direct internet connection over elastic IPs has 385 milliseconds of latency, whereas relaying at the Tencent Cloud Guangzhou region lowers the latency to 240 milliseconds.

Finally, we deploy 20 fabric nodes across multiple public cloud providers (Alicloud and Tencent) around the world. The results show that Ruta can significantly reduce latency and packet drops: it provides nearly zero loss and latency that is almost always below 200 ms to reach anywhere in the world over the internet with a maximum of 4 segments. The full result is available in [10].

QUIC is a reliable and secure transport protocol in user space. We implement the SRoU function with quic-go to enhance network programmability and provide traffic engineering and multi-path forwarding over the internet and direct traversal of VPCs. In native socket mode, a server may register with Ruta ETCD to provide application awareness. In a K8s deployment in particular, Ruta ETCD can sync the server's identity directly with the K8s ETCD. A client is not required to register with ETCD: a simple DNS SRV record can be used to announce the edge_fabric and transit_fabric. A client that selects edge fabric 1 can send the packet in the following format. For network security, edge fabric 1 can allocate a time-based token to the client, and this token can be stored in the FlowID field.
Packet format (partially recovered): final server IP and port; segment list [1]: transit fabric node; inner payload: QUIC packet.
Table 3: Service Locator (SLoC) Field

The client does not need to connect to a STUN server for public address discovery. It simply fills zeros into the SRoU source address and port fields. The first-hop fabric node copies the underlay source address into the SRoU header for NAT traversal. When the server receives the packet and finds that the first 8 bits of the UDP payload are all zero, it copies the source address and port from the SRoU header into the outer IP header and trims the SRoU header before delivering the packet to the application. Native socket mode supports not only QUIC packets but also TCP and other UDP encapsulations.

Existing VPC designs require the cloud service provider to manage the overlay routing table, which creates scale challenges for network devices [13]. Many hybrid cloud solutions require an SDWAN or IPSec VPN gateway to bring traffic into the VPC, but it is very hard to make the overlay VPC routing table interwork with an existing private network. Meanwhile, each cloud provider has its own micro-segmentation policy design, which may demand significant interworking effort. Ruta provides a cloud-native approach to these challenges: enabling a Ruta linecard in each VPC and providing Segment Routing on the overlay makes the VPC more transparent, and the cloud provider can significantly reduce overlay routing control and offload the service gateway to each host or to a dedicated DPU. This dis-aggregated routing system could benefit edge cloud and hybrid cloud deployments in the future.

Some massive-scale cloud service providers could use Ruta as a distributed SDN solution. A Ruta system (with multiple linecards and Fabric nodes) can be treated as a single Linecard with multiple SLoCs that joins another Ruta system. Alternatively, multiple Ruta systems can use BGP-EVPN to share routing information, in the same way as Inter-AS Option C MPLS VPN.
Ruta is a dis-aggregated routing system based on a cloud-native approach. We develop a new K-V based control plane that decouples configuration and management complexity and supports distributed deployment, which addresses the challenges of a centralized SDN controller. We also provide a Segment Routing based transport layer that delivers nearly zero loss and latency that is almost always below 200 ms to reach anywhere in the world over the internet. We enable native socket support so that endpoints can use QUIC multi-path capabilities, and we simplify cloud VPC deployment by providing a transparent VPC that supports hybrid-cloud deployment.

Ruta is a first step towards a cloud-native datapath. We believe it could be used in more scenarios, such as NetDAM [6] for HPC, new container network interfaces and cloud-agnostic RPC frameworks. We are exploring P4-based switch and DPU offload, serverless computing and data-stream processing use cases for Ruta in the future.

It took a village to make Ruta possible. We thank Feng Cai, Yanhuan Mao, Yinghao Li, Bin Shi, Yijen Wang, Xing Jiang, Yin Wang and Sam Gao for their support of this project.

[1] CAP theorem
[2] Segment Routing over IPv6 (SRv6) Network Programming
[3] Cisco Application Centric Infrastructure
[5] Locator/ID Separation Protocol
[6] NetDAM: Network Direct Attached Memory with Programmable In-Memory Computing ISA. (2021)
[7] Group Policy Encoding with VXLAN-GPE and LISP-GPE
[8] Juniper Mist AI solution
[9] Two-Way Active Measurement Protocol
[10] Ruta Multicloud optimization result
[11] Ruta Spine Leaf demo
[12] VXLAN Group Policy Option
[13] Tao Huang, and Shunmin Zhu. 2021. Sailfish: Accelerating Cloud-Scale Multi-Tenant Multi-Service Gateways with Programmable Switches