Scalable and Flexible Routing Service for Tencent Cloud Access Network
Allen Lv, Tencent
Aug 2020
Agenda
• Challenges
• Architecture
• Design Details
• Experience and Future Work
Tencent Cloud Infrastructure Overview
• 54+ AZs
• 27+ Regions
• 100T+ public network bandwidth reserved
• 15EB+ storage
[Diagram labels: Enterprise Branch; Private Line; VPC; CVM; CDB]
| Tencent Cloud Access Network Overview
[Diagram labels: Tencent Cloud Network with VPC, CVM, and CDB; Access Site 1; Access Site 2; Enterprise Branch; Custom IDC; Private Line; VPN; ISP; Tencent Internet Exchange (TIX)]
Challenges
• Massive scale forwarding table, VRFs, Tunnels…
• Roll out network features fast
• Scale up easily for rapid growth of traffic volume
• Low Cost
| Traditional Commodity Router
• Hardware & Software Vendor Lock-in
• Hard to Scale
• Lack of feature velocity
• High Cost
[Diagram labels: Line Card (three shown); Primary Processor; Secondary Processor; Switching Fabric]
| Design Philosophies
• Scalability, each component scales up independently on demand
• Flexibility, fast feature delivery (~2 weeks)
• Reliability, NSF, NSR, fast failover
• Operationality
Software Defined Router(SDR) Overview
• Orchestrator: NFV Based Orchestrator
• Control Plane: NFV Based Controller
• Routing Plane: NFV Based Routing System
• Data Plane: NFV Based Forwarding System
[Diagram labels: Overlay; Underlay; CS and AS switches; ISP (eBGP); Customer Router (eBGP); CPE (IPsec); Tencent Cloud; Tencent FW; Tencent DDoS]
Software Defined Router(SDR) Inside
• NGW: Data Plane
• BGP: Routing Plane
• RNSO: Control Plane
• GNSO: Orchestrator
[Diagram labels: External Router; Edge Access (EA); NGW; BGP; RNSO; GNSO; OSS/BSS; VPC. Link labels: BGP/BFD; FIB/ARP; config/monitor; gRPC; TGRE; VxLAN]
| Customer Access (Private-Line GW & VPNGW)
Interoperating with both Traditional Network and SDN-Based Network at large scale
[Diagram labels: Customer Router; Internet; EA; PLGW (SDR); VPNGW (SDR); BGP Session; VPC 10.0.0.0/16; Traditional Network; SDN-Based Network]
| End-user Access (Tencent Internet Exchange)
Large scale forwarding table (10M) and flexible Traffic Engineering
[Diagram labels: ISP Router1; ISP Router2; BGP Session; EA1; TIX1 (SDR); EA2; TIX2 (SDR); VxLAN Fabric; VPC1 115.159.246.0/24; VPC2 116.150.247.0/24]
| Flexibility – On-Demand Traffic Engineering
• Flexible traffic engineering based on user demand
[Diagram labels: Site1; Site2; SDR1; SDR2; VPC; VxLAN Fabric; External Peer 1 to External Peer 4; BGP route]
<SIP,DIP> ---> <SDR2, VNI>
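The rule `<SIP,DIP> ---> <SDR2, VNI>` can be read as a policy entry that overrides the default route for one source/destination pair. The following is a minimal sketch of that idea, not the SDR implementation: it assumes exact-match host addresses and a single default next hop standing in for the BGP-selected path. The class names, addresses, and VNI values are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class NextHop:
    """Where to send a packet: a target node and a VxLAN network identifier."""
    node: str
    vni: int


class TrafficEngineeringTable:
    """Hypothetical policy table keyed on (source IP, destination IP)."""

    def __init__(self, default: NextHop) -> None:
        self._default = default  # stands in for the BGP-selected path
        self._policies: Dict[Tuple[str, str], NextHop] = {}

    def add_policy(self, sip: str, dip: str, hop: NextHop) -> None:
        self._policies[(sip, dip)] = hop

    def lookup(self, sip: str, dip: str) -> NextHop:
        # A matching <SIP,DIP> policy wins; otherwise use the default path.
        return self._policies.get((sip, dip), self._default)


# Illustrative values only (documentation address ranges, made-up VNIs).
table = TrafficEngineeringTable(default=NextHop("SDR1", 100))
table.add_policy("10.0.0.5", "203.0.113.9", NextHop("SDR2", 200))
```

Here `table.lookup("10.0.0.5", "203.0.113.9")` returns the SDR2 next hop, while any other pair falls back to the default.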
| Flexibility - FW Service
• Support >100k flex rules for FW purpose
[Diagram labels: External Router; EA; SDR; Data Plane; VxLAN Fabric; FW Service; VPC]
<DIP> --> <FW, VNI>
<SIP> --> <FW, VNI>
| Flexibility - DDoS Service
[Diagram labels: External Router; BGP route 180.10.1.1/32; EA; SDR; Data Plane; DDoS Service; VPC]
180.10.1.1/32, DDoS
| Flexibility - DDoS Service
• Redirect attack traffic to DDoS service efficiently
[Diagram labels: External Router; BGP route 180.10.1.1/32; EA; SDR; DDoS Service; VPC]
180.10.1.1/32, DDoS
0.0.0.0/0, DP
Data Plane: only processing the real traffic
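The two entries on this slide, `180.10.1.1/32, DDoS` and `0.0.0.0/0, DP`, behave like a longest-prefix-match table: the attacked host address is steered to the DDoS service and everything else goes to the Data Plane. The sketch below illustrates that lookup with Python's standard `ipaddress` module. Treating the entries as a table held at the EA is an assumption drawn from the diagram layout, and the function and variable names are hypothetical.

```python
import ipaddress
from typing import List, Tuple

# Entries as shown on the slide: the attacked /32 goes to the DDoS service,
# the default route goes to the Data Plane (DP).
EA_TABLE: List[Tuple[ipaddress.IPv4Network, str]] = [
    (ipaddress.ip_network("180.10.1.1/32"), "DDoS"),
    (ipaddress.ip_network("0.0.0.0/0"), "DP"),
]


def lookup(dst: str) -> str:
    """Return the target of the longest prefix that contains dst."""
    addr = ipaddress.ip_address(dst)
    matches = [(net.prefixlen, target) for net, target in EA_TABLE if addr in net]
    # The default route always matches, so `matches` is never empty here.
    return max(matches)[1]
```

Only traffic for the exact attacked address takes the detour; a neighbouring address such as 180.10.1.2 still resolves to the default entry.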
| Flexibility - Interoperability
• Interoperate with existing traditional routers
• Speed up deployment of SDR
[Diagram labels: External Router; eBGP; SDR; Routing Plane; Data Plane; Existing Commodity Router; MPLS Fabric; MPLS Switch; VPC]
| Scalability
• Each component scales independently
• Each network can be operated independently
• 3.2Tbps forwarding capacity
• NGW: Data Plane
• FCR: Routing Plane
• RNSO: Control Plane
• GNSO: Orchestrator
[Diagram labels: CS and AS switches; NGW, FCR, RNSO, and GNSO clusters; eBGP sessions]
| Scalability - Hardware Acceleration
• Introduce programmable switch for hardware acceleration
• > 10Tbps forwarding capacity
[Diagram labels: External Router; EA; SDR; Data Plane; Control Plane; Tencent SmartSwitch; VPC. Flow labels: Elephant flow info; Flow offloading; Static flow info for high volume traffic]
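The diagram labels suggest a loop in which elephant flow information is collected, and the heaviest flows are then offloaded to the programmable switch. The sketch below shows one way such a selection could work. It is an illustration under stated assumptions, not Tencent's design: the byte threshold, the two-field flow key, and the `SmartSwitchStub` class are all hypothetical.

```python
from collections import Counter
from typing import Iterable, Set, Tuple

Flow = Tuple[str, str]  # (source IP, destination IP); a simplified flow key


def select_elephant_flows(samples: Iterable[Tuple[Flow, int]],
                          threshold_bytes: int) -> Set[Flow]:
    """Sum bytes per flow and return the flows at or above the threshold."""
    totals: Counter = Counter()
    for flow, nbytes in samples:
        totals[flow] += nbytes
    return {flow for flow, total in totals.items() if total >= threshold_bytes}


class SmartSwitchStub:
    """Stand-in for a programmable switch's offload table (hypothetical)."""

    def __init__(self) -> None:
        self.offloaded: Set[Flow] = set()

    def install(self, flows: Set[Flow]) -> None:
        self.offloaded |= flows

    def handles(self, flow: Flow) -> bool:
        # Offloaded flows stay in hardware; all others go to the Data Plane.
        return flow in self.offloaded


# Illustrative byte counts; the first flow totals 11,000,000 bytes.
samples = [
    (("10.0.0.1", "198.51.100.7"), 6_000_000),
    (("10.0.0.2", "198.51.100.8"), 1_500),
    (("10.0.0.1", "198.51.100.7"), 5_000_000),
]
switch = SmartSwitchStub()
switch.install(select_elephant_flows(samples, threshold_bytes=10_000_000))
```

With these samples only the first flow crosses the threshold and is installed on the switch stub; the small flow remains on the software path.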
| Reliability – NSF & NSR
• Single node failure will not affect the system
• Data Plane supports Non-stop forwarding (NSF)
• Routing Plane supports Non-Stop Routing (NSR)
[Diagram labels: External Router1; External Router2; Routing System (Routing Plane1, Routing Plane2); Control System (Control Plane); Forwarding System (Data Plane, four NGW nodes)]
| Operationality - Monitoring
• 3-level Data Plane probing
• Critical resources monitoring
• Various statistics and events
[Diagram labels: Data Plane cluster; server0 and server1, each with core0 … corex; RMOS]
Health check levels: Cluster Level; Server Level; Core Level
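The three probing levels (core, server, cluster) can be pictured as a roll-up of per-core probe results. The sketch below is illustrative only. The roll-up rule is an assumption, not something the slides state: a server counts as failed when none of its cores answered, and the cluster is considered OK while at least one server is healthy. All names are hypothetical.

```python
from typing import Dict, List, Tuple

# Probe results: server name -> core name -> True if the probe was answered.
Probes = Dict[str, Dict[str, bool]]


def health_report(probes: Probes) -> Tuple[List[str], List[str], bool]:
    """Return (failed cores, failed servers, cluster_ok).

    Assumed roll-up rule: a server counts as failed when none of its cores
    answered, and the cluster is OK while at least one server is healthy.
    """
    failed_cores: List[str] = []
    failed_servers: List[str] = []
    for server, cores in probes.items():
        down = [core for core, ok in cores.items() if not ok]
        failed_cores += [f"{server}/{core}" for core in down]
        if len(down) == len(cores):
            failed_servers.append(server)
    cluster_ok = len(failed_servers) < len(probes)
    return failed_cores, failed_servers, cluster_ok


report = health_report({
    "server0": {"core0": True, "core1": False},
    "server1": {"core0": False, "core1": False},
})
```

Reporting each level separately lets an operator distinguish a single stuck core from a dead server or a cluster-wide outage.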
| Operational Experience
• Move manual configurations to centralized orchestrator as much as possible.
• Provide robust “One-Click” operation to quickly turn off the whole system.
• Keep the message queues among different components reliable and efficient.
| Future Work
• End-to-End network quality detection and analysis system for different network layers
• Automatic traffic engineering based on more network metrics like latency, link utilization…
• Simulation and verification system to detect and fix abnormal behaviors in advance
| Conclusion
• Disaggregate functionalities into individual components
• High scalability of each component at each level
• Fast feature velocity via software programming
• Low Cost
[Diagram labels: switch, Data Plane, Routing Plane, Control Plane, and Orchestrator, each shown as a repeated set (…); Scalability; Flexibility]
Thanks