Application of Network Telemetry in RoCE Network Optimization

As data center Ethernet networks grow in scale, single-port bandwidth has increased from 10G and 25G to 100G and beyond. At the same time, network deployments have become more complex and users' expectations for service quality continue to rise, so network operation and maintenance must become more fine-grained and intelligent. Today's network operation and maintenance faces the following challenges:
With the emergence of very large data centers, more and more network devices must be managed, and the volume of information to be monitored is very large.
RDMA (remote direct memory access) and the widespread adoption of lossless Ethernet enable microsecond-level forwarding delay from compute nodes to storage nodes, which greatly improves end-to-end service forwarding performance. This in turn places higher demands on network operation and maintenance, which must locate faults quickly, within seconds or even sub-seconds.
More and more data in the network must be monitored, and at finer granularity, in order to reflect network conditions completely and accurately, predict possible faults, and provide a solid data basis for network optimization. Beyond interface traffic statistics, per-flow packet loss, and CPU and memory usage, network operation and maintenance also needs to monitor the delay jitter of each flow, the delay of each packet along its transmission path, the buffer occupancy on each device, and so on.
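As a minimal sketch of one of the finer-grained metrics mentioned above, per-flow delay jitter can be derived from per-packet delay samples. The example assumes jitter is defined as the mean absolute difference between consecutive delay samples of the same flow, which is only one of several possible definitions. The function name, flow identifiers, and delay values are illustrative and not taken from any particular device or collector.

```python
from collections import defaultdict
from statistics import mean


def per_flow_jitter(samples):
    """Compute delay jitter per flow.

    samples: iterable of (flow_id, delay_us) tuples in arrival order.
    Returns {flow_id: mean absolute difference between consecutive delays}.
    Assumption: this simple definition of jitter is chosen for illustration.
    """
    delays = defaultdict(list)
    for flow_id, delay_us in samples:
        delays[flow_id].append(delay_us)

    jitter = {}
    for flow_id, values in delays.items():
        if len(values) < 2:
            # A single sample carries no information about variation.
            jitter[flow_id] = 0.0
        else:
            diffs = [abs(b - a) for a, b in zip(values, values[1:])]
            jitter[flow_id] = mean(diffs)
    return jitter


# Hypothetical per-packet delay samples in microseconds.
samples = [
    ("flow-a", 10.0),
    ("flow-b", 12.0),
    ("flow-a", 14.0),
    ("flow-a", 11.0),
    ("flow-b", 12.5),
]
result = per_flow_jitter(samples)
print(result)
```

Keeping the samples grouped per flow is what makes this a per-flow measurement: an aggregate interface counter would hide the variation inside each individual flow.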
Given these requirements, traditional network monitoring technologies (such as SNMP, CLI, and logs) have the following shortcomings:
Traditional network monitoring mainly uses a "pull mode" to obtain data, that is, the monitoring system sends requests to devices to retrieve data. This limits the number of network devices that can be monitored and prevents data from being obtained quickly.
Although SNMP traps and logs use a "push mode", in which the device actively reports data to the monitoring system, they report only events and alarms. The monitored data is therefore extremely limited and cannot accurately reflect network status.
Telemetry is a remote data collection technology for monitoring device performance and faults. It uses a "push mode" to obtain rich monitoring data in a timely manner, which enables rapid location of network faults and thereby addresses the operation and maintenance problems described above.
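The difference between the two modes can be illustrated with a small in-memory sketch. It is not a real telemetry protocol: the `Device` class, its methods, the device name, and the counter values are all hypothetical, and a production deployment would stream data over a network transport rather than call a Python callback. The sketch only shows that a collector which pulls once per polling cycle sees a single snapshot, while a subscriber in push mode receives every sample the device produces.

```python
class Device:
    """Hypothetical device that exposes one traffic counter."""

    def __init__(self, name):
        self.name = name
        self.tx_packets = 0
        self.subscribers = []

    def read(self):
        # Pull mode: answer a single request with the current state.
        return {"device": self.name, "tx_packets": self.tx_packets}

    def subscribe(self, callback):
        # Push mode: register a collector callback once.
        self.subscribers.append(callback)

    def tick(self, packets):
        # One sampling period elapses; push the new sample to subscribers.
        self.tx_packets += packets
        for callback in self.subscribers:
            callback(self.read())


device = Device("leaf-1")

pushed = []
device.subscribe(pushed.append)  # push-mode collector

# Five sampling periods pass; traffic bursts in the third one.
for packets in [10, 10, 100, 10, 20]:
    device.tick(packets)

# Pull-mode collector polls only once, at the end of its polling cycle.
pulled = [device.read()]

print(len(pulled), "sample via pull:", pulled[-1])
print(len(pushed), "samples via push:", [s["tx_packets"] for s in pushed])
```

In this sketch the pull-mode collector sees only the final counter value, so the burst in the third period is invisible to it, whereas the pushed samples show exactly when the counter jumped.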
Period: 5 Mar 2023 to 14 Dec 2023
