migrating c2 functions to an ip-based virtual architecture...

Migrating Traditional Command and Control (C2) Functions to an IP-Based Virtual Architecture with Digital IF

Document ID: RTL-MWP-077 Date: 23 February 2016

12515 Academy Ridge View • Colorado Springs, CO 80921 • (719) 598-2801 8591 Prairie Trail Drive • Englewood, CO 80112 • (303) 703-3834

41430 Sullyfield Circle, Suite E-1 • Chantilly, VA 20151 • (703) 488-2500 [email protected] • www.rtlogic.com

Real Time Logic, Inc., A Kratos Company

mailto:[email protected]

http://www.rtlogic.com/

Migrating C2 Functions to an IP-Based Virtual Architecture with Digital IF

©Real Time Logic, Inc. 1

New applications of signal digitizers within ground segment infrastructure are bringing welcome changes to how satellite and UAV providers and operators command, control, and distribute data from their systems. The development of Intermediate Frequency (IF) digitizers allows analog spectrum samples to be captured and packetized. This data can be transported using commodity IP network products. Leveraging cloud-based processing services, system functions such as signal monitoring, modulation, demodulation, and recording, once located exclusively within a ground station architecture, can now be provided as a service to any consumer who has access to the Internet.

With this new capability come new challenges. IP networks have non-deterministic characteristics that must be overcome to provide the quality of service levels needed by these systems. Packet loss, re-ordering, duplication, and jitter threaten the integrity of mission data and compromise command and control. This paper shares lessons learned during the deployment of a Digital IF C2 architecture within a cloud-based processing system over the public internet. It examines some of the complications in applying Digital IF and IP-based architectures to satellite and UAV systems, and presents techniques for overcoming these obstacles.

Introduction Traditional ground station architecture consists of an antenna, amplifiers, frequency converters, and a string of RF switches, modems, and other processing equipment. With the advent of Intermediate Frequency (IF) digitizers and cloud processing resources, it is possible to virtualize much of the traditional ground station. In an IP-based architecture, the output of the frequency converter is fed into an IF digitizer. The digitizer converts between an IF signal and an IP packet stream. This packet stream can be easily transported using commodity network equipment, and can even be sent to a virtual processing center hosted in the cloud.

The advantages of such an architecture are numerous. First, RF switching equipment is expensive and inherently lossy. Handling the switching digitally reduces complexity, cost, and improves performance. Secondly, by replacing the modem and all RF downstream processing equipment with an IF digitizer, ground station hardware footprint is dramatically reduced. This allows for greater portability and ease of deployment. In some cases, processing



equipment contains encryption algorithms or other sensitive components, and relocating this equipment to a more secure location is advantageous.

Further, by removing the tight coupling between the antenna and downstream processing elements, reliability is increased and disaster recovery is simplified. If an antenna site fails or experiences interference, the processing center can easily connect to a different antenna to control the asset. In addition, two antennas can receive from a single asset and relay the IF stream to a single processing center where the signals can be combined. Because cloud resources are hosted virtually, they can be spread across multiple physical machines and locations. Maintenance of processing elements is conducted at a central location, or outsourced entirely to the cloud provider.

Deployment While this technology has been demonstrated in the lab, it was unknown how it would perform over long-haul, public internet links. To test this, the following string was deployed in Englewood, Colorado. First, a modem was configured to modulate a 2 Mbps pseudo-random bit stream (PRBS) with quadrature phase-shift keying (QPSK) centered at 70 MHz. This IF stream, which simulates the output of a down-converter, was fed into a 70 MHz RT Logic SpectralNet digitizer. The digitizer was configured with a bandwidth of 2 MHz and used six bits per sample, resulting in a 28 Mbps UDP stream.

For the cloud platform, RT Logic’s software modem was deployed on Amazon’s Elastic Compute Cloud (EC2) service hosted in Sydney. The virtual machine (VM) type was a c4.large , a dual-core 2.9GHz system that demodulated the QPSK stream using roughly half of its processing power.

While the software modem was able to demodulate the PRBS without bit errors, the stream took repeated lock losses. Over the course of 64 hours, the receiver lost lock 70 times and never stayed locked for more than four hours. As UDP provides no reliability mechanism this was not entirely unexpected. However, TCP could only sustain a throughput of 12 Mbps on this Colorado to Sydney link, so it was not a viable alternative for an application with any sensitivity to latency. TCP could support the 28 Mbps rate from Colorado to an identical VM hosted in Oregon, but this receiver still repeatedly lost lock over the 24 hour test period. Closer inspection revealed that on ten occasions, data was delayed by more than one second, and on three occasions data was delayed by more than six seconds. These delays caused the



source side buffer to overflow, resulting in loss of data. While tuning of the source buffers might allow a connection from Colorado to Oregon to succeed, a more widely deployable and lower latency solution was desired. Without any insight into the network itself, it is difficult to determine exactly how and why data was being lost and what might be done to correct it.

Measurement To gain an understanding of the performance of a wide variety of links, VMs were established around the world and across three cloud service providers. VM sites with Amazon EC2 were Northern Virginia, Oregon, Sao Paulo, Frankfurt, and Sydney. With Google’s Compute Engine, sites in their Central US, Western Europe, and East Asia zones were created. These locations are not disclosed, but are likely Mayes County, OK, Dublin, Ireland, and Changhua County, Taiwan. Finally, a Digital Ocean VM in New York City was added. RT Logic’s DataDefender product provides the capability to generate IP streams and collect performance metrics. DataDefender simultaneously sent a UDP stream of 1500 byte packets at 1 Mbps to each of the nine sites from Englewood, Colorado via a dedicated 50 Mbps Century Link business grade link. Additionally, another 1 Mbps stream was sent to each site via a heavily-used corporate link. DataDefender collected detailed performance metrics for each of these streams by tagging each packet in a stream with a sequence number. Every 100 milliseconds, DataDefender generated a time tag denoting the release time of the current packet . The sequence numbers facilitate the detection of packet loss and re-ordering, while the periodic time tag enables the measurement of one-way link latency. These streams were allowed to run for five days, each transferring 36 million packets and 50 GB of data. Results are shown in Table 1.

Table 5. Results From 5 Day One Mbps Test

Provider Location Avg

Latency (ms)

Max Sus. Latency

(ms)

Max Latency

(ms) Total PLR

Worst PLR

Max Burst Loss

Total ROR

Worst ROR

Max OOO Time (ms)

Dedicated 50 Mbps Network

Amazon N. Virginia 22 28 987 5.4E-08 1.9E-05 1 0 0 0

Amazon Sao Paulo 92 118 1218 4.6E-07 3.3E-04 17 0 0 0

Amazon Oregon 21 35 1193 4.5E-06 5.8E-04 8 0 0 0




Latency (ms)

Max Sus. Latency

(ms)

Max Latency

(ms) Total PLR

Worst PLR

Max Burst Loss

Total ROR

Worst ROR

Max OOO Time (ms)

Amazon Sydney 88 90 1193 2.8E-05 1.8E-03 94 0 0 0

Amazon Frankfurt 71 98 1650 6.2E-04 8.8E-02 146 1.4E-07 5.8E-06 30

Google Central US 7 9 2110 1.4E-03 1.1E-02 11 5.3E-04 1.1E-03 976

Google W. Europe 62 95 390 1.4E-03 1.4E-02 339 6.6E-06 5.8E-05 279

Google East Asia 81 88 246 3.8E-05 1.7E-03 16 1.5E-05 3.5E-04 8

Digital Ocean New York 25 42 246 1.0E-06 6.3E-04 33 0 0 0

Corporate Network

Amazon N. Virginia 51 59 1079 1.6E-04 9.2E-04 46 2.7E-08 1.9E-05 1


Amazon Oregon 34 43 1196 1.4E-04 9.0E-04 9 1.4E-07 5.8E-05 12

Amazon Sydney 107 115 1171 2.9E-04 1.8E-03 87 1.1E-07 5.8E-05 17

Amazon Frankfurt 107 127 1673 1.9E-04 3.4E-03 105 1.9E-05 1.9E-05 1




Digital Ocean New York 52 68 411 2.5E-04 1.2E-02 94 2.7E-08 1.9E-05 1

Latency Over the dedicated link, average one-way latency ranged between 7 ms for Google’s Central US data center and 92 ms for Amazon’s Sao Paulo data center. Using linear regression, these latencies fit to a constant 8 ms plus 11 microseconds per mile. The corporate network latencies ranged from 34 ms for Amazon’s Oregon location (likely in The Dalles, OR) to 102 ms for Sao Paulo. These fit to a constant 18 ms plus 11 microseconds per mile. All streams on the corporate network were first routed from Englewood to the corporate internet ingress in San Diego, which explains why the dedicated stream outperformed the corporate stream to Oklahoma by 34 ms, but was only 2 ms faster to Taiwan. In general, these latency measurements were very constant and seemed to represent a nominal latency that deviated only a few times over the five day period.



Maximum sustained latency refers to the largest latency average as measured by DataDefender over a ten minute window. On average, this was 14 ms higher than the nominal latencies for both the corporate and dedicated link. Maximum latency is the largest transit time for a single packet. All the Amazon links and the Google Central US link exhibited a phenomenon where the packet stream was delayed for more than a second. After the delay event, all the pending packers were delivered at once, usually without any loss. The frequency of these events varied from link to link, with Virginia and the Central US experiencing just one such delay over the course of five days, while the Sao Paulo and Frankfurt links saw dozens. These events were perfectly correlated on both the corporate and dedicated networks, pointing to an issue on the public internet or provider cloud.

Ordering DataDefender revealed that the three Google sites and the Amazon Frankfurt location received out-of-order packets, while the other sites did not. Frankfurt had the lowest re-ordering rate (ROR) of 1.40E-7 with a maximum out-of-order time of 30 ms. The Central US location had the highest with a ROR of 5.3E-4, causing more than a third of all the losses recorded on that link. While one of these packets was re-ordered by 976 ms, most were only 1 ms behind the packet they originally preceded. In addition to re-ordering, some links also saw the occasional duplicated packet. On the Central US link, a duplicate packet was received a full nine seconds after the arrival of the original.

Loss Packet loss ratios (PLRs) varied dramatically from link to link. On the dedicated network, the best performing link was to Amazon’s Northern Virginia facility, losing just two packets over the course of five days, and yielding a PLR of 5.4E-8. Unexpectedly, Sao Paulo outperformed New York and Oregon, with a PLR of 4.6E-7 compared to 1.0E-6 and 4.5E-6 respectively. Frankfurt fared the worst of the Amazon centers with a 6.2E-4 PLR. The Google data centers suffered higher loss rates than their same-zone Amazon peers. Google's East Asia facility saw 36% more losses than Amazon's Sydney center, and the connection to Google's Western Europe was 126% worse than the link to Amazon's Frankfurt cloud. Google's Central US location, while geographically closest, was especially lossy with a PLR of 1.4E-3, far worse than any other CONUS link.



Examining the worst PLR measured over a ten-minute period gives intuition into whether loss events were evenly distributed in time or concentrated in temporary periods of poor fidelity. Sao Paulo suffered all 17 of its losses in a single second, and thus its worst case PLR of 3.26E-4 is 709 times worse than the five-day average. As another example, the Frankfurt link lost only half as many packets as the Central US link, but its worst case PLR was greater than that of the central US by a factor of 8.

Figure 1. Comparison of Central US and Frankfurt Loss Characteristics over

Five-Day Period



Corporate versus Dedicated Network On the heavily-used corporate network, loss rates were much higher, measuring at least 1E-4 for all links. Because this network must share bandwidth with many different users, this result is not surprising. By making the simplifying assumption that the dedicated and corporate networks have the same internet ingress location, it is possible to model each data stream as the combination of a network path and an internet path. This yields an over-constrained system with 18 equations and 11 unknowns. A least squares objective is inappropriate, as this weights lossy links more heavily than clean links. Instead, defining the error as the difference between unity and the ratio of the predicted and measured yields an objective that considers differences in orders of magnitude. Numerically solving to minimize this objective yields the results shown in Table 2.

Table 6. Individual Link 5 Day One Mbps Performance

Link Avg

Latency (ms)

Max Sus. Latency

(ms)

Max Latency

(ms) Total PLR

Worst PLR

Max Burst Loss

Total ROR

Worst ROR

Max OOO

Time (ms)

Dedicated Network 6 7 217 2.2E-08 1.1E-05 1 0 0 0

Corporate Network 33 60 309 2.3E-04 9.1E-04 13 2.7E-08 1.3E-05 1

Internet to N. Virginia 16 28 770 3.2E-08 7.7E-06 1 0 0 0

Internet to Sao Paulo 86 115 999 4.4E-07 3.1E-04 13 0 0 0

Internet to Orgeon 15 35 969 4.5E-06 5.6E-04 8 3.6E-08 3.6E-08 2

Internet to Sydney 82 102 961 2.8E-05 1.8E-03 91 2.9E-08 2.9E-08 3

Internet to Frankfurt 65 112 1429 3.4E-04 1.7E-02 125 1.6E-06 5.8E-06 5

Internet to Central US 1 9 1209 1.2E-03 1.3E-02 11 1.6E-04 9.8E-04 966

Internet to W. Europe 56 96 173 1.4E-03 1.2E-02 363 2.0E-05 1.7E-04 524

Internet to East Asia 75 90 75 3.8E-05 3.3E-03 73 1.1E-04 7.9E-04 22

Internet to New York 19 42 29 9.8E-07 1.8E-03 56 0 0 0

As the dedicated network is relatively error free, most of the internet link results are similar to the end-to-end dedicated network results. However, it is important to note the differences between the dedicated and corporate network connections. In this case, the dedicated network had a PLR four orders of magnitude better than the corporate network. Additionally, the corporate network experienced one forty-five minute outage and one



fourteen second outage during the five day period. Both of these losses were removed from the data set as they eclipsed the other loss events.

Data Rate To determine if the 1 Mbps test could be extrapolated to other data rates, a similar test was run a few weeks later at 3 Mbps over the dedicated network and produced the results in Table 3.

Table 7. Three Mbps Test


Latency (ms)

Max Sus. Latency

(ms)

Max Latency

(ms) Total PLR

Worst PLR

Max Burst Loss

Total ROR

Worst ROR

Max OOO Time (ms)

Amazon N. Virginia 23 23 99 0 0 0 0 0 0


Amazon Oregon 21 21 118 3.7E-07 1.0E-04 8 1.5E-07 5.1E-05 30

Amazon Sydney 91 93 1268 4.0E-05 1.3E-03 193 0 0 0

Amazon Frankfurt 75 79 447 6.6E-06 7.1E-05 1 0 0 0





While average latency and re-ordering characteristics are similar to the 1 Mbps results, the packet loss characteristics are quite different. The Frankfurt link at 1 Mbps had a 20 minute period of 8% loss. At 3 Mbps, there was no such period of loss, and the overall PLR was 100 times less. In contrast, at 1 Mbps, Sao Paulo experienced a single loss of 17 consecutive packets. At 3 Mbps, a two minute period of 25% loss resulted in a PLR 400 times worse. However, a 4 hour, 28 Mbps stream to Sao Paulo only lost six packets total. Thus, it seems less that different data rates cause different loss characteristics, but more that links are susceptible to unpredictable and dramatic impairments. When present, these impairments dominate the link results.



Packet Size To test the effect of packet size on link characteristics, DataDefender simultaneously sent three streams to each cloud. While each stream had the same data rate of 1 Mbps, the packet size was set to 500, 1500, and 9000 bytes.

Reducing the packet size to 500 bytes had different effects on different links. For East Asia, Western Europe, and Sydney, the smaller, more frequent packets experienced 40%-80% less loss than the baseline 1500 byte packet. For East Asia and Western Europe, re-ordering was reduced by a factor of 220 and 5, respectively. Sydney's already low re-ordering rate was not improved by the smaller packet size.

In contrast to these links, Virginia, Sao Paulo, Frankfurt, Central US, and New York reacted negatively to the 500 byte packet stream. PLRs for these links ranged from 50% worse for Sao Paulo to 12 times worse for Frankfurt. Virginia, Sao Paulo, Frankfurt, and New York did not see any re-ordered packets with either the 500 or 1500 byte streams. However, the Central US link's already high re-ordering rate of 9E-5 was increased by a factor of 15. Central Oregon showed similar PLRs and re-ordering rates for both the 500 and 1500 byte packet streams. All links showed an increased maximum burst loss with the smaller packet size. On average, links lost 3.1 times more packets consecutively, which makes sense as the packet rate was three times greater.

While reducing the packet size to 500 bytes caused mixed results, a packet size of 9000 was unequivocally bad. The three Google sites did not receive any data at all from these streams. Research confirmed that the Google Compute cloud does not support jumbo frames or IP fragment reassembly. While the Amazon and Digital Ocean VMs received the fragmented 9000 byte packets, the loss characteristics were much worse than the 1500 byte baseline. Oregon was the least impacted with an increase in loss by a factor of 7.7, while Frankfurt experienced an 1828-fold increase in loss ratio. On average, the maximum burst loss was increased by a factor of 2.9. As the 9000 byte packet stream sends packets one-sixth as often, this represents a 17-fold increase in outage time. Surprisingly, Sydney experienced just one-tenth the maximum burst loss seen with the 1500 byte packet.

The different Maximum Transmission Unit (MTU) sizes had a slight effect on average latency. For the 500 byte packet, latency was reduced by an average of 0.7 ms, and increased by 2.5 ms for the 1500 byte packet.



Uplink Test In contrast to the previous tests, which focused on sending data to the cloud, the uplink test measured the feasibility of receiving an IF stream from a cloud-based modulator. Each cloud site was configured to send a 1 Mbps stream to the Englewood facility. Over the course of 56 hours, DataDefender recorded the measurements in Table 4. The three Google locations could only sustain a download rate of 500 Kbps, and so were not included. It is possible further configuration of these VMs would have allowed greater throughput.

Table 8. One Mbps from Cloud 56 Hour Test


Latency (ms)

Max Sus. Latency

(ms)

Max Latency

(ms) Total PLR

Worst PLR

Max Burst Loss

Total ROR

Worst ROR

Max OOO Time (ms)

Amazon N. Virginia 22 26 52 5.3E-05 2.7E-03 139 0 0 0


Amazon Oregon 22 25 1161 6.2E-06 1.0E-03 12 0 0 0

Amazon Sydney 90 93 1318 3.4E-05 2.0E-03 61 1.7E-06 3.8E-05 2

Amazon Frankfurt 75 95 1220 9.6E-05 2.4E-02 1090 0 0 0


While the latency data matches the previous results, the streams coming from the cloud had much higher loss characteristics. In general, these high PLRs were driven by a few brief but extremely lossy periods. For instance, removing the worst three ten minute periods of loss improved the New York PLR by two orders of magnitude.

Interpretation These test results show that loss characteristics vary greatly from link to link. Table 5 shows that three of the Amazon locations received data with a better than 1E-7 PLR for at least 97% of the ten-minute periods considered. On one of these “good” links, loss-sensitive protocols such as UDP and TCP will deliver data most of the time. However, even on these links, sudden one-second delays and unpredictable periods of loss will result in an interruption of data flow. On the other, less reliable links, consistent loss will make UDP unusable and keep TCP from reaching the necessary throughput.



Tools to Overcome Impariments To send data across long-haul, impaired links reliably and with minimal latency, it is necessary to have a strategy for dealing with each of the observed impairments. Further, as the impairments are known to vary from link to link, it is advantageous to have implementation strategies that are flexible such that they can be tuned to each link. For overcoming packet loss, Packet Forward Error Correction (PFEC), retransmission, and redundant paths are discussed. Next, techniques for establishing a re-ordering window to handle packet ordering and duplication are considered. Finally, packing and datagram MTU enforcement are presented as a solution to fragmentation.

Packet Forward Error Correction One way to overcome packet loss is Packet Forward Error Correction (PFEC), laid out in RFC 34531. With PFEC, additional information is included in the packet stream. When the receiver detects a packet loss, it uses this extra information to reconstruct the dropped packet. If too many packets are dropped, the receiver cannot recreate the dropped packets, and data is lost.

While there are a variety of PFEC algorithms, the core idea is to take a set number of data packets, known as an input block, and perform a mathematical operation on them to produce a larger set of packets, or an output block. The input block can be recreated from a subset of

1 Luby, M., Vicisano, L., Gemmell, J., Rizzo, L., Handley, M., and J. Crowcroft, "The Use of Forward Error Correction (FEC) in Reliable Multicast", RFC 3453, December 2002

Table 5. Ten Minute Periods with Less Than 1E-7 Loss for 1 and 3 Mbps Tests

Provider Location Better than 1E-7

Amazon N. Virginia 99.81%

Amazon Sao Paulo 99.25%

Amazon Oregon 97.83%

Digital Ocean New York 90.28%

Amazon Sydney 89.36%

Amazon Frankfurt 88.15%

Google East Asia 38.87%

Google W. Europe 31.99%

Google Centraul US 0.09%



the output block. The number the output block packets required to regenerate the input block can be no smaller than the input block. Thus the difference in size between the output and input blocks, or FEC overhead, is also the maximum recoverable loss. Some FEC algorithms are position agnostic in that they can reconstruct loss as long as the minimum number of required packets is received, regardless of where in the block they are located. In contrast, other FEC algorithms are position dependent and can only repair loss that occurs in a specific pattern, such as consecutive loss.

Because the last packet of a block transmitted may be required to repair the loss of the first packet in the group, the worst case latency of PFEC is the time between the transmission of the first and last packets of the block plus the forward delay of the link. Thus, decreasing the number of packets in a block decreases latency, but may increase overhead. Additionally, a smaller block is more easily wiped out by a packet loss burst.

As all the information necessary for the repair is present in the forward stream, PFEC does not require a back-channel. However, the transmitter and receiver both need enough memory to hold a full block’s worth of packets, and the necessary computational horsepower to run the encoding or decoding algorithm at rate. Position agnostic algorithms tend to be much more computationally expensive than their position dependent counterparts and may require dedicated hardware.

The simple XOR algorithm described in RFC 3453 section 2.1 is implemented by grouping packets into an m x n matrix. For each row, an extra packet parity is created by XORing the packets in the row together. Likewise, an extra parity packet is added for each column. Thus, any row of the matrix can be recovered from n of n+1 packets, and any column can be recovered from m of the m+1 packets. However, utilizing only the column parity packets and not the row parity packets reduces the computational load by half, as only one XOR is executed for each packet. Further, this does not dramatically reduce correction capability, as the maximum correctible burst is still limited to n, the number of columns. The FEC overhead is dependent on the number of rows, and latency is dependent on the number of packets in the matrix. Thus, by adjusting m and n, it is possible to control overhead, burst protection, and latency.



Retransmission The second and most straightforward way to repair packet loss is for the receiver to signal the transmitter to retransmit the lost packet. TCP is the most prevalent retransmission protocol. When a retransmission is in progress, the receiver stops outputting further packets until the retransmission is received. The time required for this repair is equal to the Round-Trip Time (RTT), which is the sum of the backchannel and forward channel transmission delays. If the retransmission request or the retransmission itself is dropped, the repair time will be greater. Thus, the latency incurred by retransmission is on the order of the link delay. As each dropped packet results in a retransmitted packet, the overhead bandwidth added by retransmission is dictated by the packet loss rate.

Retransmission imposes some requirements on the communications system. First, there must be a backchannel from the receiver to the transmitter. Second, the transmitter needs sufficient memory to allow it to retain a copy of the transmitted packets such that a lost packet could be retransmitted. Finally, the receiver must have enough memory to store any packets received after requesting a retransmission but before receiving the retransmission.

There are two strategies employed with retransmission. The first is Guaranteed Delivery, which is implemented within TCP. As its name implies, this strategy guarantees that every data packet sent will be received. This is achieved by requiring the receiver to inform the transmitter of every packet successfully received. Until the transmitter has received this positive acknowledgement for a given packet, it must retain a copy of that packet so that it could be retransmitted if necessary. Because a retransmitted packet could itself be lost, there is no theoretical limit on how long it could take to successfully retransmit a lost packet. To make optimal use of the link bandwidth, the transmitter must continue to send packets while the repair is in progress. These packets must be stored in memory at the source in case they need to be retransmitted. They must also be stored at the receiver so they can be output once the repair has completed. As memory is a finite resource, a packets-in-flight limit is used to bound the maximum number of unacknowledged packets. This limit dictates the memory requirements of both the transmitter and receiver.

In the second strategy, instead of requiring positive acknowledgement for every successfully transmitted packet, the receiver can inform the transmitter only about lost packets. In this strategy, called Best Effort, the transmitter only saves copies of the transmitted packets for a



fixed amount of time. If the receiver requests retransmission on a packet the transmitter no longer has, the packet is lost, and the receiver moves on. In effect, the maximum number of retransmissions of a single packet is now limited. Unlike Guaranteed Delivery, it is possible to lose data with this strategy. For a loss to occur, a packet’s retransmissions must exceed the maximum allowed retransmissions. Thus, the probability of loss uses Formula 1 below.

PLR ^ (Max retransmissions + 1) (1)

Where;

PLR Packet Loss Rate

IRP is useful for latency sensitive applications, where late data is useless and it is better to lose data and move on than to continue trying to repair an expired packet. Additionally, because Best Effort does not rely on positive acknowledgements, it can be used with multiple receivers, making it applicable in multicast environments.

Protocols like TCP employ Congestion Avoidance (CA) algorithms, which detect packet loss and attempt to limit further loss. CA interprets packet loss as a sign of congestion, and reduces throughput, allowing other flows to share the link. Flow rate fairness has been generally useful for internet traffic in the past, but it is currently the topic of some debate2,3. For a real time, IF packet flow, CA reduces throughput below the required bandwidth, and adds massive latencies as source-side buffers back up. Such latencies may be unacceptable for many applications, such as unmanned vehicles.

While CA is deleterious for real-time, high-bandwidth flows, emitting packets as fast as possible is also not helpful, as retransmitted packets may exceed the total capacity of the link and result in a persistently high PLR. Instead, by limiting the stream throughput to a fixed bandwidth, usually 1-5% more than the nominal rate, repair occurs in a timely fashion without overwhelming the link. While this is not a congestion control protocol, it does

2 B. Briscoe, Flow Rate Fairness: Dismantling a Religion, ACM SIGCOMM Computer Communication Review, Vol.37 N.2, April 2007.

3 Floyd, S. and M. Allman, "Comments on the Usefulness of Simple Best-Effort Traffic", RFC 5290, July 2008.



control the rate packets are sent in accordance with best practices4. Further, the impact of such an approach is no worse than a PFEC stream sending 1-5% FEC.

Redundant Paths A final way to overcome packet loss is to send duplicates of every packet down a redundant path. If the receiver detects a packet loss on one link, it looks at the redundant path. If the same packet is dropped by both paths, the packet is not recoverable and is lost. The probability of loss is thus the product of each path’s packet loss rate (PLR): PlrA * PlrB. Because the system may rely on either path, the latency incurred by redundant paths is the maximum latency of the two links. Since each packet is transmitted twice, the total overhead bandwidth is 100%.

Re-ordering Window Out of order packets can be corrected by establishing a re-ordering window. This window utilizes a sequence number in each packet to determine if any sequence gaps have been created by missing packets. When a sequence gap is detected, the output of the window is halted while the window waits a configurable amount of time for the missing packets. If the missing packets arrive, they are emitted, and all the contiguous sequence number packets available are released. If the time expires, the missing packets are considered lost and output resumes, skipping the lost packets. Duplicate packets or missing packets that arrive after the window has moved on are discarded. Because such a window limits the time output is halted waiting for missing packets, the maximum time any received packet is delayed is bounded, regardless of packet rate. The advantage of a configurable, latency-aware re-ordering window is that latency can be incrementally traded for more robust re-ordering correction, which allows for an optimal configuration. Further, this re-ordering window can be used in conjunction with the packet loss protection techniques discussed earlier.

4 Eggert, L. and G. Fairhurst, "Unicast UDP Usage Guidelines for Application Designers", BCP 145, RFC 5405, No-vember 2008.



MTU Enforcement and Datagram Packing Many components of an IP transport network have a Maximum Transmission Unit (MTU). This is the size of the largest packet that will be transmitted. Larger packets will either be dropped outright, or broken into smaller layer 3 packets, called IP fragments. IP fragments are re-assembled by the end-receiver, and if any fragment is missing the entire original layer 4 packet is lost. As seen from the results earlier, sending packets larger than the path MTU increases loss and re-ordering rates at best and results in a total loss at worst. Thus, it is advantageous to limit the maximum layer 4 packet size and manually break larger packets into smaller layer 4 packets.

In addition to MTU enforcement, it is useful to fill each packet to the maximum layer 4 size before release. By aggregating smaller packets into a single larger packet, the packet rate is decreased, which can reduce the probability of loss and re-ordering. Further, by increasing the average packet size, overhead due to constant length headers is reduced. MTU enforcement and datagram packing can be combined with the packet loss and re-ordering techniques described earlier for maximum benefit.

Results of Loss Mitigation Techniques RT Logic’s DataDefender implements the loss mitigation strategies described above. The effectiveness of these techniques was measured by enabling the various packet protection features on the DataDefender and measuring the results.

PFEC Because the Virginia, Oregon, East Asia links displayed relatively low burst losses, they were chosen as candidates for the column only PFEC algorithm described earlier. Additionally, a 1500 byte packed frame and a 5 ms re-ordering window were used for all tests. As before, a 2 Mbps PRBS was modulated with QPSK. This signal was digitized, resulting in a 28 Mbps stream, and sent to a cloud instance of a software modem.

As the Virginia link had not observed any burst loss, a DataDefender was configured with a PFEC matrix configuration of 50 x 2. This configuration can correct a burst loss of two, and adds 2% overhead and 100 packets worth of worst-case latency. Over the course of 24 hours, the software modem maintained lock, receiving 177 Gb with 1651 bits in error for a BER of



9.3E-9. Reviewing the link performance revealed that 3 packets were delivered out of order, and there were two loss events, the first a single packet, and the second a three packet burst. The re-ordering window corrected the out-of-order packets and PFEC corrected the single packet loss. However, because of the chosen configuration, PFEC could only correct two of the packets in the three packet burst, causing the 1651 error bits.

The Oregon link was tested with a PFEC configuration of 25 x 10, which can correct a burst loss of 10 and causes 4% overhead. Over the course of 24 hours, DataDefender’s PFEC corrected 6 out-of-order packets and 11 single packet losses. However, PFEC could not overcome the loss of 99 consecutive packets, and the software modem lost lock.

For the East Asia link, a 13 x 10 configuration was attempted. However, frequent burst losses of 30 or more consecutive packets made this configuration useless. Adjusting to a 7 x 110 matrix yielded better results. Excluding a consecutive burst loss of 1354 packets, PFEC was able to repair 99.57% of the 28,611 packets lost. The unrepaired packets were caused by multiple loss events within the same matrix, which PFEC cannot always correct. Thus, RT Logic’s DataDefender PFEC improved the link PLR from 5.6E-4 to 2.1E-6.

Retransmission IRP is DataDefender’s guaranteed delivery, rate-limited, UDP-based retransmission protocol, and was used instead of TCP for the reasons described earlier. IRP was deployed to send the packetized QPSK signal to the Frankfurt and Sydney sites. The nominal data rate for this stream was 28.3 Mbps, and the rate was limited to 30 Mbps, allowing a maximum overhead of 6%. As with the PFEC tests, a 1500 byte packed frame and a 5 ms re-ordering window were employed for all tests.

Over the course of 24 hours, the Frankfurt link experienced a PLR of 2.7E-6 and a re-ordering rate of 1.8E-5 and included a 20 second period of 3.4% packet loss. Despite this, the DataDefender corrected all errors, and the software modem demodulated 184 Gigabits without any bit errors or lock losses. During the highest period of loss, even though the link RTT time increased to exceed 500ms, data was received by the modem with less than one second of latency.

Using IRP to send the stream to a software modem in Sydney for 60 hours was equally successful. The link has a PLR of 3.6E-5 and a re-ordering rate of 1.0E-8. DataDefender was



able to repair all 21,100 dropped packets, and the modem demodulated 459 Gigabits without error. On one occasion, 1795 packets, or 718 ms worth, were consecutively dropped, but DataDefender repaired these losses while maintaining an end-to-end latency of 920 ms.

Lessons Learned First, it is recommended all data flows be actively monitored by including a unique sequence number in each packet as well as periodic timing information. This allows the detection of packet loss, re-ordering, and changes in end-to-end latency. However, knowing just the PLR, current network latency, and re-ordering rate is often not enough information. It is useful to record the average and maximum consecutive burst loses, the minimum and average distance between loss events, the average and maximum re-ordering time and packet distance, and minimum, average, and maximum network and application latencies among others. Ideally, this set of statistics should be sampled and recorded periodically, allowing observation of changing link conditions with time. Such a full set of measurements is easy to collect using DataDefender, and is invaluable for troubleshooting losses and determining optimal protection strategies.

The second lesson learned is that there is no typical network. Observed average packet loss ratios varied from 1E-3 to lossless. Even for networks with similar average PLRs, different distributions of loss events have dramatic effects on performance. As mentioned earlier, Frankfurt and the Central US links showed similar PLRs, but one was consistently mediocre while the other was characterized by long periods of excellent performance punctuated by brief periods of severe loss. Furthermore, link performance can change over time. As with many things, past performance is not necessarily indicative of future results. Thus, instead of depending on a certain performance characteristic, it is advantageous to leverage protection strategies that can operate over a wide variety of link conditions. Ideally, the response of a protection strategy to degrading link conditions should be linear or better.

Next, when deciding between PFEC and retransmission protocols, it is important to consider mission constraints. As PFEC does not require a backchannel, it is the obvious choice if there is only a forward channel. Even if there is a backchannel, when the RTT is large and burst losses are small, PFEC can repair losses with less latency than a retransmission protocol. However, as seen in the previous test results, on some links it is impossible to configure PFEC to protect against all loss. Thus, PFEC is most useful over high-latency links where



minimizing additional latency is king and some loss is tolerable. In contrast, a retransmission protocol can correct any loss profile by adding some additional latency for repeated retransmissions. As this additional latency is on the same order of magnitude as the link transmission delay, there are many cases where this is acceptable. Further, because only lost packets are retransmitted, the correction overhead matches the PLR. Thus, a retransmission protocol has lower overhead than a PFEC configuration capable of repairing the observed loss.

When considering a virtual cloud provider, there are several factors to consider. First is cost. Because cloud providers allow free uploads, but charge for download bandwidth, cloud downlink demodulation is cheaper than uplink modulation. At the time of this writing, the Digital Ocean was much less expensive than Amazon or Google as shown in Table 6.

Table 6. Monthly Cost Comparison of Cloud Providers for 2 Mbps QPSK

System VM Uplink Downlink Total

Digital Ocean 4GB/2CPU $43 $101 $0 $145

Amazon c4.large $144 $816 $58 $1.019

Google n1-standard-2 $94 $1,361 $97 $1,552

Additionally, Digital Ocean was the easiest to configure. Google and Amazon offer a richer set of features, though none of them were utilized for this demonstration. The connections to the Google data centers displayed lower fidelity than the Amazon and Digital Ocean centers. It is possible that this was the result of a configuration issue and is the subject of further research.



Conclusion While an IP-based digital IF architecture offers many advantages, the use of IP networks for transportation is not without challenges. First, traditional protocols offer little insight into the latency, ordering, and loss characteristics of the link. Without knowing the cause of data loss, it is difficult to know what sort of solution will be the most helpful. Secondly, impairments on public internet and other links can make TCP and UDP unusable. Latencies can temporarily increase by more than a second. Packet re-ordering rates can temporarily exceed 1.0E-3 and deliver packets hundreds of milliseconds later than expected. At times, loss rates may exceed 1.0E-2 and the link can experience short-term outages of several seconds. However, it is possible to overcome these impairments and operate a lossless IP-based architecture within the cloud using the tools provided by RT Logic’s DataDefender.

With this in mind, RT Logic makes the following recommendations:

• Actively monitor all flows for loss, re-ordering, and latency variation

• Use transport protocols that operate over a wide range of link conditions

• Minimize loss and re-ordering by controlling packet size and rate

• Utilize a constant latency re-ordering window to fix out-of-order packets

• Implement PFEC to improve link performance while maintaining minimal latency

• Deploy retransmission for lossless delivery in the face of large burst losses

• Choose cloud providers based on the features and performance required

migrating c2 functions to an ip-based virtual architecture...

Documents