World Wide Packets 

OAM: Operations, Administration and Maintenance

OAM Process Flow

The figure to the right describes the Service Provider process flow when faults appear in the network. This starts with the fault itself and ends after the verification of the repair. Each step needs to be optimized in order to protect both the Service Provider and the Subscriber.

Fault Detection

Fault Detection includes mechanisms to detect faults at the device control-plane or data-plane level. Faults need to be detected quickly enough to minimize Time to Recover (TTR). However, detection needs to be based on an observation window large enough to avoid false fault detections. For example, a control-plane can become non-responsive for a few microseconds while handling a burst of interrupts. As long as the control-plane gets back to a normal state within an acceptable time window, the network element is not experiencing a software failure.
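The trade-off between fast detection and false detections can be illustrated with a minimal Python sketch. It assumes a detector that declares a fault only when the control plane has been silent for longer than an observation window. The class name, the `window_s` parameter and the timestamps are hypothetical and do not describe any product implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WindowedFaultDetector:
    """Declare a fault only after the control plane has been silent
    for longer than window_s seconds (hypothetical parameter)."""
    window_s: float
    last_response_s: Optional[float] = None

    def record_response(self, now_s: float) -> None:
        # The control plane answered a health check at time now_s.
        self.last_response_s = now_s

    def is_faulty(self, now_s: float) -> bool:
        # Without any recorded response there is nothing to compare
        # against, so no fault is declared.
        if self.last_response_s is None:
            return False
        return (now_s - self.last_response_s) > self.window_s


detector = WindowedFaultDetector(window_s=0.5)
detector.record_response(now_s=10.0)
short_stall = detector.is_faulty(now_s=10.2)  # silence shorter than the window
long_stall = detector.is_faulty(now_s=11.0)   # silence longer than the window
```

A larger `window_s` filters out more transient stalls but delays detection of real failures, which is the balance described above.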

OAM handles a wide range of failure scenarios that vary in nature and location, from a defect in software to a backhoe tearing apart a fiber conduit by mistake. To simplify the discussion, three major categories of failures are used:

  • Link Failure
  • Service Transport Failure
  • Service Level Agreement Failure

Link Failure

Link failure represents either the complete failure of a link or the performance of a link degrading below an acceptable level (e.g. excessive bit errors or CRC errors). The causes include an optical transceiver failure at either end of the link, dust or other impurities getting into the connector, a cut of the fiber between the elements, or the failure of the element at the other end of the link.

World Wide Packets’ LightningEdge solution has been optimized to enable network reconvergence below 50 milliseconds. These enhancements allow Ethernet service delivery networks based on LightningEdge products to support critical, time-sensitive applications with the same service level agreements and guarantees of SONET/SDH optical rings. This level of performance is achieved, in part, by providing high-priority interrupt-based failure detection, shielding services from link level failures.

Service Transport Failure

Ethernet services can be transported natively, using VLANs (IEEE 802.1Q) or stacked VLANs (802.1ad), or using MPLS tunnels and MPLS virtual circuits (VCs). Each one of these transport mechanisms can fail due to software failure, memory corruption, or simply due to misconfiguration.

World Wide Packets, through the early adoption of IEEE 802.1ag Connectivity Fault Management (discussed in a later section), provides VLAN-based service transport OAM. The combination of Label Switched Path (LSP) ping, LSP traceroute, Virtual Circuit Connection Verification (VCCV), Bi-directional Forwarding Detection (BFD) and Fast ReRoute (FRR) provides comprehensive MPLS-based service transport OAM.

World Wide Packets’ True Carrier Ethernet offering is the only access/metro edge solution that enables service providers to deploy any mix of Ethernet and MPLS-based service transports over a common infrastructure. This allows service providers to easily migrate from Ethernet to MPLS access deployments and extend the services and capabilities of an MPLS core network directly to subscribers with no additional capital investment required.

Service Level Agreement Failure

The characteristics of the services provided by carriers to their subscribers are described in the SLA. Adherence to the SLA can be measured using one or more of the following metrics:

  • Frame Delay – delay experienced by the traffic carried by the service
  • Frame Delay Variation – variation in that delay
  • Frame Loss – percentage of frames passed through the service that were dropped by the network
  • Service Availability – percentage of time where the service is available to the subscriber
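As an illustration of the four metrics listed above, the following Python sketch computes them from raw measurements. The function names and sample values are hypothetical, and the frame delay variation shown here (mean absolute difference between consecutive delays) is only one simple definition, not necessarily the one adopted by the MEF or the ITU-T.

```python
from statistics import mean
from typing import List


def mean_frame_delay_ms(delays_ms: List[float]) -> float:
    # Average one-way delay of the sampled frames.
    return mean(delays_ms)


def frame_delay_variation_ms(delays_ms: List[float]) -> float:
    # Mean absolute difference between consecutive delay samples
    # (an assumed, simplified definition).
    diffs = [abs(b - a) for a, b in zip(delays_ms, delays_ms[1:])]
    return mean(diffs) if diffs else 0.0


def frame_loss_pct(sent: int, received: int) -> float:
    # Percentage of frames sent into the service that never arrived.
    if sent == 0:
        return 0.0
    return 100.0 * (sent - received) / sent


def availability_pct(available_s: float, total_s: float) -> float:
    # Percentage of the measurement interval during which the
    # service was usable by the subscriber.
    return 100.0 * available_s / total_s


delays = [2.0, 4.0, 3.0, 5.0]           # hypothetical samples, in ms
delay = mean_frame_delay_ms(delays)
variation = frame_delay_variation_ms(delays)
loss = frame_loss_pct(sent=1000, received=990)
availability = availability_pct(available_s=3582, total_s=3600)
```

Each result can then be compared against the corresponding threshold written into the SLA.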

Monitoring these SLA parameters provides indications of fault or performance issues. Performance Management of Ethernet services is being defined by the Metro Ethernet Forum (MEF) and the ITU-T. This white paper focuses on the Fault Management aspect of SLA failures.

SLA failures can be caused by link failures, such as a failing optical transceiver resulting in partial packet loss, or by service transport failures, such as a software failure leading to incorrect forwarding tables.

World Wide Packets’ LightningEdge solution offers intelligent classification and queue servicing, which minimizes frame delay and frame delay variation. In addition, World Wide Packets provides a unique set of self-healing techniques at the link layer and service transport layer, to minimize SLA failures relating to frame loss and service availability.

Fault Notification

Once the fault is detected by the network element layer, the fault needs to be conveyed to the entities that will work towards repairing it. The repair can require human involvement, like the manual replacement of a faulty transceiver, or can be automated, like a Rapid Spanning Tree Protocol (RSTP) reconvergence after a link failure. In any case, Fault Notification needs to be:

  • Responsive – the time saved will protect revenue and may avoid penalties.
  • Meaningful – a mere ‘link down’ SNMP trap sent when an optical transceiver fails is insufficient. A trap containing information regarding the faulty transceiver and the reason for the failure reduces troubleshooting cost.
  • Concise – sending multiple traps with redundant failure information will obfuscate the real cause of the failure and slow down the fault isolation step.
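The "Concise" requirement can be sketched in Python as a hold-down filter that forwards the first notification for a given source and cause and suppresses repeats that arrive shortly afterwards. The `Notification` type, the `holddown_s` parameter and the sample events are assumptions made for illustration. They do not describe the SNMP trap format or any vendor implementation.

```python
from typing import Dict, List, NamedTuple, Tuple


class Notification(NamedTuple):
    time_s: float   # time the notification was raised
    source: str     # component reporting the fault
    cause: str      # reason for the failure


def suppress_duplicates(notifications: List[Notification],
                        holddown_s: float) -> List[Notification]:
    """Forward a notification only if the same (source, cause) pair
    has not been forwarded within the last holddown_s seconds."""
    last_sent: Dict[Tuple[str, str], float] = {}
    forwarded: List[Notification] = []
    for n in sorted(notifications, key=lambda item: item.time_s):
        key = (n.source, n.cause)
        previous = last_sent.get(key)
        if previous is None or n.time_s - previous >= holddown_s:
            forwarded.append(n)
            last_sent[key] = n.time_s
    return forwarded


raw = [
    Notification(0.0, "port-3", "transceiver rx power low"),
    Notification(0.2, "port-3", "transceiver rx power low"),
    Notification(0.4, "port-3", "transceiver rx power low"),
    Notification(0.5, "port-7", "link down"),
    Notification(12.0, "port-3", "transceiver rx power low"),
]
concise = suppress_duplicates(raw, holddown_s=5.0)
```

In this sample, five raw notifications are reduced to three: one per distinct fault, plus the recurrence on `port-3` after the hold-down has expired.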

World Wide Packets provides a comprehensive solution for optimum fault notification, including high-priority generation of SNMP traps with content focused on the source of the failure. In addition, World Wide Packets’ Network Management solution offers alarm correlation capabilities that enable network operators to associate alarms and more quickly isolate the cause of the fault.

Fault Verification

After notification, the Network Operation Center (NOC) engineer should verify the fault and determine whether the condition persists. By the time the link failure indication is received, the Ethernet network will have already reconverged. Under most conditions, failover and restoration with World Wide Packets’ True Carrier Ethernet devices take less than 50 ms. Fault Verification using on-demand OAM techniques eliminates false failure indications. Skipping this step could lead the network operator to try to isolate a failure that does not exist.
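A minimal Python sketch of on-demand verification follows. It assumes the operator has some probe available (for example a ping or loopback test) that returns True when connectivity is confirmed. The `verify_fault` function, the `attempts` parameter and the scripted probes are hypothetical and stand in for a real OAM tool.

```python
from typing import Callable, Iterable


def verify_fault(probe: Callable[[], bool], attempts: int = 3) -> bool:
    """Return True only if every on-demand probe fails, meaning the
    reported fault persists. A single successful probe is treated as
    evidence that the indication was transient or false."""
    for _ in range(attempts):
        if probe():
            return False
    return True


def scripted_probe(results: Iterable[bool]) -> Callable[[], bool]:
    # Test helper: replays a fixed sequence of probe outcomes.
    iterator = iter(results)
    return lambda: next(iterator)


transient = verify_fault(scripted_probe([False, True, True]))
persistent = verify_fault(scripted_probe([False, False, False]))
```

Here `transient` is False because the second probe succeeds, while `persistent` is True because all three probes fail.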

Fault Isolation

Fault isolation consists of determining the exact source, location and nature of the fault. This includes the specific network element(s) and network layer(s) experiencing the fault. A failure at a low level may impact higher levels and lead to additional failures. For example, a link failure can lead to broken MPLS tunnel connectivity also impacting all of the MPLS VCs it carries.
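The way a low-level failure propagates upward can be sketched in Python by walking a map of which entity carries which. The topology, the entity names and the `impacted_by` function are hypothetical and serve only to illustrate the link, tunnel and VC example above.

```python
from typing import Dict, List, Set

# Hypothetical topology: each key directly carries the listed entities.
CARRIES: Dict[str, List[str]] = {
    "link-A-B": ["tunnel-1"],
    "tunnel-1": ["vc-100", "vc-200"],
    "link-B-C": ["tunnel-2"],
    "tunnel-2": ["vc-300"],
}


def impacted_by(failed: str, carries: Dict[str, List[str]]) -> Set[str]:
    """Return every higher-level entity that depends, directly or
    indirectly, on the failed entity."""
    impacted: Set[str] = set()
    stack = [failed]
    while stack:
        current = stack.pop()
        for child in carries.get(current, []):
            if child not in impacted:
                impacted.add(child)
                stack.append(child)
    return impacted


affected = impacted_by("link-A-B", CARRIES)
```

For the failed link `link-A-B`, the affected set contains the tunnel it carries and both VCs inside that tunnel, while `vc-300` is untouched.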

World Wide Packets offers a complete on-demand OAM solution enabling the network operator to conduct layer-by-layer fault isolation (link, service transport and service level agreement layers). The preceding figure shows the extent of the various OAM mechanisms useful for isolating faults.

Figure - Major Network Fault Categories

Notification of a low-level failure can be followed or surrounded by higher-level failure notifications. This makes fault isolation more difficult, more time consuming and more costly. Features such as alarm correlation provided by World Wide Packets’ management EMS help minimize the cost of isolating a fault by decreasing the number of fault notification messages.
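One simple form of alarm correlation can be sketched in Python: an alarm is suppressed when something it depends on is also alarmed, so that only probable root causes remain. The dependency map, the entity names and the `root_causes` function are assumptions for illustration and do not describe how any particular management system performs correlation.

```python
from typing import Dict, List, Set

# Hypothetical dependencies: each key is carried by the listed value.
# The map is assumed to contain no cycles.
DEPENDS_ON: Dict[str, str] = {
    "tunnel-1": "link-A-B",
    "vc-100": "tunnel-1",
    "vc-200": "tunnel-1",
    "vc-300": "tunnel-2",
}


def root_causes(alarmed: Set[str], depends_on: Dict[str, str]) -> List[str]:
    """Keep only alarms that are not explained by an alarm on a
    lower-level entity they depend on."""
    roots: List[str] = []
    for entity in sorted(alarmed):
        parent = depends_on.get(entity)
        explained = False
        while parent is not None:
            if parent in alarmed:
                explained = True
                break
            parent = depends_on.get(parent)
        if not explained:
            roots.append(entity)
    return roots


alarms = {"vc-100", "vc-200", "tunnel-1", "link-A-B", "vc-300"}
roots = root_causes(alarms, DEPENDS_ON)
```

Five alarms are reduced to two: the link failure that explains the tunnel and VC alarms, and the unrelated alarm on `vc-300`.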

Repair

Depending on the efficiency of the OAM process, repair and preventative maintenance can occur at different stages:

  • After the fault impacts the service – Time-to-repair is most critical, as the network operator needs to quickly remedy the problem to restore the service. World Wide Packets designed the True Carrier Ethernet solution with modular network elements. This enables the network operator to change only the failed component, saving time and eliminating impacts to other services. For example, the failure of a hot-swappable transceiver does not require the replacement and re-cabling of the entire network element, eliminating risk of error in the process.
  • Before the fault impacts the service – Redundancy enables proactive maintenance, significantly reducing service outage times. World Wide Packets’ modular solution, coupled with redundant links, control modules, power supplies and fans, allows non-invasive repair of network components, protecting the services that they carry. For example, the failure of a redundant control module will only lead to a non-invasive switchover to the standby module.
  • Before the fault leads to an element or network failure, i.e. performance degradation scenario – By continuously monitoring key metrics relating to element and network health, service providers can preemptively schedule maintenance. This activity uses fewer resources than a reactive approach.
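The third case, acting on performance degradation before a failure occurs, can be sketched in Python as a warning threshold applied to a moving average of a health metric. The metric (optical receive power in dBm), the threshold value and the `needs_maintenance` function are hypothetical examples, not values taken from any product.

```python
from statistics import mean
from typing import List


def needs_maintenance(samples: List[float],
                      warn_threshold: float,
                      window: int = 3) -> bool:
    """Return True when the average of the most recent `window`
    samples has dropped below the warning threshold."""
    if len(samples) < window:
        return False  # not enough history to judge a trend
    return mean(samples[-window:]) < warn_threshold


# Hypothetical receive-power readings in dBm, oldest first.
healthy = [-20.1, -20.3, -20.2, -20.4]
degrading = [-20.1, -22.5, -24.8, -25.6, -26.1]

WARN_DBM = -25.0  # assumed warning level, set above the failure level
healthy_flag = needs_maintenance(healthy, WARN_DBM)
degrading_flag = needs_maintenance(degrading, WARN_DBM)
```

Averaging over a window keeps a single noisy reading from triggering a maintenance ticket, while the warning level leaves time to schedule work before the link actually fails.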

Repair Verification

After a remedy is enacted, the same on-demand OAM mechanisms used during Fault Verification confirm that the fault no longer exists. For example, an IP ping can be used both to verify an IP connectivity fault on the control plane and to confirm that connectivity has been restored.
