Software’s Role in the Northeast Blackout of 2003

On August 14, 2003, a massive blackout left an estimated 50 million people without power and caused roughly $6 billion in losses (Minkel, 2008). This blackout, known as the Northeast Blackout of 2003, resulted from a series of unforeseen events, including an alarm system failure that cascaded into server failures and other problems. Power went out in parts of “Ohio, Michigan, New York, Pennsylvania, New Jersey, Connecticut, Massachusetts, Vermont, and the Canadian provinces of Ontario and Québec” during the outage (NERC Steering Group, 2004). While it is difficult to say whether the failed alarm system was the single largest cause, it was certainly among the most significant contributors to the cascading power shutdowns.

How the Blackout Grew

Many faults occurred that day that eventually led to the blackout. The first event that directly contributed to the power loss came when FirstEnergy Corporation’s Eastlake, Ohio power plant shut down after one of its power lines came into contact with a tree. From there, more and more power failures began to occur (History.com Staff, 2009). Typically, events like this can be isolated, but with the alarm system not running correctly, operators were left unaware of the magnitude of the problem.

While the electrical problems caused by the failing power line set the stage, it was a software bug in General Electric’s XA21 system that allowed the situation to spin out of control. In day-to-day operations, FirstEnergy employees relied heavily on on-screen alerts and audible alarms to keep the power systems running correctly (Poulsen, 2004). With the alarm system not working properly, operators had little idea what they were dealing with, despite tipoffs from multiple phone calls as other plants showed signs of failure (NERC Steering Group, 2004). The failure destroyed operators’ ability to make the decisions that could have prevented the blackout from spreading.

Software’s Role

The software bug involved in the alarm system failure stemmed from what is known as a race condition in the XA21 system. A race condition occurs when multiple systems, or multiple parts of a system, try to access the same piece of data at the same time and “race” to update it (MSDN, 2017). In this case, when the race condition occurred, the system holding the alarm data locked up and prevented any further processing. Once the alarm system stalled, operators were not only unable to receive vital information, but data also began to pile up without being cleared, slowing system performance (Jesdanun, 2004). As data continued to queue up and overflow, remote terminals and a pair of server nodes ultimately failed, leading to even more problems.
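
The XA21 source itself is not public, so the following sketch is only a hypothetical illustration, in Python, of the kind of race condition and resulting stall described above: two threads take two locks in opposite orders, each ends up waiting on the other, the alarm pipeline stops making progress, and new events simply accumulate in a queue. None of the names or details below come from the actual GE code.

    # Hypothetical sketch of a race condition stalling an alarm pipeline.
    # This mirrors the failure pattern described above, not the XA21 itself.
    import threading
    import time
    from queue import Queue

    event_lock = threading.Lock()   # guards incoming event state
    alarm_lock = threading.Lock()   # guards the operator-facing alarm state
    pending_alarms = Queue()        # events waiting to be shown to operators

    def ingest_event(event):
        # Path A: event_lock first, then alarm_lock.
        with event_lock:
            time.sleep(0.1)         # widen the unlucky timing window
            with alarm_lock:
                pending_alarms.put(event)

    def acknowledge_alarm():
        # Path B: alarm_lock first, then event_lock -- the opposite order.
        with alarm_lock:
            time.sleep(0.1)
            with event_lock:
                if not pending_alarms.empty():
                    pending_alarms.get()

    t1 = threading.Thread(target=ingest_event, args=("line trip",), daemon=True)
    t2 = threading.Thread(target=acknowledge_alarm, daemon=True)
    t1.start(); t2.start()
    t1.join(timeout=2); t2.join(timeout=2)

    # If both threads reach the sleep at the same time, each holds the lock the
    # other needs. Neither finishes, no alarm is displayed, and every new event
    # would pile up in pending_alarms much as the backlog did in the EMS.
    print("stalled" if (t1.is_alive() or t2.is_alive()) else "completed normally")

The unlucky timing is the crucial point: the same code can run cleanly for years, as the XA21 apparently did, and only lock up when two operations happen to interleave in just the wrong way.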

In the FirstEnergy Energy Management System, one server was allocated as a backup in case any of the servers running various applications failed. When the primary server handling much of the alerting and failure-logging work went down, a copy of the running software and its current state was transferred to a second server. Unfortunately, the stalled condition was carried over in this transfer, and the second server failed 13 minutes after the first. According to the post-event investigation, having both servers down slowed the refresh rate of screens carrying valuable information from the usual one to three seconds per screen to as long as 59 seconds (NERC Steering Group, 2004). At that pace, operators were receiving information incredibly slowly, considering they often had to step through many screens to reach the data they needed.
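
The NERC report describes this failover only at a high level, so the short sketch below is a hypothetical model of why transferring state did not help: if the backup restores a snapshot of the primary’s application state, and that state already contains the wedged alarm process and its backlog, the backup simply inherits the same stall instead of starting clean. The class and method names are illustrative assumptions, not the actual EMS interfaces.

    # Hypothetical sketch: state-transfer failover propagating a stalled process.
    class AlarmServer:
        def __init__(self, name):
            self.name = name
            self.stalled = False      # set when the alarm process deadlocks
            self.backlog = []         # events queued but never processed

        def process(self, event):
            if self.stalled:
                self.backlog.append(event)   # nothing is drained once stalled
                return
            print(f"{self.name}: alarm raised for {event}")

        def snapshot(self):
            # Failover copies the full application state, stall included.
            return {"stalled": self.stalled, "backlog": list(self.backlog)}

        def restore(self, state):
            self.stalled = state["stalled"]
            self.backlog = list(state["backlog"])

    primary = AlarmServer("primary")
    backup = AlarmServer("backup")

    primary.process("line trip 1")      # healthy: the alarm reaches operators
    primary.stalled = True              # the race condition wedges the primary
    primary.process("line trip 2")      # silently queued, never displayed

    backup.restore(primary.snapshot())  # failover carries the bad state across
    backup.process("line trip 3")       # the backup is stalled too
    print(f"backup stalled: {backup.stalled}, backlog size: {len(backup.backlog)}")

A failover designed to restart the alarm application from a clean state, rather than resume the transferred one, would presumably have broken this chain, which is consistent with the fact that a cold reboot of the XA21 eventually cleared the problem.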

With the critical decision-guiding alarm system down and other systems unbearably slow, it became almost impossible for technicians to make the changes that could have protected the region-wide network that eventually fell. The failed alarms, combined with operators’ unawareness of the failure, poor communications, a lack of proper backup procedures, and various other issues, eventually led to the cascading power failures that left 50 million people without power.

Cost of the Blackout

Shortly after the blackout, Anderson Economic Group published a report on its cost, concluding that the event caused an estimated $6.4 billion in losses. Of that amount, an estimated $4.2 billion was income lost by workers and investors because of circumstances caused by the power outage. Another $380 million to $940 million was estimated lost to spoilage and waste, primarily because refrigeration was without power (Anderson & Geckil, 2003). Fortunately, property damage and theft were not widespread during this disaster and did not add significant costs, unlike during the similar power outage of 1977 (Barron, 2003).

Although the primary costs of the blackout were monetary and in lost productivity, the event had other consequences. Most notably, later investigations concluded that the power outage contributed to the deaths of at least 11 people (Minkel, 2008). According to a New York Times article published the day after the incident, “Emergency Rooms were flooded with patients with heat and heart ailments,” and traffic light outages led to pedestrians being struck by vehicles (Barron, 2003). Presumably, situations like these were included in the estimate of blackout-related deaths.

Three Primary Causes

With this picture, it becomes apparent that software plays a significant role in keeping catastrophes at bay but can also help cause them when it does not work properly. Software alone, however, did not cause the event. The report published after the investigation by the North American Electric Reliability Council (NERC) attributed the failure to three causes: a lack of situational awareness stemming from the software bug, a failure to manage vegetation, and an inability to provide adequate diagnostic support (NERC Steering Group, 2004). In short, NERC found that if any one of these problems had existed alone, a large-scale outage would have been far less likely. All three were present, however, and together they created a perfect storm.

With vegetation in the way of power lines, failures are more likely to occur. If a line becomes overloaded, which can happen when other lines trip, it heats up and can sag low enough to touch trees that have not been properly trimmed. When that happens, the line touching the tree also fails, concentrating power on an even smaller number of lines. On August 14, three separate tree-related failures occurred within 30 minutes. NERC concluded that if the trees had been properly managed, these failures would have been unlikely. Other circumstances, however, such as the failure of a nuclear plant near the Eastlake plant, could have created conditions similar to those the untrimmed trees created (NERC Steering Group, 2004).

Regarding the failure to provide adequate diagnostic support, the main blame falls on the Midwest Independent Transmission System Operator (MISO). NERC points out that MISO’s monitoring system was not providing real-time information to its operators when the original monitoring systems went down. Instead, MISO was relying on its Flowgate Monitoring Tool, which can lag far behind real time and was never intended for that purpose. As a result, MISO and the organizations that relied on it for information were unable to act on current observations (NERC Steering Group, 2004).

NERC’s final cause named the failed alarm system as an integral factor in the blackout. Because this problem led to so many others, it had a considerable impact on operators’ ability to mitigate power failures before they caused further trouble. The software bug, found in General Electric’s XA21 Energy Management System, had not been accounted for, so operators were unaware that anything was wrong with the alarms for more than an hour. Workers believed that systems were operating correctly and did not know to look for information from another source (NERC Steering Group, 2004). The failed alarm system also compounded the problem by bringing two servers down with it and slowing processing speeds. Ultimately, this failure prevented operators from making the vital decisions needed to correct the faults that grew into the enormous blackout.

Who Takes the Responsibility?

As NERC pointed out, the blackout was unlikely to have occurred without all three of its listed causes, so the failed alarm system stemming from the software bug cannot take the entire blame. While the software on General Electric’s system did fail to work correctly, it had run for years without any significant issues, and the conditions that exposed the bug would have been difficult to predict. In fact, it took General Electric employees weeks of searching through the vast code base to find it (Poulsen, 2004). It is also worth noting that the alarm system could have been restored with a cold reboot of the XA21, but that action was not taken until the following morning, even though it was discussed when the alarm problem was first discovered (NERC Steering Group, 2004).

Regardless of how likely this particular event was, it is imperative for developers to inspect their software meticulously for potential failure points. Software may not have been the only failure that day, but had the bug not existed, software could have been part of the solution: if the alarm system had worked as designed, it is highly unlikely that the outage would have reached so far. As the blackout shows, even a bug that rarely surfaces can cause billions of dollars in damage or other catastrophic effects. So while inadequate operations and poor vegetation management played their roles, the software bug should take the bulk of the blame, and software developers and managers must work together to avoid such adverse outcomes.

While General Electric’s developers ultimately created the bug that allowed the blackout to spread so far, the energy companies that used their systems also had work to do. NERC’s report makes clear that many important guidelines and procedures simply did not exist when the outage occurred. No system was in place to verify that the alarm system was working, and even if one had been, there were no formal procedures telling operators what to do with that information. When the alarm system failed, no one knew how to fix it until hours later, by which time the power failures had spun out of control. To address these problems, NERC included many recommendations for power companies in its report; quite a few of them call for more procedures covering failures and more training so that operators know what to do (NERC Steering Group, 2004).
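
NERC’s report stops at the level of recommendations, so the sketch below is just one generic, hypothetical way a “is the alarm system still working?” check could look: the alarm process updates a heartbeat every time it handles an event, and an independent watchdog alerts operators if the heartbeat goes stale. The timeout value and names are assumptions for illustration, not anything FirstEnergy or GE actually deployed.

    # Hypothetical watchdog sketch: notice a silent, stalled alarm system quickly.
    import time

    HEARTBEAT_TIMEOUT = 30  # seconds without progress before escalating (illustrative)

    class AlarmHeartbeat:
        def __init__(self):
            self.last_beat = time.monotonic()

        def beat(self):
            # Called by the alarm process each time it finishes handling an event.
            self.last_beat = time.monotonic()

        def is_stale(self):
            return time.monotonic() - self.last_beat > HEARTBEAT_TIMEOUT

    def watchdog(heartbeat, notify_operators):
        # Run periodically from a separate process so a wedged alarm system is
        # noticed within seconds rather than discovered hours later by phone.
        if heartbeat.is_stale():
            notify_operators("Alarm processing has stalled: switch to backup "
                             "procedures and verify recent events manually.")

    # Example use:
    hb = AlarmHeartbeat()
    watchdog(hb, notify_operators=print)    # fresh heartbeat: no message
    hb.last_beat -= HEARTBEAT_TIMEOUT + 1   # simulate a silently stalled alarm system
    watchdog(hb, notify_operators=print)    # now operators are told something is wrong

Even a check this simple addresses the core gap NERC identified: operators spent more than an hour believing the alarms were healthy because nothing told them otherwise.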

Preventing Future Problems

Since the investigative report was released, many changes have been made to help avoid a similar power outage. FirstEnergy built a transmission control center in Ohio to keep watch over the extensive power network it controls. Vegetation management programs have been strengthened by many power companies, following NERC’s recommendations. Cybersecurity measures have been put in place to protect the grid against intentional attacks, after the XA21 failure showed how much damage a single software fault could do, and patches are applied to Energy Management Systems like the XA21 more frequently. Although these steps reduce the risk of outages, the NextGen Energy Council has published a report stating that a blackout even worse than 2003’s is possible (Walton, 2016). To avoid that possibility, it is essential that software developers keep bugs like the race condition that caused the alarm system failure out of their code.

The Northeast Blackout of 2003, while economically devastating, created a learning opportunity. The United States and Canada have seen what can happen when proper care is not taken in software development. Ethically, developers have a duty to their companies, their countries, and their peers to write safe code for electrical systems, and operators and electric companies have a duty to put proper procedures and infrastructure in place. With the right precautions, a similar blackout is far more likely to be avoided in the future.

 

References:

Anderson, P. L., & Geckil, I. K. (2003, August 19). Northeast Blackout Likely to Reduce US Earnings by $6.4 Billion. Retrieved January 25, 2018, from http://www.andersoneconomicgroup.com/Portals/0/upload/Doc544.pdf

Barron, J. (2003, August 15). THE BLACKOUT OF 2003: The Overview; POWER SURGE BLACKS OUT NORTHEAST, HITTING CITIES IN 8 STATES AND CANADA; MIDDAY SHUTDOWNS DISRUPT MILLIONS. Retrieved January 25, 2018, from http://www.nytimes.com/2003/08/15/nyregion/blackout-2003-overview-power-surge-blacks-northeast-hitting-cities-8-states.html

History.com Staff (2009, August 14). Blackout hits Northeast United States. Retrieved January 25, 2018, from http://www.history.com/this-day-in-history/blackout-hits-northeast-united-states

Jesdanun, A. (2004, February 12). GE Energy acknowledges blackout bug. Retrieved January 25, 2018, from https://www.securityfocus.com/news/8032

Microsoft Developer Network. (2017, January 7). Description of race conditions and deadlocks. Retrieved January 25, 2018, from https://support.microsoft.com/en-us/help/317723/description-of-race-conditions-and-deadlocks

Minkel, J. (2008, August 13). The 2003 Northeast Blackout–Five Years Later. Retrieved January 25, 2018, from https://www.scientificamerican.com/article/2003-blackout-five-years-later/

NERC Steering Group (2004, July 13). Technical Analysis of the August 14, 2003, Blackout: What Happened, Why, and What Did We Learn?. Retrieved January 25, 2018, from http://www.nerc.com/docs/docs/blackout/NERC_Final_Blackout_Report_07_13_04.pdf

Poulsen, K. (2004, February 11). Software Bug Contributed to Blackout. Retrieved January 25, 2018, from https://www.securityfocus.com/news/8016

Walton, R. (2016, August 23). 13 Years After: The Northeast Blackout of 2003 Changed Grid Industry, Still Causes Fear for Future. Retrieved January 25, 2018, from http://www.elp.com/Electric-Light-Power-Newsletter/articles/2016/08/13-years-after-the-northeast-black-of-2003-changed-grid-industry-still-causes-fear-for-future.html