Saturday, September 12, 2009

Automation

I did not know it then, but 25 years ago I began a career path in automation. I had worked for the local cable TV company for several years and had become very good at resolving technical issues with the cable TV system. I had also become interested in programming PCs, and I quickly found that I could combine the two to make my job easier. My first few projects were text-based utilities that would connect to a piece of equipment, check something, determine whether the result was okay, and if not, print the result. A technician (me) still needed to check the printer and investigate anything that had printed out.

Then came Microsoft Windows and Visual Basic. That enabled much prettier and more sophisticated software development. My first major application communicated with a couple hundred smart devices distributed throughout the cable TV network. These devices would report back their status and a few basic measurements. This was information that would normally be retrieved by driving a truck to the location, climbing a pole with test gear, opening up a weathertight housing, making the measurements and returning to the truck.

The communications application provided by the hardware vendor was text-based and very difficult to use when troubleshooting the network. My version was much prettier. It started with a map window that depicted the location of each smart device. A hierarchy was established so that the system knew which device fed which. With this hierarchy set, I could draw the paths to each device and, in a series of failed devices, determine the most significant one (the root cause): the failed device farthest upstream.
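The root-cause idea can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not the original Visual Basic code: the hierarchy is modeled as a hypothetical `parent` mapping from each device to the device that feeds it, and the device names are made up.

```python
from typing import Dict, List, Optional, Set

# Hypothetical hierarchy: each device maps to the upstream device that feeds it.
# The top of the network has no parent (None).
parent: Dict[str, Optional[str]] = {
    "headend": None,
    "amp1": "headend",
    "amp2": "amp1",
    "amp3": "amp2",
    "amp4": "amp1",
}


def root_causes(parent: Dict[str, Optional[str]], failed: Set[str]) -> List[str]:
    """Return the failed devices whose upstream device is not failed.

    A failed device fed by another failed device is treated as a symptom;
    the first failed device in the series is the likely root cause.
    """
    return sorted(d for d in failed if parent.get(d) not in failed)


# amp3 is fed by amp2, so only amp2 is reported.
print(root_causes(parent, {"amp2", "amp3"}))
```

A device with no failed device above it is reported. Everything downstream of it is assumed to be failing as a consequence.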

So my first automation system consisted of a map with devices color-coded by their current status, placed at the proper location and connected in the proper order. This enabled alarming when certain situations were identified. I started to look for a series of two or more devices in a failure condition, as this would indicate that a branch of the network was not working. Audible messages were passed to the technical dispatchers to catch their attention and inform them of what had been identified as a significant network event.
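The "two or more devices in series" rule might look something like the sketch below. It is written in Python rather than the original Visual Basic, and the `parent` mapping, the device names and the `threshold` parameter are all assumptions of mine for illustration. It also assumes the hierarchy is a tree with no loops.

```python
from typing import Dict, List, Optional, Set

# Hypothetical hierarchy: device -> upstream device that feeds it.
parent: Dict[str, Optional[str]] = {
    "headend": None,
    "amp1": "headend",
    "amp2": "amp1",
    "amp3": "amp2",
    "amp4": "amp1",
}


def failed_run_length(device: Optional[str],
                      parent: Dict[str, Optional[str]],
                      failed: Set[str]) -> int:
    """Count consecutive failed devices, starting at `device` and walking upstream."""
    count = 0
    while device is not None and device in failed:
        count += 1
        device = parent.get(device)
    return count


def branch_alarms(parent: Dict[str, Optional[str]],
                  failed: Set[str],
                  threshold: int = 2) -> List[str]:
    """Return the top device of every failed series at least `threshold` long."""
    heads = set()
    for device in failed:
        if failed_run_length(device, parent, failed) >= threshold:
            # Walk upstream to the first failed device in this series.
            top = device
            while parent.get(top) in failed:
                top = parent[top]
            heads.add(top)
    return sorted(heads)


# amp2 and amp3 are in series, so the branch starting at amp2 raises an alarm.
print(branch_alarms(parent, {"amp2", "amp3"}))
```

Two failed devices on separate branches do not trigger the alarm, since neither feeds the other.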

The system could detect problems but could not resolve them. A technician still needed to roll to the location and begin troubleshooting. The process was much easier, though, since they already knew which devices were not functioning and which were okay. This avoided much of the isolation time needed in troubleshooting. Technicians could roll straight to the problem, or at least get very close.

I did not realize it then, but this project was foundational to future career assignments. Flash forward 20 years. The alarming systems used by telecommunications companies are very mature. Tens of thousands of alarms or conditions are collected each day from thousands of network elements. Some alarms indicate a significant network failure, while others are merely informational. Collecting and displaying this information is handled well, but there are far too few people to review the volume of alarms generated. What has never been done well is creating systematic reactions to each alarm that avoid the need for human action whenever possible. That's what I now do.

Our first application listens for a specific alarm that indicates an electronic card at a cellular telephone site has failed. The failure results in dropped calls or in customers being unable to initiate a call. There are thousands of cell sites in the network and many of these cards at each site. It is really a problem begging for a solution.

The application initiates an automated routine that first checks to ensure the card is really out of service. If so, it attempts to reset (think reboot) the card. If that is successful, everything is great, so we count the success and move on. If the card cannot be reset, a ticket is created and dispatched to the appropriate field technician so they can replace the card. The application also creates a ticket if the same card continues to fail, even if each failure is resolved by resetting it.
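The routine just described can be sketched in Python as follows. This is not our production code. The three callables (`is_out_of_service`, `reset_card`, `create_ticket`) are hypothetical stand-ins for the real network and ticketing interfaces, and `repeat_limit` is an assumed threshold for "continues to fail" that I picked for illustration.

```python
from collections import Counter
from typing import Callable


class CardAlarmHandler:
    """Verify, reset, and ticket a failed card (illustrative sketch only)."""

    def __init__(self,
                 is_out_of_service: Callable[[str], bool],
                 reset_card: Callable[[str], bool],
                 create_ticket: Callable[[str, str], None],
                 repeat_limit: int = 3) -> None:
        # All three callables are hypothetical hooks supplied by the caller.
        self.is_out_of_service = is_out_of_service
        self.reset_card = reset_card
        self.create_ticket = create_ticket
        self.repeat_limit = repeat_limit  # assumed threshold, not from the real system
        self.reset_counts: Counter = Counter()
        self.successes = 0

    def handle(self, card_id: str) -> str:
        # Step 1: confirm the card is really out of service.
        if not self.is_out_of_service(card_id):
            return "no_action"
        # Step 2: attempt a reset; if it fails, dispatch a technician.
        if not self.reset_card(card_id):
            self.create_ticket(card_id, "reset failed; replace card")
            return "ticket_reset_failed"
        # Step 3: count the success.
        self.successes += 1
        self.reset_counts[card_id] += 1
        # Step 4: a card that keeps failing gets a ticket even though resets work.
        if self.reset_counts[card_id] >= self.repeat_limit:
            self.create_ticket(card_id, "repeated failures; replace card")
            self.reset_counts[card_id] = 0
            return "ticket_repeat_failure"
        return "reset_ok"


# Usage with stand-in functions: the card named "bad" never resets successfully.
tickets = []
handler = CardAlarmHandler(
    is_out_of_service=lambda card: True,
    reset_card=lambda card: card != "bad",
    create_ticket=lambda card, reason: tickets.append((card, reason)),
    repeat_limit=2,
)
print(handler.handle("bad"))
print(handler.handle("ok"))
print(handler.handle("ok"))
```

Passing the checks and actions in as functions keeps the decision logic separate from whatever equipment or ticketing system sits behind it.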

That is an elegant solution. One alarm condition down, thousands to go. The goal is to get to a "dark NOC," or in other words, an unmanned operations center. In reality, there would always be personnel monitoring the automation systems, but the heavy lifting would be done by machines. There is also no avoiding the field technician. While we can decrease their workload, there will always be a need for eyes, arms and legs at the location to replace equipment that cannot be remotely repaired.

We all have seen science fiction stories where machines do the work of men. We are getting very close, but you cannot see how close until you step back and see just how far we have come in a few years. Thank you, HAL.

Why, you are very welcome, Frank.
