Computer reliability is an important issue in all phases of petroleum computing. Not only is a computer failure annoying, but it also reduces productivity and can even result in disrupted or unsafe operations. We all want our computers to be the most reliable available, but is leading edge equipment really necessary for every application, and what are the costs associated with attaining such a goal?
Based on ongoing research into petroleum computing that spans several years, this article examines the reliability of computer servers running both Unix and Windows NT and provides insights into this issue.
In the past two years, one of the most significant trends in the computing sector of the oil and gas industry has been the emergence of Microsoft Windows NT, now called Windows 2000, as a universal platform for petroleum computing. The petroleum industry has become dependent on high performance scientific software applications based upon Unix. However, in recent years, Windows has supplanted Unix in many industries such as aerospace, automotive, and pharmaceuticals.
With Microsoft Windows already dominant in the personal computing sphere and outselling Unix by 1,000:1, Windows NT promises to replace Unix machines in nearly every application area in petroleum computing. However, there remains some uncertainty regarding the reliability of Windows NT and whether or not it is robust enough to support mission critical applications - those where unscheduled downtime is not acceptable.
This article draws upon research in the petroleum computing sector gathered over the past 12 years to investigate how the two operating systems, Windows NT and Unix, compare and how they are likely to emerge from the transitional period over the next two years.
Critical reliability
Every industry has applications that are sensitive to interruption or which are so vital to the function of the business that any interruption causes disruption and loss of business or profits. Such mission critical operations are obvious in industries like securities, where a computer failure can stop trading and literally shut down a business.
What are the mission critical applications for the oil and gas industry? There are certainly a few which come to mind:
- In refinery operations, if the SCADA system or process control mechanism fails, the refinery must shut down, halting production.
- In seismic data acquisition, failure of an onboard recording system can halt operations if no redundant system is available.
- In seismic processing and reservoir simulation, some complex jobs are not amenable to being interrupted and restarted: they either run to completion in a single pass or must be rerun from the beginning. Under certain circumstances, such as time pressure situations, these applications might also be considered mission critical.
- In drilling operations, failure of the control and safety systems could shut down the rig, resulting in lost downhole equipment or even the loss of the well itself - all very expensive outcomes.
Viewed as a whole, the oil and gas industry certainly has applications which may be defined as mission critical.
Reliability choices
At present, there are three approaches to ensuring computer systems are available and reliable: external redundancy, internal redundancy, and highly reliable technologies for individual components.
- External redundancy: Key components, or even the entire system, are mirrored, or duplicated, on a separate machine. Both machines are tied together via a network so that the operations carried out on the primary working machine are duplicated passively on the parallel machine. If the primary machine fails for any reason, software switches operations over to the backup machine with little interruption, and work continues; a minimal sketch of this failover logic appears after this list. The primary machine can then be repaired while the backup machine runs the applications. This is a relatively expensive option, since it requires two separate, identical systems.
- Internal redundancy: Another type of redundancy involves duplicating components within a single computer system. Mirroring data on duplicate disks, providing separate processor pipelines so that if one goes down the other still works, and using hot-swappable components that allow faulty parts to be replaced while the system continues to run all help create such an internally redundant system.
- Highly reliable technologies: Finally, designing components, including the operating system, with the most reliable technologies available is another way of increasing the overall reliability of a computer system. In this, computer designers and manufacturers have been very successful. Vendors of servers based upon the Unix operating system have achieved phenomenal reliability figures, sometimes exceeding 99.999% uptime.
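To make the external redundancy approach more concrete, the Python sketch below monitors a primary server with a simple heartbeat check and redirects work to the backup when the primary stops responding. It is only an illustration of the idea; the host name, port, and timing values are hypothetical assumptions, not details of any vendor's failover product.

import socket
import time

PRIMARY = ("primary.example.com", 7000)  # hypothetical heartbeat endpoint
HEARTBEAT_TIMEOUT = 5.0                  # seconds to wait before declaring failure
CHECK_INTERVAL = 1.0                     # seconds between checks

def primary_alive() -> bool:
    """Return True if the primary server accepts a TCP connection."""
    try:
        with socket.create_connection(PRIMARY, timeout=HEARTBEAT_TIMEOUT):
            return True
    except OSError:
        return False

def promote_backup() -> None:
    """Placeholder for switching the workload to the mirrored backup machine."""
    print("Primary unreachable - redirecting operations to the backup server.")

def monitor() -> None:
    # Poll the primary; on the first missed heartbeat, fail over to the backup.
    while primary_alive():
        time.sleep(CHECK_INTERVAL)
    promote_backup()

if __name__ == "__main__":
    monitor()

Real failover software is, of course, far more elaborate - it must also resynchronize the two machines once the primary is repaired - but the basic detect-and-switch loop is the same.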
Value chain uptime
The reliability of a computer system is represented as a percentage of the time it operates versus the time it is inoperable. Reliability can also be expressed as mean time between failures, or the average interval expressed in hours between machine crashes. However it is expressed, the reliability of the system directly relates to its value. The more reliable the machine, the fewer times the critical job stream will be interrupted and the fewer repairs that will be necessary.
Of course, we all know that when a machine goes down for any reason, there will be an interval ranging from hours to days during which it will not be operational. Thus, a computer system that averages only five hours of unscheduled downtime per year may, realistically, take a full day to fix once it does fail, and attaining that average then requires four or five comparable machines that do not fail at all. Although average reliability figures can therefore be ambiguous, they are still a valuable indication of how a machine is likely to perform.
Nearly all servers sold for any application boast an uptime average of better than 95%, meaning the system operates reliably 95% of the time. Surely any machine that is 95% reliable must be pretty good - not necessarily.
A 95% reliable machine goes down for an average of about 1.2 hours per 24-hr day. Even a 99% reliable computer goes down unexpectedly for roughly one hour every four days. For many applications this is acceptable, even if annoying. However, for a mission critical or near critical application, it may be costly and unacceptable.
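The arithmetic behind those figures is simple, as the short Python sketch below illustrates for several uptime percentages (the percentages are generic examples, not measurements from any particular server):

HOURS_PER_DAY = 24
HOURS_PER_YEAR = 365 * 24  # 8,760 hours

for uptime in (0.95, 0.99, 0.9995, 0.99999):
    downtime_per_day = (1 - uptime) * HOURS_PER_DAY    # average hours down per day
    downtime_per_year = (1 - uptime) * HOURS_PER_YEAR  # average hours down per year
    print(f"{uptime:.3%} uptime -> {downtime_per_day:.2f} hr/day, "
          f"{downtime_per_year:.1f} hr/year")

At 95% uptime the downtime works out to roughly 1.2 hours per day (over 400 hours per year), while a five nines machine (99.999%) is down less than an hour per year.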
Unix machines from Auspex, among others, have achieved documented reliability figures of better than 99.995% in real use, not in the lab. No reportable figures are in yet for the new Auspex 4Front servers - machines that are bilingual in both Unix and Windows NT - but reports to date indicate similar reliability numbers.
Recently, both Data General and Compaq have reported lab test results for NT-based systems showing reliability of 99.95%. Both of these sets of results are based upon advanced system designs incorporating redundant elements to enhance reliability. These companies are to be commended for leading the way. Other major manufacturers are beginning to report reliability numbers on the same order.
Real costs
In terms of cost, what is the real value of the difference in reliability? How much reliability can you purchase for the server dollar? Where is the best value likely to be found?
The following table shows the results of our study. The table compares different types of servers, showing configuration, the average reliability numbers one can expect from them, and their purchase price. In the final column, we present a figure representing the annual cost of each system's downtime.
This number was derived by dividing the purchase price of the computer by the total number of running hours possible in one year. We then multiplied this cost-per-hour figure by the average number of down hours per year to arrive at the total cost of the downtime experienced, on average, by each computer. The lower the dollar figure in this column, the more cost effective the computer and, therefore, the better the investment for a mission critical application.
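As a rough illustration of that calculation, the Python sketch below spreads a server's purchase price over the 8,760 running hours in a year and multiplies by the expected down hours. The prices and uptime figures used here are hypothetical examples, not the values from the table.

HOURS_PER_YEAR = 365 * 24  # 8,760 running hours in one year

def downtime_cost(purchase_price: float, uptime: float) -> float:
    """Annual downtime cost: (price per running hour) x (hours down per year)."""
    cost_per_hour = purchase_price / HOURS_PER_YEAR
    down_hours = (1 - uptime) * HOURS_PER_YEAR
    return cost_per_hour * down_hours  # algebraically, purchase_price * (1 - uptime)

# Hypothetical comparison: a $150,000 Unix server at 99.995% uptime versus
# a $40,000 NT server at 99.5% uptime.
print(f"Unix server downtime cost: ${downtime_cost(150_000, 0.99995):,.0f} per year")
print(f"NT server downtime cost:   ${downtime_cost(40_000, 0.995):,.0f} per year")

Note that the formula reduces to the purchase price times the fraction of the year the machine is down, so a much cheaper server can still carry a higher derived downtime cost if its reliability is substantially lower.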
These calculations would seem to indicate that the Unix boxes with redundancy and/or mirroring still represent the best value for mission-critical applications. The lower initial cost of the NT servers is offset by the extremely high reliability numbers turned in by the Unix machines. However, the Unix machines do require a higher initial investment, and in general, the maintenance figures for Unix machines do tend to be higher than for their NT counterparts.
Conclusions
As users of these computers, we must ask ourselves the question: "When is the additional expense of the more efficient Unix servers justified over the more affordable, but less reliable, NT servers?" A corollary question is "How soon will the reliability of Windows NT servers equal that now being achieved by the Unix devices?" The answers to these questions are unique to each company, and the future is always an imponderable.
For mission critical applications, the reliability of higher cost server systems is justified. If one cannot afford downtime, the initial investment becomes less significant. The fact that the more reliable systems are also more cost effective is a bonus.
However, for ordinary daily activities such as interpretation, analysis, report generation, and modeling, it is perhaps not so critical to have a leading edge machine that is utterly reliable. Perhaps we can justify the advanced NT server, with a derived downtime cost figure of $205, over a Unix server that offers more than twice the reliability but at twice the purchase and maintenance cost. Does it really matter if the computer has to be restarted once a week rather than once per year? Perhaps not.
Also, as Windows NT continues to develop, and the impending release of Windows 2000 finally brings long-awaited capabilities to the operating system, it appears that Windows NT will be a viable operating system for the enterprise and for high-end scientific applications as well. Therefore, if NT is a priority in your company, you can reasonably expect to convert a large percentage of your Unix servers over effectively, without significant impact on productivity. The transition will not be painless or without compromise, but it should not be a profound negative.
For those with heavy investment in Unix systems, it will be possible to keep Unix where needed and to transition to the less expensive Windows NT platform where it makes sense. Certainly the reliability of the products will not be an undue hindrance, and the advent of bilingual systems, like those from Auspex, will make it possible to run both NT and Unix together on a single, ultra-reliable machine.
In the final analysis, the choice of system is uniquely dependent upon the needs and operational style of each individual company. A compatible, cost-effective solution can be derived for any set of requirements, incorporating the best of both server worlds at an affordable price.
Author
John Pohlman, a geophysicist, is President and Chief Researcher for Pohlman International, Inc, a Reno, Nevada-based research and consulting firm specializing in petroleum computing.
For further information, contact the author at Tel: 1-800-238-1268, (775) 787-1700, or website: www.pc-oil.com.
References
Materials in this article were derived from current research, including The Geoscience Workstation Market Survey©, Windows NT in Petroleum Computing©, The Linux Open-Source Operating System in Oil and Gas©, and Immersive Visualization Systems - Collaborative Computing for the New Millennium©.