Unstructured data management shifts industry to unified IT infrastructure

Feb. 10, 2015

Manuel Terranova
Peaxy

Managing the unstructured data generated by seismic imaging is a major investment of money, resources, and equipment. Moreover, as increased resolution continues to come online, in-field survey technologies and processing techniques stretch decades-old computing constructs. Raw data sizes are also growing exponentially. Advances in seismic data acquisition technology are one factor in this growth; in addition, post-processing, compression, high-fidelity master archiving, and data redundancy schemes all act as multipliers on the raw data ingest rate. For many organizations, this compounding problem quickly translates to 20 to 50 petabytes per annum saved onto one hard drive or tape among thousands.

This rapidly increasing accumulation of data will further challenge an industry that already struggles to give scientists and researchers ready access to data and to management tools that enable advanced analytics.

Unlike structured data, unstructured data often loses value when it is moved or migrated. Unstructured data that is moved, and by definition renamed, becomes very difficult to find again. Stated differently, because of the way today's dominant traditional storage systems are architected, hardware upgrades often orphan data from core analytics processes and from the geoscientists who need it to inform interpretations and decision-making.

At multi-petabyte ingest rates, seismic analytical tools, many of which were designed decades ago, are quickly showing their age. The industry is at the doorstep of a shift to more flexible, scalable constructs designed specifically to solve the problems of data access, data longevity, data management, and storage. While traditional storage technologies often perform effectively over the near term, traditional seismic architectures and monolithic data constructs leave a good deal of unstructured data value untapped. Generally, these systems were not designed to span multiple technology refreshes or facilitate data access over longer periods of time.

Multi-national offshore drilling companies, oil and gas companies, and others involved in seismic pursuits are starting to realize that the inability to re-harvest this data over decades is not just a technical issue, but also a business problem that can affect future competitiveness. Companies that treat this data as a business-critical asset are becoming aware of the shortcomings of the architectural constructs on which organizations have relied for the past 30 years. Finding and accessing data readily are now must-have capabilities, and they challenge traditional storage approaches that for decades have focused on getting data into the system. Until now, getting the data out again has been a tertiary or, at best, a secondary consideration, but not for much longer.

Hyperfiler is a data management system that allows companies to create a petabyte-scale "dataplane" that logically combines disparate datasets. (Photo courtesy Peaxy)

Mission-critical data

At a basic level, seismic surveys collect very large amounts of raw data from sensors that are then filtered by supercomputers or other computational constructs to extract useful information to be analyzed by geoscientists. When a seismic survey is under way, the initial concern centers on the massive and unstructured nature of the raw data produced, which can range in size from 100 to 400 TB a day. These files are subject to further processes that can distill useful information from the sea of noise. Today's practice is to put this information into a physical storage construct of some sort, a process that will eventually fail to maximize the availability of data to teams spread over time and space. The problem is that once the data is dropped into this "storage bucket," major hurdles are encountered when scientists try to repurpose these datasets.
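As a rough illustration of how quickly those daily volumes compound, the sketch below multiplies the per-day figures cited above over a range of acquisition-day counts; the day counts are assumptions chosen only to show the order of magnitude, not figures from any particular survey.

```python
# Back-of-envelope sketch: how 100-400 TB of raw data per acquisition day
# compounds toward the multi-petabyte annual totals discussed in this article.
# The acquisition-day counts below are illustrative assumptions.
TB_PER_PB = 1_000

for tb_per_day in (100, 400):
    for acquisition_days in (90, 365):
        total_pb = tb_per_day * acquisition_days / TB_PER_PB
        print(f"{tb_per_day} TB/day over {acquisition_days} days "
              f"~= {total_pb:.1f} PB of raw data")
```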

To maximize the production levels of reservoirs, companies need to be able to compare surveys of the same reservoirs taken five, 10, or 30 years apart from one another. While this is technically possible with systems currently in place, as a practical matter, traditional data architectures require too much pre-processing or specialized technical expertise to readily facilitate that kind of comparison. In many cases, decisions about where datasets should be stored are driven by tactical considerations such as costs and dwindling storage space. The end-user access need has little bearing on where and how data is stored. This makes things challenging enough, but the ever-persistent IT technology refresh cycle presents an even more formidable challenge for engineers trying to keep track of and access this data.

Lost data

The main problem is that every time data is moved, pathnames change, links get broken, and the data map (which records where particular datasets are stored and how to access and use them) becomes increasingly fragmented. This approach relies on so-called tribal knowledge of this map in order to be effective over time, but when employee turnover over decades is factored in, that knowledge is almost certain to be lost. When it does inevitably break down, data can no longer be reasonably located and is, in effect, gone.
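A minimal sketch of this failure mode, using purely hypothetical pathnames: a workflow that hard-codes the physical location of a survey file silently loses the dataset the moment a migration changes that location.

```python
from pathlib import Path

# Hypothetical illustration of the "lost data" problem: the workflow below
# references a survey by its physical location on a specific storage array.
SURVEY_PATH = Path("/mnt/array07/seismic/north_sea/2004/survey.segy")

def load_survey(path: Path) -> bytes:
    """Read a survey file from a fixed physical path."""
    if not path.exists():
        # After a hardware refresh the file may now live on a different
        # array (say, /mnt/array12/...). Nothing updates this constant, so
        # the dataset goes dark for everyone who relied on the old path.
        raise FileNotFoundError(
            f"Survey not found at {path}; its physical location may have "
            "changed during the last migration"
        )
    return path.read_bytes()
```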

Expensive seismic surveying data "going dark" is a major cost; either that data will have to be re-collected, or in the case of historic data, it may simply be irretrievably lost and unavailable to future projects and new opportunities.

Simply put, there is a disconnect between the fundamental way data needs to be accessed and the structure of the underlying hardware and storage systems. Engineers and geoscientists need to access seismic data both today and years from now, a requirement that is at odds with the nature of technology upgrades and with the physical structure of storage systems that change and evolve over time.

Infrastructure for data longevity

Fortune 1000 companies and smaller seismic players that serve the offshore industry are expected to confront a cataclysmic change over the next 24 months. Given the exponential growth and revenues associated with the data, companies will have to find a way to not only locate all of this "dark" data, but also re-architect their computing infrastructure so that research teams can immediately access and collaborate on it just as easily 30 years from now as they can this week. Additionally, they will have to do it in a cost-effective manner in both the short and long terms without compromising scale and performance.

For example, GE Oil & Gas is one of a number of organizations tackling the inconvenient realities of constantly shifting hardware through a virtualization-centric data architecture that creates a unified namespace for large unstructured datasets. These new approaches essentially make data accessible, even datasets that are highly distributed across storage media and around the globe, from a single virtual space that does not change over time. From the user's perspective, this eliminates the heavy dependence on tribal knowledge for data location and preserves access to datasets that would otherwise be lost in the next refresh cycle.
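A minimal sketch of the concept, assuming a hypothetical resolver rather than any vendor's actual interface: workflows address data by a stable logical name, and only the mapping to physical locations changes when the hardware does.

```python
from pathlib import Path

class UnifiedNamespace:
    """Toy resolver mapping stable logical names to current physical paths.

    This illustrates the unified-namespace idea only; it is not Hyperfiler's
    or any other product's actual API.
    """

    def __init__(self) -> None:
        self._catalog: dict[str, Path] = {}

    def register(self, logical_name: str, physical_path: Path) -> None:
        """Record (or update) where a dataset currently lives."""
        self._catalog[logical_name] = physical_path

    def resolve(self, logical_name: str) -> Path:
        """Return the current physical path for a stable logical name."""
        return self._catalog[logical_name]


ns = UnifiedNamespace()
ns.register("north_sea/2004/survey", Path("/mnt/array07/ns2004.segy"))

# A later hardware refresh only updates the mapping; the logical name that
# geoscientists and their tools reference never changes.
ns.register("north_sea/2004/survey", Path("/mnt/array12/ns2004.segy"))
print(ns.resolve("north_sea/2004/survey"))
```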

Solutions like these ultimately require a rethink of modern IT infrastructure. They require oil and gas companies to collaborate with engineering teams to design a new technological approach to managing and optimizing seismic data based on the following tenets:

Mission-critical unstructured datasets. These are precious assets that need to be maintained and re-harvested over their lifespan. Business leaders and heads of engineering need to be actively involved with IT in preserving these datasets and enabling geophysicists to manipulate them at any given time. These new requirements of data access, data longevity, and data location will join the traditional storage considerations: scale, performance, and cost.

IT architecture needs to evolve. Geoscientists and data analysts have a fundamental need for ready access to data, yet IT architecture has not evolved to meet that need. Today, it is too costly in terms of money and man-hours to keep track of mission-critical seismic datasets over time. Many researchers and engineers lose 45 to 120 minutes a day managing, deleting, and finding unstructured seismic and related data.

Tribal knowledge is an ineffective long-term strategy. Over the course of decades, people who know where data is stored will leave the organization, and knowledge of its location will deteriorate over time. The amount of value organizations can derive from their unstructured seismic datasets depends on hosting the data in a solution that allows end users to find and access data easily and readily, even without tribal knowledge.

The author

Manuel Terranova is president and CEO of Peaxy. Before co-founding the company, Terranova spent 13 years in various roles at General Electric, most recently as senior vice president at its Drilling and Production business.