Facing the data deluge

Within the next decade, past, current and future satellite Earth Observation (EO) missions, extended in situ networks and supercomputer simulations will continue to accumulate huge volumes of data at ever-increasing resolution. Facing these streams of data pouring from space and from simulations requires tools and methods able to leverage such a wealth of information and to better link the past and near real-time elements of the complementary observing and modelling systems. This can only be achieved with dedicated infrastructures that dynamically process massive amounts of information and support retrospective analyses; such infrastructures will be key instruments for breakthroughs and will stimulate multidisciplinary Earth system research and applications in marine and climate sciences.

SST reconstruction: synoptic fields of sea surface temperature can be built from the full-resolution (1 km) data acquired over one day by polar-orbiting infrared sensors such as AVHRR onboard Metop (about 25 to 30 GB a day). In the left image, cloudy areas result in data gaps that can be filled by combining one satellite with other existing satellites providing similar observations (up to 10 simultaneously). This requires handling more than 200 GB for a single day of data, together with frequent reprocessing for improved sensor merging or better quality screening. Performing, and repeating, such processing at climate scale (over more than 10 to 20 years; sea surface temperature data have been routinely available from space since 1981) currently demands such an effort that only space agency initiatives or programs with large funding can afford it.

Keeping long and massive mission archives alive, by raising the level of data revisiting through multiple applications, demonstration products or services, or extensive data reprocessing, is a major concern of CERSAT as a long-term multi-mission data archiving center. An increasing number of EO missions and improved in situ networks will further complement this existing data stream in the coming years, appending hundreds of terabytes to the existing databases. This can provide an unprecedented capacity to observe the ocean and the atmosphere, but fully benefiting from this wealth of information requires accelerating the development of robust, dedicated processing infrastructures that combine data mining strategies, mass reprocessing capabilities and simulations.
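As a rough illustration of the kind of processing involved, the sketch below merges several gridded, cloud-masked SST fields (one per sensor, already remapped to a common daily grid) into a single synoptic field by averaging the valid observations in each cell. It is a minimal Python sketch run on synthetic arrays; the grid size, data layout and file handling are assumptions for illustration, not the operational CERSAT processing chain.

    import numpy as np

    def merge_daily_sst(fields):
        """Merge per-sensor daily SST grids (NaN marks cloudy or missing pixels)
        into one synoptic field by averaging the valid observations per cell."""
        stack = np.stack(fields)                   # shape: (n_sensors, n_lat, n_lon)
        n_obs = np.sum(~np.isnan(stack), axis=0)   # sensors contributing to each cell
        sums = np.nansum(stack, axis=0)            # all-NaN cells sum to 0
        merged = np.where(n_obs > 0, sums / np.maximum(n_obs, 1), np.nan)
        return merged, n_obs

    if __name__ == "__main__":
        # Synthetic stand-ins for remapped AVHRR passes; real granules would be
        # read from the archive (e.g. netCDF files) instead of generated here.
        rng = np.random.default_rng(0)
        shape = (500, 500)
        fields = []
        for _ in range(3):
            sst = 15.0 + rng.normal(0.0, 0.5, shape)   # plausible SST in deg C
            sst[rng.random(shape) < 0.4] = np.nan      # ~40% cloud-masked pixels
            fields.append(sst)
        merged, n_obs = merge_daily_sst(fields)
        print("fraction of cells with at least one observation:", (n_obs > 0).mean())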

This has to be achieved at minimal infrastructure cost, optimizing all available resources while allowing maximum flexibility, in order to open access to these data and resources to a wider range of users and applications. Our ambition is to investigate and build on emerging technologies and research outcomes to offer such a service to the ocean community. Among the key aspects identified to provide efficient and cost-effective access to massive historical multi-mission archives are:

  • fast, online access to massive collections of data: all data shall be accessible in an automated way (without any human handling of media) and with very low latency; ideally, all data shall be stored on a single (or a few) virtual storage space built on clusters of hard drives, offering fast response times and wide extensibility. The data can then be accessed quickly at any time and for any usage. Big Data technologies now offer inexpensive solutions (used in massive data centers such as those of Google or Facebook) whose built-in flexibility and redundancy remove the need for strict, often expensive, hardware reliability.
  • avoiding data duplication and transfer to users: the impossibility and inefficiency of copying and moving large volumes of data that users cannot store anyway dramatically reduces the exploitation that could be made of these archives. Moving the processing instead of moving the data is a much more efficient way to deal with this issue, provided the processing capability is openly available and easily tailored to the user's needs through a remote processing service (a minimal illustration of this pattern is sketched after this list).
  • minimizing the time from algorithm development to processing: if a user or application remotely triggers processing on the archive, the processing service has to be flexible enough for heterogeneous software, environments and applications to run seamlessly on the same physical platform.
  • allowing fast and easy-to-manage large-scale reprocessing: a smart combination must be sought of hardware optimization (minimizing disk or network bandwidth consumption by locating the processing as close to the physical storage as possible), data replication, and batch processing and reporting software tools to easily distribute and monitor the processing load.
  • improving data storage and management: this ranges from the choice of appropriate data formats (with respect to storage footprint, community standards and long-term preservation) to the organization of the data collections and the management of processing versions.
  • intelligent, dynamic and thematic data indexing: navigating through massive collections of data requires advanced search capabilities to go straight to the information relevant to the user's focus, offering services comparable to what search engines and the semantic web now widely provide for web resources (a small metadata-index sketch is given after this list).
  • reduction of the datasets through feature extraction or data mining techniques, in order to retrieve the meaningful content from what is sometimes a largely redundant, oversampled or irrelevant mass of information.
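To make the "move the processing, not the data" idea above more concrete, the following sketch shows the general pattern: a user supplies a small reduction function, it is executed in parallel next to the archive, and only the small results travel back. The archive layout, granule format and function names are hypothetical; an operational service would add authentication, job scheduling and monitoring on top of this.

    import glob
    from concurrent.futures import ProcessPoolExecutor

    import numpy as np

    def mean_sst(path):
        """Hypothetical user function: reduce one granule to a single scalar.
        Granules are assumed here to be plain NumPy .npy grids for simplicity."""
        field = np.load(path)
        return path, float(np.nanmean(field))

    def run_on_archive(pattern, user_func, workers=8):
        """Run a user-supplied reduction where the data sit and return only the
        small results, instead of shipping whole granules to the user."""
        paths = sorted(glob.glob(pattern))
        with ProcessPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(user_func, paths))

    if __name__ == "__main__":
        # Hypothetical archive layout; only per-granule means cross the network.
        for path, value in run_on_archive("/archive/avhrr/2009/*/sst_*.npy", mean_sst):
            print(path, value)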
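Similarly, the "go straight to the relevant data" requirement boils down to maintaining a searchable metadata index alongside the archive. The sketch below uses an in-memory SQLite table holding the time coverage and bounding box of each granule and queries it for a region and time window; the schema and field names are illustrative assumptions, not an existing catalogue format.

    import sqlite3

    # Minimal granule metadata index: time coverage and bounding box per file.
    SCHEMA = """
    CREATE TABLE IF NOT EXISTS granules (
        path TEXT PRIMARY KEY,
        start_time TEXT, end_time TEXT,
        lat_min REAL, lat_max REAL,
        lon_min REAL, lon_max REAL
    )"""

    def find_granules(db, t0, t1, lat_min, lat_max, lon_min, lon_max):
        """Return paths of granules whose coverage intersects the query window/box."""
        cur = db.execute(
            "SELECT path FROM granules "
            "WHERE end_time >= ? AND start_time <= ? "
            "AND lat_max >= ? AND lat_min <= ? "
            "AND lon_max >= ? AND lon_min <= ?",
            (t0, t1, lat_min, lat_max, lon_min, lon_max))
        return [row[0] for row in cur]

    if __name__ == "__main__":
        db = sqlite3.connect(":memory:")
        db.execute(SCHEMA)
        db.execute("INSERT INTO granules VALUES (?,?,?,?,?,?,?)",
                   ("/archive/avhrr/2009/001/sst_0130.nc",
                    "2009-01-01T01:30", "2009-01-01T03:10",
                    -60.0, 60.0, -30.0, 10.0))
        print(find_granules(db, "2009-01-01T00:00", "2009-01-02T00:00",
                            40.0, 55.0, -20.0, 0.0))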

Ongoing efforts

We have recently undertaken a major effort to tackle these issues through innovative projects, partnerships and demonstrations, which we intend to build on over the coming years.

In particular, these needs have led us to investigate the available technologies for large distributed file systems, virtualization, on-demand allocation of processing resources, and optimal job sequencing and monitoring, in order to build a dedicated demonstration platform for mass data archiving and processing that also allows more traditional access (NFS, FTP, etc.) and thus eases a paced transition to these new processing paradigms. This resulted in the Nephelae demonstration platform described in this section.