Data-intensive science

This post is part of a project I’m running called the Framework for eResearch Adoption. The original post is here. There’ll be a series of posts on aspects of eResearch including data reuse, computationally intensive research, and virtual research collaborations.

The reuse and management of research data is becoming increasingly important. Data-intensive science represents a transition from traditional hypothesis and experimentation, to identifying patterns, and undertaking modelling and simulation using increasingly massive volumes of data collected by thousands of researchers the world over. This means more breakthroughs across research discipline boundaries, and more bang for the research buck.

Because of this, data reuse is rapidly becoming a focus of policy and funding agencies, internationally, and in New Zealand. Open data is now government policy [1]. Managing research data well is also becoming essential in ensuring the integrity, transparency and robustness of research, so it can be defended against criticism and attack.

This article explores the trends in research data reuse and management. Further posts will look at the current policy context in New Zealand, the future requirements for institutional data management, the risks of doing it poorly, and the benefits of doing it well.

What are the implications of these trends for eResearch and data reuse and management in New Zealand? What are the differences between Crown Research Institutes (CRIs) and universities in this regard? What do we need in place institutionally and nationally to support improved data management, and uptake of data-intensive scientific methods?

Please comment at the end of this article, or email julian.carver@seradigm.co.nz to give feedback.

Global Trends in Research Data Reuse and Management

The most important global trend impacting on research data management is the emergence of data-intensive science, or the ‘Fourth Paradigm’ [2]. This involves:

  • The transition of science from 1) empirical observation of the natural world, to 2) theory-based work with hypotheses and experimentation, to 3) computational work using modelling and simulation, and now to 4) data-intensive ‘eScience’.
  • Collecting more and more data through automated systems including sensor networks, large and small instruments, DNA sequencing, and satellite imagery. This means data is not just collected in a bare minimum sense specifically for each research project; instead, oceans of data are becoming available for use by many different researchers.
  • The huge increase in data volumes enables researchers to sift through the data, identify patterns, draw connections, develop and test hypotheses, and run experiments ‘in-virtuo’ using simulations.
  • This unification through information systems of the processes of observation, hypothesis, experimentation and simulation means science can tackle bigger problems, on larger scales, and involving greater numbers of researchers across the globe.

The transition to data-intensive science includes a number of specific trends in the research sector. These are as follows.

Data collection and aggregation:

  • The proliferation of automated data capture and collection technologies in almost every field of research (e.g. fMRI in neuroscience, EM soil mapping, bathymetry, and satellite imagery such as NZ’s Land Cover Database)
  • Increased use of national level and global level discipline specific data repositories (e.g. GenBank, GBIF, GEOSS [3], NZ Social Science Data Service [4])
  • Different research disciplines are adopting data sharing and reuse, and the fourth paradigm in general, at different paces. The pace is often driven by whether very large scale infrastructure exists that stores data centrally by default (e.g. Hubble Space Telescope, Large Hadron Collider), and by the emergence of professional norms around central deposit of data on publication (e.g. GenBank).

Changes in the scale of simulation:

  • The nature of models is changing from representing small local systems to much broader spatial/temporal scopes, and simulations are being used to understand larger scale phenomena and make predictions (e.g. global climate simulations, detailed simulations of the human heart).
  • Models are becoming larger than any one researcher can program themselves, and are becoming large collaborative efforts.
  • Desktop computers are no longer sufficient to run models. Simulations require high performance computers and clusters, and they consume and generate massive amounts of data that need to be moved between research institutions across the globe.

Changes in the nature of and demand for verification/defensibility:

  • The grand challenges facing humanity require science to be done at large scales, and to challenge current consumption and behaviour patterns. This increasingly generates tension, and the scientific process is coming under much greater levels of scrutiny. Data on which conclusions are based, and the methods used to produce those results, have to be available for independent verification.
  • Scientific workflow technologies are emerging to automate and allow replication of data aggregation, analysis, interpretation and results, again opening research and the data underpinning that research to greater examination.

Discovery and access:

  • Discovery of relevant data is becoming an issue as the number of data sets, and the volume of data grows. Metadata catalogues and federated data search engines are becoming essential, as are data preservation and curation activities.
  • Researchers are starting to require the ability to trawl through other datasets and compare them automatically with their own, and then to drill down, look at the attributes that were measured, disambiguate terms, and determine to what extent the datasets are comparable.
  • Ways of structuring data to support discovery, access and comparison are being rapidly developed and adopted, including Linked Data, structured ontologies, and the semantic web.
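
As a rough illustration of the automated comparison described above, the following is a minimal sketch, not a description of any existing tool. It assumes each dataset is described only by a list of attribute names, and it uses a small hand-written synonym table as a stand-in for a structured ontology. The function names, the synonym table and the example attributes are all hypothetical.

```python
def canonical_attributes(attributes, synonyms):
    """Map each attribute name to a shared vocabulary term (case-insensitive)."""
    cleaned = (name.strip().lower() for name in attributes)
    return {synonyms.get(name, name) for name in cleaned}


def attribute_overlap(ours, theirs, synonyms):
    """Return the Jaccard similarity of two attribute lists and the shared terms."""
    a = canonical_attributes(ours, synonyms)
    b = canonical_attributes(theirs, synonyms)
    union = a | b
    shared = a & b
    # Two empty attribute lists are treated as having no measurable overlap.
    score = len(shared) / len(union) if union else 0.0
    return score, sorted(shared)


# Hypothetical synonym table standing in for a structured ontology.
SYNONYMS = {
    "temp": "temperature",
    "water temp": "temperature",
    "ph level": "ph",
}

score, shared = attribute_overlap(
    ["Temp", "pH", "Turbidity"],
    ["Water temp", "pH level", "Salinity"],
    SYNONYMS,
)
print(score, shared)
```

A score near 1 suggests the two datasets measured largely the same things, while the list of shared terms shows where a researcher could drill down further. In practice the term mapping would more likely come from a published ontology than from a hand-written table.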

Increased collaboration (nationally, internationally, and cross-disciplinary):

  • Increased specialisation in research expertise, and in support functions such as informatics & data management, means bigger research teams are necessary and in turn require more collaboration/coordination and processes to allow data sharing and reuse.
  • Methods such as remote diagnostics are being developed, where data will move between someone who needs to know an answer, and a specialist in another part of the world (e.g. high resolution video imaging in real time through a stereo microscope at a shipping port to a biosystematics expert in another country).
  • Increased engagement of ‘citizen scientists’, doing some of the work of data gathering, and crowdsourcing of data analysis (e.g. species observation networks, Goldcorp opening its prospecting data and offering a reward to geologists, prospectors and academics worldwide who helped it locate deposits)

Shifts in publication processes:

  • In some fields the number of publications from reuse of data is starting to outstrip the number based on primary collection of the data
  • Publishers are requiring datasets supporting the research to be lodged at the time of publication, given unique identifiers, and in some cases made available for review before papers are published.
  • Methods such as dataset citation and scientific workflows are emerging to cope with the need to manage complex data attribution chains.

These global trends in science are also driven by technology trends, including:

  • User expectations about search, discovery, visualisation, and collaboration tools are being set by global scale consumer level providers such as Google, Amazon and Facebook, which are funded commercially (e.g. Google’s annual R&D budget is NZD $2B, about the same as NZ’s entire science system spend).
  • Cloud computing is emerging as a way to significantly reduce the cost of commodity computing and data storage infrastructure, while massively increasing its scalability and widening access to it.
  • Mobile devices and their accompanying sensors, data storage and display technologies are being rapidly advanced by the consumer market (e.g. digital cameras, iPhones).
  • Software is increasingly shifting to the web as a delivery vehicle/user interface, software as a service is becoming more pervasive, and in many areas open source has become the dominant mode of software production

National Trends

Research data management is also impacted by national level trends, in other countries and in New Zealand specifically. These include:

  • The establishment of national centres to provide expert advice and services on data preservation, collection, curation and reuse (e.g. the Australian National Data Service, the UK Digital Curation Centre)
  • Increased coordination and sharing of data management infrastructure and tools across research institutions
  • The emerging requirement from research funding agencies for data management planning to be included in funding bids (e.g. the US National Science Foundation announced this in May 2010 [5], the UK Natural Environment Research Council has this as a requirement)
  • The rapid development of the ‘Open and Transparent Government’ movement in the last two years, meaning elevated expectations about data access from the public and politicians, and more public money being put into data infrastructure (e.g. the US Open Government Initiative)
  • Open access licensing frameworks being adopted by individual countries, often based on Creative Commons and/or open source licences (e.g. the UK Open Government Licence, the New Zealand Government Open Access and Licensing (NZGOAL) framework [6])
  • Increasing use of open data in public consultation processes (e.g. the recent consultation on the National Environmental Standard for Plantation Forestry in New Zealand [7], which used an online discussion forum and provided access to relevant government datasets)
  • The establishment of an ‘open data’ community outside of government and research organisations, whose members have the skills and desire to take publicly funded data and develop value added tools and services (e.g. GreenVictoria [8], a service aimed at increasing public awareness of climate change, using water consumption data and other Australian Government datasets; SwimWhere [9], an NZ mashup and iPhone app using water quality data)

In New Zealand the government has strongly signalled a move towards coordination and sharing of ICT systems, resources and data across the public sector. This is expressed in the recently released ‘Directions and Priorities for Government ICT’ [1]. This mandates the use of shared services where they are available. It also has a particular focus on open data, covered in Direction 2 ‘Support open and transparent government’ which includes the following priority:

“Support the public, communities and business to contribute to policy development and performance improvement”

It is accompanied by the following statements:

“Open and active release of government data will create opportunities for innovation, and encourage the public and government organisations to engage in joint efforts to improve service delivery.”

“Government data effectively belongs to the New Zealand public, and its release and re-use has the potential to:

  • allow greater participation in government policy development by offering insight and expert knowledge on released data (e.g. using geospatial data to analyse patterns of crime in communities)
  • enable educational, research, and scientific communities to build on existing data to gain knowledge and expertise and use it for new purposes”

Government agencies and research organisations in New Zealand are being encouraged to use NZGOAL rather than ‘all rights reserved’ copyright licences. There is an expectation from government that publicly funded research data be made openly available unless there are very good reasons not to (e.g. public safety, privacy, commercial sensitivity, exclusive use until after publication).

Archives New Zealand is currently planning a Government Digital Archive. This will enable Archives New Zealand to take in large-scale transfers of government agency digital records, such as email messages, videos, databases and electronic documents. The archive may also be able to take in historical research datasets where organisations are not able to archive and publish these themselves. This project is being done in collaboration with the National Library of New Zealand, and the existing infrastructure of the Library’s National Digital Heritage Archive (NDHA) will be leveraged to provide a full solution for digital public archives.

In the New Zealand research sector the National eScience Infrastructure (NeSI) [10] business case has recently been approved by Cabinet. NeSI represents the most significant infrastructure investment for New Zealand’s Science System in the last twenty years. It will provide a nationally networked virtual high performance computing and data infrastructure facility distributed across NZ’s research institutions. NeSI is an initiative led by Canterbury University, Auckland University, NIWA and AgResearch, and is supported by Otago University and Landcare Research. It will coordinate access to high performance computing facilities at these institutions, along with the BeSTGRID eScience data fabric, research tools and applications, and community engagement.

What do you think?

So, what are the implications of these trends for eResearch and data reuse and management in New Zealand? What are the differences between Crown Research Institutes (CRIs) and universities in this regard? What do we need in place institutionally and nationally to support improved data management, and uptake of data-intensive scientific methods?

Please share your thoughts by commenting on this post, or by emailing julian.carver@seradigm.co.nz. Feedback will be incorporated into the Framework for eResearch Adoption project.

References

  1. Directions and Priorities for Government ICT http://www.dia.govt.nz/Directions-and-Priorities-for-Government-ICT
  2. Hey, Tony; Stewart Tansley and Kristin Tolle, Eds. “The Fourth Paradigm: Data-Intensive Scientific Discovery.” Microsoft Research. Redmond, Wash: 2009. PDF at http://research.microsoft.com/en-us/collaboration/fourthparadigm/default.aspx
  3. The Global Earth Observation System of Systems (GEOSS) Geoportal http://www.earthobservations.org/geoss.shtml
  4. NZ Social Science Data Service http://www.nzssds.org.nz
  5. Scientists Seeking NSF Funding Will Soon Be Required to Submit Data Management Plans http://www.nsf.gov/news/news_summ.jsp?cntn_id=116928
  6. New Zealand Government Open Access and Licensing (NZGOAL) framework http://www.e.govt.nz/policy/nzgoal
  7. National Environmental Standard for Plantation Forestry in New Zealand http://www.mfe.govt.nz/laws/standards/forestry/index.html
  8. GreenVictoria http://www.ebudgetplanner.com/
  9. SwimWhere http://swimwhere.info/
  10. NeSI http://www.nesi.org.nz
