
Scientific Computing and Data Management

Description

This cluster of papers focuses on the management, reproducibility, and provenance of scientific workflows, particularly in the fields of bioinformatics and computational research. It explores topics such as data provenance, workflow management systems, semantic web services, cyberinfrastructure, and software development for scientific applications.

Keywords

Scientific Workflows; Reproducibility; Data Provenance; Workflow Management; Bioinformatics; Semantic Web Services; Cyberinfrastructure; Computational Research; Software Development; Ontologies

As computational work becomes more and more integral to many aspects of scientific research, computational reproducibility has become an issue of increasing importance to computer systems researchers and domain scientists alike. Though computational reproducibility seems more straightforward than replicating physical experiments, the complex and rapidly changing nature of computer environments makes being able to reproduce and extend such work a serious challenge. In this paper, I explore common reasons that code developed for one research project cannot be successfully executed or extended by subsequent researchers. I review current approaches to these issues, including virtual machines and workflow systems, and their limitations. I then examine how the popular emerging technology Docker combines several areas from systems research, such as operating system virtualization, cross-platform portability, modular re-usable elements, versioning, and a 'DevOps' philosophy, to address these challenges. I illustrate this with several examples of Docker use with a focus on the R statistical environment.
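As a hedged illustration of the approach this paper surveys (not code from the paper itself), the short Python sketch below runs a version-pinned R environment through the Docker SDK for Python; the rocker/r-ver image tag and the R expression are illustrative assumptions.

    # Hedged sketch: execute an R computation inside a version-pinned Docker
    # image via the Docker SDK for Python. Image tag and R code are
    # illustrative, not taken from the paper.
    import docker

    client = docker.from_env()

    # rocker/r-ver pins the R version, so the same image reproduces the same
    # R environment on any host that can run Docker.
    logs = client.containers.run(
        image="rocker/r-ver:4.3.2",
        command=["Rscript", "-e", "set.seed(1); print(mean(rnorm(1e6)))"],
        remove=True,  # discard the container after it exits
    )
    print(logs.decode())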
The Konstanz Information Miner is a modular environment, which enables easy visual assembly and interactive execution of a data pipeline. It is designed as a teaching, research and collaboration platform, which enables simple integration of new algorithms and tools as well as data manipulation or visualization methods in the form of new modules or nodes. In this paper we describe some of the design aspects of the underlying architecture, briefly sketch how new nodes can be incorporated, and highlight some of the new features of version 2.0.
This article presents a new open-source software tool, SciMAT, which performs science mapping analysis within a longitudinal framework. It provides different modules that help the analyst to carry out all the steps of the science mapping workflow. In addition, SciMAT presents three key features that are remarkable with respect to other science mapping software tools: (a) a powerful preprocessing module to clean the raw bibliographical data, (b) the use of bibliometric measures to study the impact of each studied element, and (c) a wizard to configure the analysis.
Many scientific disciplines are now data and information driven, and new scientific knowledge is often gained by scientists putting together data analysis and knowledge discovery 'pipelines'. A related trend is that more and more scientific communities realize the benefits of sharing their data and computational services, and are thus contributing to a distributed data and computational community infrastructure (a.k.a. 'the Grid'). However, this infrastructure is only a means to an end and ideally scientists should not be too concerned with its existence. The goal is for scientists to focus on development and use of what we call scientific workflows. These are networks of analytical steps that may involve, e.g., database access and querying steps, data analysis and mining steps, and many other steps including computationally intensive jobs on high-performance cluster computers. In this paper we describe characteristics of and requirements for scientific workflows as identified in a number of our application projects. We then elaborate on Kepler, a particular scientific workflow system, currently under development across a number of scientific data management projects. We describe some key features of Kepler and its underlying Ptolemy II system, planned extensions, and areas of future research. Kepler is a community-driven, open source project, and we always welcome related projects and new contributors to join.
Data management is growing in complexity as large-scale applications take advantage of the loosely coupled resources brought together by grid middleware and by abundant storage capacity. Metadata describing the data products used in and generated by these applications is essential to disambiguate the data and enable reuse. Data provenance, one kind of metadata, pertains to the derivation history of a data product starting from its original sources. In this paper we create a taxonomy of data provenance characteristics and apply it to current research efforts in e-science, focusing primarily on scientific workflow approaches. The main aspect of our taxonomy categorizes provenance systems based on why they record provenance, what they describe, how they represent and store provenance, and ways to disseminate it. The survey culminates with an identification of open research problems in the field.
No abstract available.
 Computational science has led to exciting new developments, but the nature of the work has exposed limitations in our ability to evaluate published findings. Reproducibility has the potential to serve as a minimum standard for judging scientific claims when full independent replication of a study is not possible.
Computing in science and engineering is now ubiquitous: digital technologies underpin, accelerate, and enable new, even transformational, research in all domains. Access to an array of integrated and well-supported high-end digital services is critical for the advancement of knowledge. Driven by community needs, the Extreme Science and Engineering Discovery Environment (XSEDE) project substantially enhances the productivity of a growing community of scholars, researchers, and engineers (collectively referred to as "scientists" throughout this article) through access to advanced digital services that support open research. XSEDE's integrated, comprehensive suite of advanced digital services federates with other high-end facilities and with campus-based resources, serving as the foundation for a national e-science infrastructure ecosystem. XSEDE's e-science infrastructure has tremendous potential for enabling new advancements in research and education. XSEDE's vision is a world of digitally enabled scholars, researchers, and engineers participating in multidisciplinary collaborations to tackle society's grand challenges.
 Verification and validation of numerical models of natural systems is impossible. This is because natural systems are never closed and because model results are always nonunique. Models can be confirmed by the demonstration of agreement between observation and prediction, but confirmation is inherently partial. Complete confirmation is logically precluded by the fallacy of affirming the consequent and by incomplete access to natural phenomena. Models can only be evaluated in relative terms, and their predictive value is always open to question. The primary value of models is heuristic.
 Raster3D Version 2.0 is a program suite for the production of photorealistic molecular graphics images. The code is hardware independent, and is particularly suited for use in producing large raster images of macromolecules for output to a film recorder or high-quality color printer. The Raster3D suite contains programs for composing illustrations of space-filling models, ball-and-stick models and ribbon-and-cylinder representations. It may also be used to render figures composed using other graphics tools, notably the widely used program Molscript [Kraulis (1991). J. Appl. Cryst. 24, 946-950].
Using an open-access distribution model, the Crystallography Open Database (COD, http://www.crystallography.net) collects all known 'small molecule / small to medium sized unit cell' crystal structures and makes them available freely on the Internet. As of today, the COD has aggregated ∌150 000 structures, offering basic search capabilities and the possibility to download the whole database, or parts thereof using a variety of standard open communication protocols. A newly developed website provides capabilities for all registered users to deposit published and so far unpublished structures as personal communications or pre-publication depositions. Such a setup enables extension of the COD database by many users simultaneously. This increases the possibilities for growth of the COD database, and is the first step towards establishing a world wide Internet-based collaborative platform dedicated to the collection and curation of structural knowledge.
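As a sketch only (the entry identifier and the /cod/<id>.cif URL pattern are assumptions for illustration, not taken from the abstract), programmatic retrieval of a single structure over plain HTTP could look like this in Python:

    # Hedged sketch: download one CIF entry from the COD over HTTP.
    # The COD ID and URL pattern are illustrative assumptions.
    import requests

    cod_id = "1000000"  # hypothetical entry identifier
    url = f"http://www.crystallography.net/cod/{cod_id}.cif"

    response = requests.get(url, timeout=30)
    response.raise_for_status()

    with open(f"{cod_id}.cif", "w", encoding="utf-8") as fh:
        fh.write(response.text)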
 Science mapping aims to build bibliometric maps that describe how specific disciplines, scientific domains, or research fields are conceptually, intellectually, and socially structured. Different techniques and software tools have been proposed to carry out science mapping analysis. The aim of this article is to review, analyze, and compare some of these software tools, taking into account aspects such as the bibliometric techniques available and the different kinds of analysis.
Summary: The two main functions of bioinformatics are the organization and analysis of biological data using computational resources. Geneious Basic has been designed to be an easy-to-use and flexible desktop software application framework for the organization and analysis of biological data, with a focus on molecular sequences and related data types. It integrates numerous industry-standard discovery analysis tools, with interactive visualizations to generate publication-ready images. One key contribution to researchers in the life sciences is the Geneious public application programming interface (API) that affords the ability to leverage the existing framework of the Geneious Basic software platform for virtually unlimited extension and customization. The result is an increase in the speed and quality of development of computation tools for the life sciences, due to the functionality and graphical user interface available to the developer through the public API. Geneious Basic represents an ideal platform for the bioinformatics community to leverage existing components and to integrate their own specific requirements for the discovery, analysis and visualization of biological data. Availability and implementation: Binaries and public API freely available for download at http://www.geneious.com/basic, implemented in Java and supported on Linux, Apple OSX and MS Windows. The software is also available from the Bio-Linux package repository at http://nebc.nerc.ac.uk/news/geneiousonbl. Contact: [email protected]
Increased reliance on computational approaches in the life sciences has revealed grave concerns about how accessible and reproducible computation-reliant results truly are. Galaxy (http://usegalaxy.org), an open web-based platform for genomic research, addresses these problems. Galaxy automatically tracks and manages data provenance and provides support for capturing the context and intent of computational methods. Galaxy Pages are interactive, web-based documents that provide users with a medium to communicate a complete computational analysis.
 Basis sets are some of the most important input data for computational models in the chemistry, materials, biology, and other science domains that utilize computational quantum mechanics methods. Providing a shared, Web-accessible environment where researchers can not only download basis sets in their required format but browse the data, contribute new basis sets, and ultimately curate and manage the data as a community will facilitate growth of this resource and encourage sharing both data and knowledge. We describe the Basis Set Exchange (BSE), a Web portal that provides advanced browsing and download capabilities, facilities for contributing basis set data, and an environment that incorporates tools to foster development and interaction of communities. The BSE leverages and enables continued development of the basis set library originally assembled at the Environmental Molecular Sciences Laboratory.
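A hedged sketch of programmatic access: the present-day BSE project also distributes a Python library, basis_set_exchange, and assuming that library is available, a basis set can be pulled directly into a script. The basis name, elements, and output format below are illustrative choices, and the library is distinct from the web portal the paper describes.

    # Hedged sketch: fetch a basis set with the basis_set_exchange Python
    # library that accompanies the modern BSE project. Basis name, elements
    # and output format are illustrative.
    import basis_set_exchange as bse

    # cc-pVDZ for hydrogen and oxygen, rendered as NWChem input text.
    basis_text = bse.get_basis("cc-pvdz", elements=[1, 8], fmt="nwchem")
    print(basis_text)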
The Bioconductor project is an initiative for the collaborative creation of extensible software for computational biology and bioinformatics. The goals of the project include: fostering collaborative development and widespread use of innovative software, reducing barriers to entry into interdisciplinary scientific research, and promoting the achievement of remote reproducibility of research results. We describe details of our aims and methods, identify current challenges, compare Bioconductor to other open bioinformatics projects, and provide working examples.
Motivation: In silico experiments in bioinformatics involve the co-ordinated use of computational tools and information repositories. A growing number of these resources are being made available with programmatic access in the form of Web services. Bioinformatics scientists will need to orchestrate these Web services in workflows as part of their analyses. Results: The Taverna project has developed a tool for the composition and enactment of bioinformatics workflows for the life sciences community. The tool includes a workbench application which provides a graphical user interface for the composition of workflows. These workflows are written in a new language called the simple conceptual unified flow language (Scufl), whereby each step within a workflow represents one atomic task. Two examples are used to illustrate the ease by which in silico experiments can be represented as Scufl workflows using the workbench application. Availability: The Taverna workflow system is available as open source and can be downloaded with example Scufl workflows from http://taverna.sourceforge.net
High-throughput data production technologies, particularly 'next-generation' DNA sequencing, have ushered in widespread and disruptive changes to biomedical research. Making sense of the large datasets produced by these technologies requires sophisticated statistical and computational methods, as well as substantial computational power. This has led to an acute crisis in life sciences, as researchers without informatics training attempt to perform computation-dependent analyses. Since 2005, the Galaxy project has worked to address this problem by providing a framework that makes advanced computational tools usable by non-experts. Galaxy seeks to make data-intensive research more accessible, transparent and reproducible by providing a Web-based environment in which users can perform computational analyses and have all of the details automatically tracked for later inspection, publication, or reuse. In this report we highlight recently added features enabling biomedical analyses on a large scale.
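For a flavor of how such tracked analyses can be driven programmatically, the sketch below uses BioBlend, the community Python client for Galaxy's REST API; the server URL, API key, and input file are placeholders, and this is an illustration rather than code from the Galaxy papers.

    # Hedged sketch: talk to a Galaxy server through BioBlend. URL, API key
    # and input file are placeholders.
    from bioblend.galaxy import GalaxyInstance

    gi = GalaxyInstance(url="https://usegalaxy.org", key="YOUR_API_KEY")

    # Create a history and upload a dataset into it; Galaxy then tracks the
    # provenance of every analysis step applied to that dataset.
    history = gi.histories.create_history(name="demo-analysis")
    upload = gi.tools.upload_file("reads.fastq", history["id"])
    print(history["id"], upload["outputs"][0]["id"])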
 It is increasingly necessary for researchers in all fields to write computer code, and in order to reproduce research results, it is important that this code is published. We present Jupyter notebooks, a document format for publishing code, results and explanations in a form that is both readable and executable. We discuss various tools and use cases for notebook documents.
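A minimal sketch of the document format in practice, assuming the nbformat reference library: a notebook mixing narrative and executable cells can be composed programmatically (the cell contents are illustrative).

    # Hedged sketch: build an .ipynb document with nbformat.
    import nbformat
    from nbformat.v4 import new_notebook, new_code_cell, new_markdown_cell

    nb = new_notebook()
    nb.cells = [
        new_markdown_cell("# Analysis\nExplanation of the result below."),
        new_code_cell("x = [1, 2, 3]\nprint(sum(x) / len(x))"),
    ]

    with open("analysis.ipynb", "w", encoding="utf-8") as fh:
        nbformat.write(nb, fh)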
 Here we present Singularity, software developed to bring containers and reproducibility to scientific computing. Using Singularity containers, developers can work in reproducible environments of their choosing and design, and these complete environments can easily be copied and executed on other platforms. Singularity is an open source initiative that harnesses the expertise of system and software engineers and researchers alike, and integrates seamlessly into common workflows for both of these groups. As its primary use case, Singularity brings mobility of computing to both users and HPC centers, providing a secure means to capture and distribute software and compute environments. This ability to create and deploy reproducible environments across these centers, a previously unmet need, makes Singularity a game changing development for computational science.
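As a hedged sketch of typical use (the image file and the command run inside it are placeholders, and a local Singularity installation is assumed), a containerized step can be invoked from Python like this:

    # Hedged sketch: run a command inside a Singularity image via the CLI.
    import subprocess

    result = subprocess.run(
        ["singularity", "exec", "analysis.sif", "python3", "--version"],
        capture_output=True,
        text=True,
        check=True,
    )
    print(result.stdout.strip())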
UCSF ChimeraX is next-generation software for the visualization and analysis of molecular structures, density maps, 3D microscopy, and associated data. It addresses challenges in the size, scope, and disparate types of data attendant with cutting-edge experimental methods, while providing advanced options for high-quality rendering (interactive ambient occlusion, reliable molecular surface calculations, etc.) and professional approaches to software design and distribution. This article highlights some specific advances in the areas of visualization and usability, performance, and extensibility. ChimeraX is free for noncommercial use and is available from http://www.rbvi.ucsf.edu/chimerax/ for Windows, Mac, and Linux.
 Galaxy (homepage: https://galaxyproject.org, main public server: https://usegalaxy.org) is a web-based scientific analysis platform used by tens of thousands of scientists across the world to analyze large biomedical datasets such as those found in genomics, proteomics, metabolomics and imaging. Started in 2005, Galaxy continues to focus on three key challenges of data-driven biomedical science: making analyses accessible to all researchers, ensuring analyses are completely reproducible, and making it simple to communicate analyses so that they can be reused and extended. During the last two years, the Galaxy team and the open-source community around Galaxy have made substantial improvements to Galaxy's core framework, user interface, tools, and training materials. Framework and user interface improvements now enable Galaxy to be used for analyzing tens of thousands of datasets, and >5500 tools are now available from the Galaxy ToolShed. The Galaxy community has led an effort to create numerous high-quality tutorials focused on common types of genomic analyses. The Galaxy developer and user communities continue to grow and be integral to Galaxy's development. The number of Galaxy public servers, developers contributing to the Galaxy framework and its tools, and users of the main Galaxy server have all increased substantially.
 The development of magnetic resonance imaging (MRI) techniques has defined modern neuroimaging. Since its inception, tens of thousands of studies using techniques such as functional MRI and diffusion weighted imaging have allowed for the non-invasive study of the brain. Despite the fact that MRI is routinely used to obtain data for neuroscience research, there has been no widely adopted standard for organizing and describing the data collected in an imaging experiment. This renders sharing and reusing data (within or between labs) difficult if not impossible and unnecessarily complicates the application of automatic pipelines and quality assurance protocols. To solve this problem, we have developed the Brain Imaging Data Structure (BIDS), a standard for organizing and describing MRI datasets. The BIDS standard uses file formats compatible with existing software, unifies the majority of practices already common in the field, and captures the metadata necessary for most common data processing operations.
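A hedged sketch of what the standard looks like on disk (subject and task labels are invented; the BIDS specification defines the full set of required files and metadata): a minimal BIDS-style layout can be assembled with a few lines of Python.

    # Hedged sketch: minimal BIDS-style dataset skeleton.
    import json
    from pathlib import Path

    root = Path("my_bids_dataset")
    (root / "sub-01" / "anat").mkdir(parents=True, exist_ok=True)
    (root / "sub-01" / "func").mkdir(parents=True, exist_ok=True)

    # dataset_description.json is required at the dataset root.
    (root / "dataset_description.json").write_text(
        json.dumps({"Name": "Demo dataset", "BIDSVersion": "1.8.0"}, indent=2)
    )

    # File names are built from key-value "entities" such as subject and task.
    print(root / "sub-01" / "anat" / "sub-01_T1w.nii.gz")
    print(root / "sub-01" / "func" / "sub-01_task-rest_bold.nii.gz")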
ABSTRACT: Evaluating the effect of a remedial mathematics course in higher education: the case of MatemĂĄticas BĂĄsicas. This study evaluates the effects of taking a mandatory remedial course, offered only once (i.e., it cannot be repeated) to undergraduate students, on the probability of enrolling, performance in university mathematics courses, progress through the degree program, and the probability of graduating. The study uses a regression discontinuity design that exploits the fact that students admitted to the university whose mathematics score on the entrance exam falls below a threshold are required to take the basic mathematics remedial course. We find that the remedial course has no effect on the probability of enrolling, of dropping out of the program, or of graduating six years after admission. There is an effect
 Reproducibility is essential to reliable scientific discovery in high-throughput experiments. In this work we propose a unified approach to measure the reproducibility of findings identified from replicate experiments and identify putative discoveries using reproducibility. Unlike the usual scalar measures of reproducibility, our approach creates a curve, which quantitatively assesses when the findings are no longer consistent across replicates. Our curve is fitted by a copula mixture model, from which we derive a quantitative reproducibility score, which we call the "irreproducible discovery rate" (IDR) analogous to the FDR. This score can be computed at each set of paired replicate ranks and permits the principled setting of thresholds both for assessing reproducibility and combining replicates. Since our approach permits an arbitrary scale for each replicate, it provides useful descriptive measures in a wide variety of situations to be explored. We study the performance of the algorithm using simulations and give a heuristic analysis of its theoretical properties. We demonstrate the effectiveness of our method in a ChIP-seq experiment.
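Read as a hedged sketch with notation of our own choosing (not the paper's), the copula mixture behind the IDR can be summarized as a two-component model on latent replicate scores, with the local score averaged over the selected findings playing the role of a false discovery rate:

    % Hedged sketch of the IDR copula mixture; notation is ours.
    % With probability \pi_1 a finding is reproducible and its latent scores
    % in the two replicates are positively correlated; otherwise they are noise.
    (z_{i1}, z_{i2}) \sim
        \pi_0 \, \mathcal{N}\!\left(\mathbf{0}, \mathbf{I}_2\right)
      + \pi_1 \, \mathcal{N}\!\left(
            \begin{pmatrix} \mu_1 \\ \mu_1 \end{pmatrix},
            \begin{pmatrix} \sigma_1^2 & \rho_1 \sigma_1^2 \\
                            \rho_1 \sigma_1^2 & \sigma_1^2 \end{pmatrix}
        \right),
    \qquad \pi_0 + \pi_1 = 1 .
    % Observed ranks are linked to the latent scores only through their
    % marginals (the copula); the local score
    \operatorname{idr}(z_1, z_2)
      = \frac{\pi_0 f_0(z_1, z_2)}{\pi_0 f_0(z_1, z_2) + \pi_1 f_1(z_1, z_2)}
    % is then averaged over the selected findings to give the reported IDR.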
Data analysis often entails a multitude of heterogeneous steps, from the application of various command line tools to the usage of scripting languages like R or Python for the generation of plots and tables. It is widely recognized that data analyses should ideally be conducted in a reproducible way. Reproducibility enables technical validation and regeneration of results on the original or even new data. However, reproducibility alone is by no means sufficient to deliver an analysis that is of lasting impact (i.e., sustainable) for the field, or even just one research group. We postulate that it is equally important to ensure adaptability and transparency. The former describes the ability to modify the analysis to answer extended or slightly different research questions. The latter describes the ability to understand the analysis in order to judge whether it is not only technically, but methodologically valid. Here, we analyze the properties needed for a data analysis to become reproducible, adaptable, and transparent. We show how the popular workflow management system Snakemake can be used to guarantee this, and how it enables an ergonomic, combined, unified representation of all steps involved in data analysis, ranging from raw data processing, to quality control and fine-grained, interactive exploration and plotting of final results.
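As a hedged sketch of what such a unified representation looks like (file names and scripts are invented for illustration), a Snakefile, written in Snakemake's Python-based rule language, chains steps by declaring their inputs and outputs:

    # Hedged sketch of a Snakefile; paths and scripts are illustrative.
    rule all:
        input:
            "results/summary.pdf"

    rule clean_data:
        input:
            "data/raw.csv"
        output:
            "results/clean.csv"
        shell:
            "python scripts/clean.py {input} {output}"

    rule plot:
        input:
            "results/clean.csv"
        output:
            "results/summary.pdf"
        script:
            "scripts/plot.py"

Snakemake resolves the dependency graph from these declarations and re-runs only the steps whose inputs have changed.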
 Galaxy is a mature, browser accessible workbench for scientific computing. It enables scientists to share, analyze and visualize their own data, with minimal technical impediments. A thriving global community continues to use, maintain and contribute to the project, with support from multiple national infrastructure providers that enable freely accessible analysis and training services. The Galaxy Training Network supports free, self-directed, virtual training with >230 integrated tutorials. Project engagement metrics have continued to grow over the last 2 years, including source code contributions, publications, software packages wrapped as tools, registered users and their daily analysis jobs, and new independent specialized servers. Key Galaxy technical developments include an improved user interface for launching large-scale analyses with many files, interactive tools for exploratory data analysis, and a complete suite of machine learning tools. Important scientific developments enabled by Galaxy include Vertebrate Genome Project (VGP) assembly workflows and global SARS-CoV-2 collaborations.
In this work the software application called Glotaran is introduced as a Java-based graphical user interface to the R package TIMP, a problem solving environment for fitting superposition models to multi-dimensional data. TIMP uses a command-line user interface for the interaction with data, the specification of models and viewing of analysis results. Instead, Glotaran provides a graphical user interface which features interactive and dynamic data inspection, easier model specification (assisted by the user interface) and interactive viewing of results. The interactivity component is especially helpful when working with large, multi-dimensional datasets as often result from time-resolved spectroscopy measurements, allowing the user to easily pre-select and manipulate data before analysis and to quickly zoom in to regions of interest in the analysis results. Glotaran has been developed on top of the NetBeans rich client platform and communicates with R through the Java-to-R interface Rserve. The background and the functionality of the application are described here. In addition, the design, development and implementation process of Glotaran is documented in a generic way.
 The Life Detection Knowledge Base (LDKB; https://lifedetectionforum.com/ldkb) is a community-owned web resource that is designed to facilitate the infusion of astrobiology knowledge and expertise into the conceptualization and design of life detection missions. The aim of the LDKB is to gather and organize diverse knowledge from a range of fields into a common reference frame to support mission science risk assessment, specifically in terms of the potential for false positive and false negative results when pursuing a particular observation strategy. Within the LDKB, knowledge sourced from the primary scientific literature is organized according to (1) a taxonomic classification scheme in which potential biosignatures are defined at a uniform level of granularity that corresponds to observable physical or chemical quantities, qualities, or states; (2) a set of four standard assessment criteria, uniformly applied to each potential biosignature, that target the factors that contribute to false positive and false negative potential; and (3) a discourse format that utilizes customizable, user-defined "arguments" to represent the essential aspects of relevant scientific literature in terms of their specific bearing on one of the four assessment criteria, and thereby on false positive and false negative potential. By mapping available and newly emerging knowledge into this standardized framework, we can identify areas where the current state of knowledge supports a well-informed science risk assessment as well as critical knowledge gaps where focused research could help flesh out and mature promising life detection approaches.
Biofilms are groups of microbes that live together in dense communities, often attached to a surface. They play an outsized role in all aspects of microbial life, from chronic infections to biofouling to dental decay. In recent decades, appreciation for the diversity of roles that biofilms play in the environment has grown. Yet, most bacterial studies still rely upon approaches developed in the 19th century and center on planktonic populations alone. Here we present a chemostat-based experimental platform to investigate not only biofilms themselves, but how they interact with their surrounding environments. Our results show that biofilms grow to larger sizes in chemostats as opposed to flasks. In addition, we show that biofilms may be a consistent source of migrants into planktonic populations. We also show that secondary biofilms rapidly develop, although these may be more susceptible to environmental conditions. Taken together, our data suggest that chemostats may be a flexible and insightful platform for the study of biofilms in vitro. IMPORTANCE Biofilms are the predominant way that bacteria live in natural environments and are characterized by three emergent properties: ubiquity, resilience, and impact. They can be found across all environments, both natural and human-made, and across all of recorded time, dating back at least 3.5 billion years. Biofilms also represent a major economic impact of over $5 trillion annually. Yet, most of what is known about bacteria is the result of studies using planktonically growing liquid cultures under laboratory conditions. Here, we propose a comprehensive experimental platform that allows for study of biofilms on the molecular, organismal, and community levels using chemostats.
The climate crisis and rising energy costs have motivated us to analyze the energy consumed by the operation of the molecular beam epitaxy (MBE) lab at our institute. This research lab houses 13 growth chambers for different material systems. Its operation requires in total 156 kW. 31% are needed for cooling water, 20% for cryo pumps, 17% for liquid nitrogen (based on the energy required for liquefaction), and 17% for ventilation. At 15%, the electricity required for powering the actual MBE systems in fact amounts to the smallest category. In order to save energy, on the one hand we improved operation, i.e. we reduced the ventilation during off-hours and increased efforts to switch off liquid nitrogen and cryo pumps for those times. On the other hand, we changed hardware. In particular, we removed ventilation filters and replaced cryo pumps by turbo pumps because the latter consume only one tenth of the electricity at the low gas loads common for MBE. This analysis is partially specific to our facilities but may still provide inspiration and guidance to reduce the energy consumption in other MBE labs.
Artificial Intelligence is becoming increasingly embedded in various areas of human life, offering new capabilities that go beyond traditional software systems. Unlike conventional programs that follow fixed instructions, AI can generate its own solutions after processing large volumes of data. However, human input remains essential in designing AI architecture and setting its goals. While AI improves efficiency and decision-making across fields, it also introduces new types of risks. These risks often arise not from malicious intent, but from unpredictable system behavior and user errors. This paper analyzes such risks using a systems perspective and logistic S-curve modeling to examine the AI lifecycle. The analysis shows that the first three stages—development, scaling, and stabilization—carry the highest levels of vulnerability. Key issues include design flaws, insufficient debugging, and lack of continuous monitoring. More advanced systems may evolve through multiple S-curve phases, each introducing new challenges. The study emphasizes the need for stronger legal and ethical standards, drawing on regulatory efforts from the EU, USA, UK, Germany, and France. International cooperation is also highlighted as a key factor in ensuring that AI develops safely and responsibly.
 Machine Learning (ML) is increasingly applied across various domains, addressing tasks such as predictive analytics, anomaly detection, and decision-making. Many of these applications share similar underlying tasks, offering potential for systematic reuse. However, existing reuse in ML is often fragmented, small-scale, and ad hoc, focusing on isolated components such as pretrained models or datasets without a cohesive framework. Product Line Engineering (PLE) is a well-established approach for achieving large-scale systematic reuse in traditional engineering. It enables efficient management of core assets like requirements, models, and code across product families. However, traditional PLE is not designed to accommodate ML-specific assets—such as datasets, feature pipelines, and hyperparameters—and is not aligned with the iterative, data-driven workflows of ML systems. To address this gap, we propose Machine Learning Product Line Engineering (ML PLE), a framework that adapts PLE principles for ML systems. In contrast to conventional ML reuse methods such as transfer learning or fine-tuning, our framework introduces a systematic, variability-aware reuse approach that spans the entire lifecycle of ML development, including datasets, pipelines, models, and configuration assets. The proposed framework introduces the key requirements for ML PLE and the lifecycle process tailored to machine-learning-intensive systems. We illustrate the approach using an industrial case study in the context of space systems, where ML PLE is applied for data analytics of satellite missions.
The adoption of Electronic Lab Notebooks (ELNs) significantly enhances research operations by enabling the streamlined capture, storage, and dissemination of data. This promotes collaboration and ensures organised and efficient access to critical research information. Microsoft SharePoint® (SP) is an established, widely used, web-based platform with advanced collaboration capabilities. This study investigates whether SP can meet the needs of engineering research projects, particularly in a collaborative environment. The paper outlines the process of adapting SP into an ELN tool and evaluates its effectiveness compared to established ELN systems. The evaluation considers several categories related to data management, ranging from data collection to publication. Six distinct application scenarios are analysed, representing a spectrum of collaborative research projects, ranging from small-scale initiatives with minimal processes and data to large-scale, complex projects with extensive data requirements. The results indicate that SP is competitive in relation with established ELN tools, ranking second among the six alternatives evaluated. The adapted version of SP proves particularly effective for managing data in engineering research projects involving both academic and industrial partners, accommodating datasets for around 1000 samples. The practical implementation of SP is demonstrated through a collaborative engineering research project, showing its use in everyday research tasks such as data documentation, workflow automation, and data export. The study highlights the benefits and usability of the adapted SP version, including its support for regulatory compliance and reproducibility in research workflows. In addition, limitations and lessons learned are discussed, providing insights into the potential and challenges of using SP as an ELN tool in collaborative research projects.
Scientific workflow reproducibility for hydrological and environmental analyses remains a challenge due to the heterogeneity of data sources, analysis protocols, and evolving visualization needs. This study introduces HydroBlox, a client-side browser-based framework that supports the creation, execution, and export of hydrological workflows using a visual programming interface. The platform integrates modular web libraries to perform data retrieval, statistical analysis, and visualization directly in the browser. Two case studies are presented: analyzing precipitation-streamflow response relationships in the Iowa River Basin, and computing the Standardized Precipitation Index using a WebAssembly-enhanced drought analysis workflow. Results demonstrate the system's capacity to facilitate reproducible, portable, and extensible hydrological analyses across spatial and temporal scales. The study discusses the architecture, implementation, and capabilities of the system and explores its implications for collaborative research, education, and low-code scientific computing in hydrology.
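As an illustrative sketch of the core computation in the second case study (this is the standard Standardized Precipitation Index recipe in Python with synthetic data, not HydroBlox code): fit a gamma distribution to aggregated precipitation totals and map the fitted cumulative probabilities onto a standard normal scale.

    # Hedged sketch: standard SPI recipe on synthetic monthly totals.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    precip = rng.gamma(shape=2.0, scale=30.0, size=240)  # synthetic data

    # Fit a two-parameter gamma (location fixed at zero) to the totals.
    shape, loc, scale = stats.gamma.fit(precip, floc=0)

    # SPI = standard-normal quantile of the fitted cumulative probability.
    spi = stats.norm.ppf(stats.gamma.cdf(precip, shape, loc=loc, scale=scale))
    print(spi[:5].round(2))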
It is well established that groundwater flows in three-dimensional (3-D) flow systems which can be simple or complex. It is obvious then that the study of groundwater requires data collected from a variety of locations distributed in three dimensions. The design, placement and operation of appropriate instruments for collecting the necessary data poses a variety of difficult challenges which can affect the required budget and schedule, the quality of the resulting data and the quality of the conclusions drawn based on the data. Based on the authors' decades of experience, this book investigates some of the important methods and approaches available to the hydrogeologic practitioner to achieve collection of high quality data and to avoid the common pitfalls of incomplete or incorrect data with the associated risks of confirmation bias in decision making. Conclusions are drawn with regard to the correct implementation of multilevel systems (MLS) for collection of high quality data on which to base defensible scientific interpretation and engineering designs.