Complex systems, composed at the most basic level of units and their interactions, describe phenomena in a wide variety of domains, from neuroscience to computer science and economics. The wide …
Complex systems, composed at the most basic level of units and their interactions, describe phenomena in a wide variety of domains, from neuroscience to computer science and economics. The wide variety of applications has resulted in two key challenges: the generation of many domain-specific strategies for complex systems analyses that are seldom revisited, and the compartmentalization of representation and analysis ideas within a domain due to inconsistency in complex systems language. In this work we propose basic, domain-agnostic language in order to advance toward a more cohesive vocabulary. We use this language to evaluate each step of the complex systems analysis pipeline, beginning with the system under study and data collected, then moving through different mathematical frameworks for encoding the observed data (i.e., graphs, simplicial complexes, and hypergraphs), and relevant computational methods for each framework. At each step we consider different types of dependencies; these are properties of the system that describe how the existence of an interaction among a set of units in a system may affect the possibility of the existence of another relation. We discuss how dependencies may arise and how they may alter the interpretation of results or the entirety of the analysis pipeline. We close with two real-world examples using coauthorship data and email communications data that illustrate how the system under study, the dependencies therein, the research question, and the choice of mathematical representation influence the results. We hope this work can serve as an opportunity for reflection for experienced complex systems scientists, as well as an introductory resource for new researchers.
Given a set of people and a set of events attended by them, we address the problem of measuring connectedness or tie strength between each pair of persons. The underlying …
Given a set of people and a set of events attended by them, we address the problem of measuring connectedness or tie strength between each pair of persons. The underlying assumption is that attendance at mutual events gives an implicit social network between people. We take an axiomatic approach to this problem. Starting from a list of axioms, which a measure of tie strength must satisfy, we characterize functions that satisfy all the axioms. We then show that there is a range of tie-strength measures that satisfy this characterization.
The brain is a complex system comprising a myriad of interacting elements, posing significant challenges in understanding its structure, function, and dynamics. Network science has emerged as a powerful tool …
The brain is a complex system comprising a myriad of interacting elements, posing significant challenges in understanding its structure, function, and dynamics. Network science has emerged as a powerful tool for studying such intricate systems, offering a framework for integrating multiscale data and complexity. Here, we discuss the application of network science in the study of the brain, addressing topics such as network models and metrics, the connectome, and the role of dynamics in neural networks. We explore the challenges and opportunities in integrating multiple data streams for understanding the neural transitions from development to healthy function to disease, and discuss the potential for collaboration between network science and neuroscience communities. We underscore the importance of fostering interdisciplinary opportunities through funding initiatives, workshops, and conferences, as well as supporting students and postdoctoral fellows with interests in both disciplines. By uniting the network science and neuroscience communities, we can develop novel network-based methods tailored to neural circuits, paving the way towards a deeper understanding of the brain and its functions.
Related DatabasesWeb of Science You must be logged in with an active subscription to view this.Article DataHistorySubmitted: 13 July 2020Accepted: 12 January 2021Published online: 13 May 2021Keywordsnonbacktracking, node immunization, centrality …
Related DatabasesWeb of Science You must be logged in with an active subscription to view this.Article DataHistorySubmitted: 13 July 2020Accepted: 12 January 2021Published online: 13 May 2021Keywordsnonbacktracking, node immunization, centrality measure, complex networksAMS Subject Headings05C50, 05C82, 05C85, 68R10Publication DataISSN (online): 2577-0187Publisher: Society for Industrial and Applied MathematicsCODEN: sjmdaq
Complex systems thinking is applied to a wide variety of domains, from neuroscience to computer science and economics. The wide variety of implementations has resulted in two key challenges: the …
Complex systems thinking is applied to a wide variety of domains, from neuroscience to computer science and economics. The wide variety of implementations has resulted in two key challenges: the progenation of many domain-specific strategies that are seldom revisited or questioned, and the siloing of ideas within a domain due to inconsistency of complex systems language. In this work we offer basic, domain-agnostic language in order to advance towards a more cohesive vocabulary. We use this language to evaluate each step of the complex systems analysis pipeline, beginning with the system and data collected, then moving through different mathematical formalisms for encoding the observed data (i.e. graphs, simplicial complexes, and hypergraphs), and relevant computational methods for each formalism. At each step we consider different types of \emph{dependencies}; these are properties of the system that describe how the existence of one relation among the parts of a system may influence the existence of another relation. We discuss how dependencies may arise and how they may alter interpretation of results or the entirety of the analysis pipeline. We close with two real-world examples using coauthorship data and email communications data that illustrate how the system under study, the dependencies therein, the research question, and choice of mathematical representation influence the results. We hope this work can serve as an opportunity of reflection for experienced complexity scientists, as well as an introductory resource for new researchers.
Machine learning (ML) started to become widely deployed in cyber security settings for shortening the detection cycle of cyber attacks. To date, most ML-based systems are either proprietary or make …
Machine learning (ML) started to become widely deployed in cyber security settings for shortening the detection cycle of cyber attacks. To date, most ML-based systems are either proprietary or make specific choices of feature representations and machine learning models. The success of these techniques is difficult to assess as public benchmark datasets are currently unavailable. In this paper, we provide concrete guidelines and recommendations for using supervised ML in cyber security. As a case study, we consider the problem of botnet detection from network traffic data. Among our findings we highlight that: (1) feature representations should take into consideration attack characteristics; (2) ensemble models are well-suited to handle class imbalance; (3) the granularity of ground truth plays an important role in the success of these methods.
The unsupervised detection of anomalies in time series data has important applications in user behavioral modeling, fraud detection, and cybersecurity. Anomaly detection has, in fact, been extensively studied in categorical …
The unsupervised detection of anomalies in time series data has important applications in user behavioral modeling, fraud detection, and cybersecurity. Anomaly detection has, in fact, been extensively studied in categorical sequences. However, we often have access to time series data that represent paths through networks. Examples include transaction sequences in financial networks, click streams of users in networks of cross-referenced documents, or travel itineraries in transportation networks. To reliably detect anomalies, we must account for the fact that such data contain a large number of independent observations of paths constrained by a graph topology. Moreover, the heterogeneity of real systems rules out frequency-based anomaly detection techniques, which do not account for highly skewed edge and degree statistics. To address this problem, we introduce HYPA, a novel framework for the unsupervised detection of anomalies in large corpora of variable-length temporal paths in a graph. HYPA provides an efficient analytical method to detect paths with anomalous frequencies that result from nodes being traversed in unexpected chronological order.
Graph embedding seeks to build a low-dimensional representation of a graph G. This low-dimensional representation is then used for various downstream tasks. One popular approach is Laplacian Eigenmaps, which constructs …
Graph embedding seeks to build a low-dimensional representation of a graph G. This low-dimensional representation is then used for various downstream tasks. One popular approach is Laplacian Eigenmaps, which constructs a graph embedding based on the spectral properties of the Laplacian matrix of G. The intuition behind it, and many other embedding techniques, is that the embedding of a graph must respect node similarity: similar nodes must have embeddings that are close to one another. Here, we dispose of this distance-minimization assumption. Instead, we use the Laplacian matrix to find an embedding with geometric properties instead of spectral ones, by leveraging the so-called simplex geometry of G. We introduce a new approach, Geometric Laplacian Eigenmap Embedding (or GLEE for short), and demonstrate that it outperforms various other techniques (including Laplacian Eigenmaps) in the tasks of graph reconstruction and link prediction.
The non-backtracking matrix and its eigenvalues have many applications in network science and graph mining, such as node and edge centrality, community detection, length spectrum theory, graph distance, and epidemic …
The non-backtracking matrix and its eigenvalues have many applications in network science and graph mining, such as node and edge centrality, community detection, length spectrum theory, graph distance, and epidemic and percolation thresholds. Moreover, in network epidemiology, the reciprocal of the largest eigenvalue of the non-backtracking matrix is a good approximation for the epidemic threshold of certain network dynamics. In this work, we develop techniques that identify which nodes have the largest impact on the leading non-backtracking eigenvalue. We do so by studying the behavior of the spectrum of the non-backtracking matrix after a node is removed from the graph. From this analysis we derive two new centrality measures: X-degree and X-non-backtracking centrality. We perform extensive experimentation with targeted immunization strategies derived from these two centrality measures. Our spectral analysis and centrality measures can be broadly applied, and will be of interest to both theorists and practitioners alike.
An important task for Homeland Security is the prediction of threat vulnerabilities, such as through the detection of relationships between seemingly disjoint entities. A structure used for this task is …
An important task for Homeland Security is the prediction of threat vulnerabilities, such as through the detection of relationships between seemingly disjoint entities. A structure used for this task is a "semantic graph", also known as a "relational data graph" or an "attributed relational graph". These graphs encode relationships as "typed" links between a pair of "typed" nodes. Indeed, semantic graphs are very similar to semantic networks used in AI. The node and link types are related through an ontology graph (also known as a schema). Furthermore, each node has a set of attributes associated with it (e.g., "age" may be an attribute of a node of type "person"). Unfortunately, the selection of types and attributes for both nodes and links depends on human expertise and is somewhat subjective and even arbitrary. This subjectiveness introduces biases into any algorithm that operates on semantic graphs. Here, we raise some knowledge representation issues for semantic graphs and provide some possible solutions using recently developed ideas in the field of complex networks. In particular, we use the concept of transitivity to evaluate the relevance of individual links in the semantic graph for detecting relationships. We also propose new statistical measures for semantic graphs and illustrate these semantic measures on graphs constructed from movies and terrorism data.
Networked representations of real-world phenomena are often partially observed, which lead to incomplete networks. Analysis of such incomplete networks can lead to skewed results. We examine the following problem: given …
Networked representations of real-world phenomena are often partially observed, which lead to incomplete networks. Analysis of such incomplete networks can lead to skewed results. We examine the following problem: given an incomplete network, which $b$ nodes should be probed to bring the largest number of new nodes into the observed network? Many graph-mining tasks require having observed a considerable amount of the network. Examples include community discovery, belief propagation, influence maximization, etc. For instance, consider someone who has observed a portion (say 1%) of the Twitter retweet network via random tweet sampling. She wants to estimate the size of the largest connected component of the fully observed retweet network. To improve her estimate, how should she use her limited budget to reduce the incompleteness of the network? In this work, we propose a novel algorithm, called MaxOutProbe, which uses a budget $b$ (on nodes probed) to increase the size of the observed network in terms of the number of nodes. Our experiments, across a range of datasets and conditions, demonstrate the advantages of MaxOutProbe over existing methods.
The COVID-19 pandemic offers an unprecedented natural experiment providing insights into the emergence of collective behavioral changes of both exogenous (government mandated) and endogenous (spontaneous reaction to infection risks) origin. …
The COVID-19 pandemic offers an unprecedented natural experiment providing insights into the emergence of collective behavioral changes of both exogenous (government mandated) and endogenous (spontaneous reaction to infection risks) origin. Here, we characterize collective physical distancing—mobility reductions, minimization of contacts, shortening of contact duration—in response to the COVID-19 pandemic in the pre-vaccine era by analyzing de-identified, privacy-preserving location data for a panel of over 5.5 million anonymized, opted-in U.S. devices. We define five indicators of users’ mobility and proximity to investigate how the emerging collective behavior deviates from typical pre-pandemic patterns during the first nine months of the COVID-19 pandemic. We analyze both the dramatic changes due to the government mandated mitigation policies and the more spontaneous societal adaptation into a new (physically distanced) normal in the fall 2020. Using the indicators here defined we show that: a) during the COVID-19 pandemic, collective physical distancing displayed different phases and was heterogeneous across geographies, b) metropolitan areas displayed stronger reductions in mobility and contacts than rural areas; c) stronger reductions in commuting patterns are observed in geographical areas with a higher share of teleworkable jobs; d) commuting volumes during and after the lockdown period negatively correlate with unemployment rates; and e) increases in contact indicators correlate with future values of new deaths at a lag consistent with epidemiological parameters and surveillance reporting delays. In conclusion, this study demonstrates that the framework and indicators here presented can be used to analyze large-scale social distancing phenomena, paving the way for their use in future pandemics to analyze and monitor the effects of pandemic mitigation plans at the national and international levels.
Abstract Complex networks are often either too large for full exploration, partially accessible, or partially observed. Downstream learning tasks on these incomplete networks can produce low quality results. In addition, …
Abstract Complex networks are often either too large for full exploration, partially accessible, or partially observed. Downstream learning tasks on these incomplete networks can produce low quality results. In addition, reducing the incompleteness of the network can be costly and nontrivial. As a result, network discovery algorithms optimized for specific downstream learning tasks given resource collection constraints are of great interest. In this paper, we formulate the task-specific network discovery problem as a sequential decision-making problem. Our downstream task is selective harvesting, the optimal collection of vertices with a particular attribute. We propose a framework, called network actor critic (NAC), which learns a policy and notion of future reward in an offline setting via a deep reinforcement learning algorithm. The NAC paradigm utilizes a task-specific network embedding to reduce the state space complexity. A detailed comparative analysis of popular network embeddings is presented with respect to their role in supporting offline planning. Furthermore, a quantitative study is presented on various synthetic and real benchmarks using NAC and several baselines. We show that offline models of reward and network discovery policies lead to significantly improved performance when compared to competitive online discovery algorithms. Finally, we outline learning regimes where planning is critical in addressing sparse and changing reward signals.
Recent self-propagating malware (SPM) campaigns compromised hundred of thousands of victim machines on the Internet. It is challenging to detect these attacks in their early stages, as adversaries utilize common …
Recent self-propagating malware (SPM) campaigns compromised hundred of thousands of victim machines on the Internet. It is challenging to detect these attacks in their early stages, as adversaries utilize common network services, use novel techniques, and can evade existing detection mechanisms. We propose PorTFILER (PORT-Level Network Traffic ProFILER), a new machine learning system applied to network traffic for detecting SPM attacks. PORTFILER extracts port-level features from the Zeek connection logs collected at a border of a monitored network, applies anomaly detection techniques to identify suspicious events, and ranks the alerts across ports for investigation by the Security Operations Center (SOC). We propose a novel ensemble methodology for aggregating individual models in PORTFILER that increases resilience against several evasion strategies compared to standard ML baselines. We extensively evaluate PorTFILER on traffic collected from two university networks, and show that it can detect SPM attacks with different patterns, such as WannaCry and Mirai, and performs well under evasion. Ranking across ports achieves precision over 0.94 and false positive rates below $8 \times 10^{-4}$ in the top 100 highly ranked alerts. When deployed on the university networks, PorTFILER detected anomalous SPM-like activity on one of the campus networks, confirmed by the university SOC as malicious. PortFILER also detected a Mirai attack recreated on the two university networks with higher precision and recall than deep-learning based autoencoder methods.
The cyber-threat landscape has evolved tremendously in recent years, with new threat variants emerging daily, and large-scale coordinated campaigns becoming more prevalent. In this study, we propose CELEST (CollaborativE LEarning …
The cyber-threat landscape has evolved tremendously in recent years, with new threat variants emerging daily, and large-scale coordinated campaigns becoming more prevalent. In this study, we propose CELEST (CollaborativE LEarning for Scalable Threat detection, a federated machine learning framework for global threat detection over HTTP, which is one of the most commonly used protocols for malware dissemination and communication. CELEST leverages federated learning in order to collaboratively train a global model across multiple clients who keep their data locally, thus providing increased privacy and confidentiality assurances. Through a novel active learning component integrated with the federated learning technique, our system continuously discovers and learns the behavior of new, evolving, and globally-coordinated cyber threats. We show that CELEST is able to expose attacks that are largely invisible to individual organizations. For instance, in one challenging attack scenario with data exfiltration malware, the global model achieves a three-fold increase in Precision-Recall AUC compared to the local model. We also design a poisoning detection and mitigation method, DTrust, specifically designed for federated learning in the collaborative threat detection domain. DTrust successfully detects poisoning clients using the feedback from participating clients to investigate and remove them from the training process. We deploy CELEST on two university networks and show that it is able to detect the malicious HTTP communication with high precision and low false positive rates. Furthermore, during its deployment, CELEST detected a set of previously unknown 42 malicious URLs and 20 malicious domains in one day, which were confirmed to be malicious by VirusTotal.
We present RAWLSNET, a system for altering Bayesian Network (BN) models to satisfy the Rawlsian principle of fair equality of opportunity (FEO). RAWLSNET's BN models generate aspirational data distributions: data …
We present RAWLSNET, a system for altering Bayesian Network (BN) models to satisfy the Rawlsian principle of fair equality of opportunity (FEO). RAWLSNET's BN models generate aspirational data distributions: data generated to reflect an ideally fair, FEO-satisfying society. FEO states that everyone with the same talent and willingness to use it should have the same chance of achieving advantageous social positions (e.g., employment), regardless of their background circumstances (e.g., socioeconomic status). Satisfying FEO requires alterations to social structures such as school assignments. Our paper describes RAWLSNET, a method which takes as input a BN representation of an FEO application and alters the BN's parameters so as to satisfy FEO when possible, and minimize deviation from FEO otherwise. We also offer guidance for applying RAWLSNET, including on recognizing proper applications of FEO. We demonstrate the use of RAWLSNET with publicly available data sets. RAWLSNET's altered BNs offer the novel capability of generating aspirational data for FEO-relevant tasks. Aspirational data are free from biases of real-world data, and thus are useful for recognizing and detecting sources of unfairness in machine learning algorithms besides biased data.
Abstract The structure of complex networks can be characterized by counting and analysing network motifs. Motifs are small graph structures that occur repeatedly in a network, such as triangles or …
Abstract The structure of complex networks can be characterized by counting and analysing network motifs. Motifs are small graph structures that occur repeatedly in a network, such as triangles or chains. Recent work has generalized motifs to temporal and dynamic network data. However, existing techniques do not generalize to sequential or trajectory data, which represent entities moving through the nodes of a network, such as passengers moving through transportation networks. The unit of observation in these data is fundamentally different since we analyse observations of trajectories (e.g. a trip from airport A to airport C through airport B), rather than independent observations of edges or snapshots of graphs over time. In this work, we define sequential motifs in trajectory data, which are small, directed and sequence-ordered graphs corresponding to patterns in observed sequences. We draw a connection between the counting and analysis of sequential motifs and Higher-Order Network (HON) models. We show that by mapping edges of a HON, specifically a $k$th-order DeBruijn graph, to sequential motifs, we can count and evaluate their importance in observed data. We test our methodology with two datasets: (1) passengers navigating an airport network and (2) people navigating the Wikipedia article network. We find that the most prevalent and important sequential motifs correspond to intuitive patterns of traversal in the real systems and show empirically that the heterogeneity of edge weights in an observed higher-order DeBruijn graph has implications for the distributions of sequential motifs we expect to see across our null models.
Given a set of k networks, possibly with different sizes and no overlaps in nodes or edges, how can we quickly assess similarity between them, without solving the node-correspondence problem? …
Given a set of k networks, possibly with different sizes and no overlaps in nodes or edges, how can we quickly assess similarity between them, without solving the node-correspondence problem? Analogously, how can we extract a small number of descriptive, numerical features from each graph that effectively serve as the graph's "signature"? Having such features will enable a wealth of graph mining tasks, including clustering, outlier detection, visualization, etc. We propose NetSimile -- a novel, effective, and scalable method for solving the aforementioned problem. NetSimile has the following desirable properties: (a) It gives similarity scores that are size-invariant. (b) It is scalable, being linear on the number of edges for "signature" vector extraction. (c) It does not need to solve the node-correspondence problem. We present extensive experiments on numerous synthetic and real graphs from disparate domains, and show NetSimile's superiority over baseline competitors. We also show how NetSimile enables several mining tasks such as clustering, visualization, discontinuity detection, network transfer learning, and re-identification across networks.
The issue of bias (i.e., systematic unfairness) in machine learning models has recently attracted the attention of both researchers and practitioners. For the graph mining community in particular, an important …
The issue of bias (i.e., systematic unfairness) in machine learning models has recently attracted the attention of both researchers and practitioners. For the graph mining community in particular, an important goal toward algorithmic fairness is to detect and mitigate bias incorporated into graph embeddings since they are commonly used in human-centered applications, e.g., social-media recommendations. However, simple analytical methods for detecting bias typically involve aggregate statistics which do not reveal the sources of unfairness. Instead, visual methods can provide a holistic fairness characterization of graph embeddings and help uncover the causes of observed bias. In this work, we present BIAS <inf xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">COPE</inf> , an interactive visualization tool that supports end-to-end visual unfairness diagnosis for graph embeddings. The tool is the product of a design study in collaboration with domain experts. It allows the user to (i) visually compare two embeddings with respect to fairness, (ii) locate nodes or graph communities that are unfairly embedded, and (iii) understand the source of bias by interactively linking the relevant embedding subspace with the corresponding graph topology. Experts' feedback confirms that our tool is effective at detecting and diagnosing unfairness. Thus, we envision our tool both as a companion for researchers in designing their algorithms as well as a guide for practitioners who use off-the-shelf graph embeddings.
Whether comparing networks to each other or to random expectation, measuring dissimilarity is essential to understanding the complex phenomena under study. However, determining the structural dissimilarity between networks is an …
Whether comparing networks to each other or to random expectation, measuring dissimilarity is essential to understanding the complex phenomena under study. However, determining the structural dissimilarity between networks is an ill-defined problem, as there is no canonical way to compare two networks. Indeed, many of the existing approaches for network comparison differ in their heuristics, efficiency, interpretability, and theoretical soundness. Thus, having a notion of distance that is built on theoretically robust first principles and that is interpretable with respect to features ubiquitous in complex networks would allow for a meaningful comparison between different networks. Here we introduce a theoretically sound and efficient new measure of graph distance, based on the "length spectrum" function from algebraic topology, which compares the structure of two undirected, unweighted graphs by considering their non-backtracking cycles. We show how this distance relates to structural features such as presence of hubs and triangles through the behavior of the eigenvalues of the so-called non-backtracking matrix, and we showcase its ability to discriminate between networks in both real and synthetic data sets. By taking a topological interpretation of non-backtracking cycles, this work presents a novel application of Topological Data Analysis to the study of complex networks.
Many real-world prediction tasks have outcome variables that have characteristic heavy-tail distributions. Examples include copies of books sold, auction prices of art pieces, demand for commodities in warehouses, etc. By …
Many real-world prediction tasks have outcome variables that have characteristic heavy-tail distributions. Examples include copies of books sold, auction prices of art pieces, demand for commodities in warehouses, etc. By learning heavy-tailed distributions, big and rare instances (e.g., the best-sellers) will have accurate predictions. Most existing approaches are not dedicated to learning heavy-tailed distribution; thus, they heavily under-predict such instances. To tackle this problem, we introduce Learning to Place (L2P), which exploits the pairwise relationships between instances for learning. In its training phase, L2P learns a pairwise preference classifier: is instance A > instance B? In its placing phase, L2P obtains a prediction by placing the new instance among the known instances. Based on its placement, the new instance is then assigned a value for its outcome variable. Experiments on real data show that L2P outperforms competing approaches in terms of accuracy and ability to reproduce heavy-tailed outcome distribution. In addition, L2P provides an interpretable model by placing each predicted instance in relation to its comparable neighbors. Interpretable models are highly desirable when lives and treasure are at stake.
Networked representations of real-world phenomena are often partially observed, which lead to incomplete networks. Analysis of such incomplete networks can lead to skewed results. We examine the following problem: given …
Networked representations of real-world phenomena are often partially observed, which lead to incomplete networks. Analysis of such incomplete networks can lead to skewed results. We examine the following problem: given an incomplete network, which $b$ nodes should be probed to bring the largest number of new nodes into the observed network? Many graph-mining tasks require having observed a considerable amount of the network. Examples include community discovery, belief propagation, influence maximization, etc. For instance, consider someone who has observed a portion (say 1%) of the Twitter retweet network via random tweet sampling. She wants to estimate the size of the largest connected component of the fully observed retweet network. To improve her estimate, how should she use her limited budget to reduce the incompleteness of the network? In this work, we propose a novel algorithm, called MaxOutProbe, which uses a budget $b$ (on nodes probed) to increase the size of the observed network in terms of the number of nodes. Our experiments, across a range of datasets and conditions, demonstrate the advantages of MaxOutProbe over existing methods.
Over the past decade, machine learning has revolutionized computers' ability to analyze text through flexible computational models. Due to their structural similarity to written language, transformer-based architectures have also shown …
Over the past decade, machine learning has revolutionized computers' ability to analyze text through flexible computational models. Due to their structural similarity to written language, transformer-based architectures have also shown promise as tools to make sense of a range of multi-variate sequences from protein-structures, music, electronic health records to weather-forecasts. We can also represent human lives in a way that shares this structural similarity to language. From one perspective, lives are simply sequences of events: People are born, visit the pediatrician, start school, move to a new location, get married, and so on. Here, we exploit this similarity to adapt innovations from natural language processing to examine the evolution and predictability of human lives based on detailed event sequences. We do this by drawing on arguably the most comprehensive registry data in existence, available for an entire nation of more than six million individuals across decades. Our data include information about life-events related to health, education, occupation, income, address, and working hours, recorded with day-to-day resolution. We create embeddings of life-events in a single vector space showing that this embedding space is robust and highly structured. Our models allow us to predict diverse outcomes ranging from early mortality to personality nuances, outperforming state-of-the-art models by a wide margin. Using methods for interpreting deep learning models, we probe the algorithm to understand the factors that enable our predictions. Our framework allows researchers to identify new potential mechanisms that impact life outcomes and associated possibilities for personalized interventions.
The rise of large-scale socio-technical systems in which humans interact with artificial intelligence (AI) systems (including assistants and recommenders, in short AIs) multiplies the opportunity for the emergence of collective …
The rise of large-scale socio-technical systems in which humans interact with artificial intelligence (AI) systems (including assistants and recommenders, in short AIs) multiplies the opportunity for the emergence of collective phenomena and tipping points, with unexpected, possibly unintended, consequences. For example, navigation systems' suggestions may create chaos if too many drivers are directed on the same route, and personalised recommendations on social media may amplify polarisation, filter bubbles, and radicalisation. On the other hand, we may learn how to foster the "wisdom of crowds" and collective action effects to face social and environmental challenges. In order to understand the impact of AI on socio-technical systems and design next-generation AIs that team with humans to help overcome societal problems rather than exacerbate them, we propose to build the foundations of Social AI at the intersection of Complex Systems, Network Science and AI. In this perspective paper, we discuss the main open questions in Social AI, outlining possible technical and scientific challenges and suggesting research avenues.
The first Workshop on Integrity in Social Networks and Media is held in conjunction with the 13th ACM Conference on Web Search and Data Mining (WSDM) in Houston, Texas, USA. …
The first Workshop on Integrity in Social Networks and Media is held in conjunction with the 13th ACM Conference on Web Search and Data Mining (WSDM) in Houston, Texas, USA. The goal of the workshop is to bring together researchers and practitioners to discuss content and interaction integrity challenges in social networks and social media platforms.
Complex systems thinking is applied to a wide variety of domains, from neuroscience to computer science and economics. The wide variety of implementations has resulted in two key challenges: the …
Complex systems thinking is applied to a wide variety of domains, from neuroscience to computer science and economics. The wide variety of implementations has resulted in two key challenges: the progenation of many domain-specific strategies that are seldom revisited or questioned, and the siloing of ideas within a domain due to inconsistency of complex systems language. In this work we offer basic, domain-agnostic language in order to advance towards a more cohesive vocabulary. We use this language to evaluate each step of the complex systems analysis pipeline, beginning with the system and data collected, then moving through different mathematical formalisms for encoding the observed data (i.e. graphs, simplicial complexes, and hypergraphs), and relevant computational methods for each formalism. At each step we consider different types of \emph{dependencies}; these are properties of the system that describe how the existence of one relation among the parts of a system may influence the existence of another relation. We discuss how dependencies may arise and how they may alter interpretation of results or the entirety of the analysis pipeline. We close with two real-world examples using coauthorship data and email communications data that illustrate how the system under study, the dependencies therein, the research question, and choice of mathematical representation influence the results. We hope this work can serve as an opportunity of reflection for experienced complexity scientists, as well as an introductory resource for new researchers.
Vertex classification is vulnerable to perturbations of both graph topology and vertex attributes, as shown in recent research. As in other machine learning domains, concerns about robustness to adversarial manipulation …
Vertex classification is vulnerable to perturbations of both graph topology and vertex attributes, as shown in recent research. As in other machine learning domains, concerns about robustness to adversarial manipulation can prevent potential users from adopting proposed methods when the consequence of action is very high. This paper considers two topological characteristics of graphs and explores the way these features affect the amount the adversary must perturb the graph in order to be successful. We show that, if certain vertices are included in the training set, it is possible to substantially an adversary's required perturbation budget. On four citation datasets, we demonstrate that if the training set includes high degree vertices or vertices that ensure all unlabeled nodes have neighbors in the training set, we show that the adversary's budget often increases by a substantial factor---often a factor of 2 or more---over random training for the Nettack poisoning attack. Even for especially easy targets (those that are misclassified after just one or two perturbations), the degradation of performance is much slower, assigning much lower probabilities to the incorrect classes. In addition, we demonstrate that this robustness either persists when recently proposed defenses are applied, or is competitive with the resulting performance improvement for the defender.
Graph embedding seeks to build a low-dimensional representation of a graph G. This low-dimensional representation is then used for various downstream tasks. One popular approach is Laplacian Eigenmaps, which constructs …
Graph embedding seeks to build a low-dimensional representation of a graph G. This low-dimensional representation is then used for various downstream tasks. One popular approach is Laplacian Eigenmaps, which constructs a graph embedding based on the spectral properties of the Laplacian matrix of G. The intuition behind it, and many other embedding techniques, is that the embedding of a graph must respect node similarity: similar nodes must have embeddings that are close to one another. Here, we dispose of this distance-minimization assumption. Instead, we use the Laplacian matrix to find an embedding with geometric properties instead of spectral ones, by leveraging the so-called simplex geometry of G. We introduce a new approach, Geometric Laplacian Eigenmap Embedding (or GLEE for short), and demonstrate that it outperforms various other techniques (including Laplacian Eigenmaps) in the tasks of graph reconstruction and link prediction.
This is the fifth in a series of white papers providing a summary of the discussions and future directions that are derived from these topical meetings. This paper focuses on …
This is the fifth in a series of white papers providing a summary of the discussions and future directions that are derived from these topical meetings. This paper focuses on issues related to analysis and visual analytics. While these two topics are distinct, there are clear overlaps between the two. It is common to use different visualizations during analysis and given the sheer volume of social media data, visual analytic tools can be important during analysis, as well as during other parts of the research lifecycle. Choices about analysis may be informed by visualization plans and vice versa - both are key in communicating about a data set and what it means. We also recognized that each field of research has different analysis techniques and different levels of familiarity with visual analytics. Putting these two topics into the same meeting provided us with the opportunity to think about analysis and visual analytics/visualization in new, synergistic ways.
The brain is a complex system comprising a myriad of interacting elements, posing significant challenges in understanding its structure, function, and dynamics. Network science has emerged as a powerful tool …
The brain is a complex system comprising a myriad of interacting elements, posing significant challenges in understanding its structure, function, and dynamics. Network science has emerged as a powerful tool for studying such intricate systems, offering a framework for integrating multiscale data and complexity. Here, we discuss the application of network science in the study of the brain, addressing topics such as network models and metrics, the connectome, and the role of dynamics in neural networks. We explore the challenges and opportunities in integrating multiple data streams for understanding the neural transitions from development to healthy function to disease, and discuss the potential for collaboration between network science and neuroscience communities. We underscore the importance of fostering interdisciplinary opportunities through funding initiatives, workshops, and conferences, as well as supporting students and postdoctoral fellows with interests in both disciplines. By uniting the network science and neuroscience communities, we can develop novel network-based methods tailored to neural circuits, paving the way towards a deeper understanding of the brain and its functions.
Identifying shortest paths between nodes in a network is a common graph analysis problem that is important for many applications involving routing of resources. An adversary that can manipulate the …
Identifying shortest paths between nodes in a network is a common graph analysis problem that is important for many applications involving routing of resources. An adversary that can manipulate the graph structure could alter traffic patterns to gain some benefit (e.g., make more money by directing traffic to a toll road). This article presents the Force Path Cut problem, in which an adversary removes edges from a graph to make a particular path the shortest between its terminal nodes. We prove that the optimization version of this problem is APX-hard but introduce PATHATTACK , a polynomial-time approximation algorithm that guarantees a solution within a logarithmic factor of the optimal value. In addition, we introduce the Force Edge Cut and Force Node Cut problems, in which the adversary targets a particular edge or node, respectively, rather than an entire path. We derive a nonconvex optimization formulation for these problems and derive a heuristic algorithm that uses PATHATTACK as a subroutine. We demonstrate all of these algorithms on a diverse set of real and synthetic networks, illustrating where the proposed algorithms provide the greatest improvement over baseline methods.
Role discovery in graphs is an emerging area that allows analysis of complex graphs in an intuitive way. In contrast to other graph prob- lems such as community discovery, which …
Role discovery in graphs is an emerging area that allows analysis of complex graphs in an intuitive way. In contrast to other graph prob- lems such as community discovery, which finds groups of highly connected nodes, the role discovery problem finds groups of nodes that share similar graph topological structure. However, existing work so far has two severe limitations that prevent its use in some domains. Firstly, it is completely unsupervised which is undesirable for a number of reasons. Secondly, most work is limited to a single relational graph. We address both these lim- itations in an intuitive and easy to implement alternating least squares framework. Our framework allows convex constraints to be placed on the role discovery problem which can provide useful supervision. In par- ticular we explore supervision to enforce i) sparsity, ii) diversity and iii) alternativeness. We then show how to lift this work for multi-relational graphs. A natural representation of a multi-relational graph is an order 3 tensor (rather than a matrix) and that a Tucker decomposition allows us to find complex interactions between collections of entities (E-groups) and the roles they play for a combination of relations (R-groups). Existing Tucker decomposition methods in tensor toolboxes are not suited for our purpose, so we create our own algorithm that we demonstrate is pragmatically useful.
Studies of networked phenomena, such as interactions in online social media, often rely on incomplete data, either because these phenomena are partially observed, or because the data is too large …
Studies of networked phenomena, such as interactions in online social media, often rely on incomplete data, either because these phenomena are partially observed, or because the data is too large or expensive to acquire all at once. Analysis of incomplete data leads to skewed or misleading results. In this paper, we investigate limitations of learning to complete partially observed networks via node querying. Concretely, we study the following problem: given (i) a partially observed network, (ii) the ability to query nodes for their connections (e.g., by accessing an API), and (iii) a budget on the number of such queries, sequentially learn which nodes to query in order to maximally increase observability. We call this querying process Network Online Learning and present a family of algorithms called NOL*. These algorithms learn to choose which partially observed node to query next based on a parameterized model that is trained online through a process of exploration and exploitation. Extensive experiments on both synthetic and real world networks show that (i) it is possible to sequentially learn to choose which nodes are best to query in a network and (ii) some macroscopic properties of networks, such as the degree distribution and modular structure, impact the potential for learning and the optimal amount of random exploration.
Vertex classification is vulnerable to perturbations of both graph topology and vertex attributes, as shown in recent research. As in other machine learning domains, concerns about robustness to adversarial manipulation …
Vertex classification is vulnerable to perturbations of both graph topology and vertex attributes, as shown in recent research. As in other machine learning domains, concerns about robustness to adversarial manipulation can prevent potential users from adopting proposed methods when the consequence of action is very high. This paper considers two topological characteristics of graphs and explores the way these features affect the amount the adversary must perturb the graph in order to be successful. We show that, if certain vertices are included in the training set, it is possible to substantially an adversary's required perturbation budget. On four citation datasets, we demonstrate that if the training set includes high degree vertices or vertices that ensure all unlabeled nodes have neighbors in the training set, we show that the adversary's budget often increases by a substantial factor---often a factor of 2 or more---over random training for the Nettack poisoning attack. Even for especially easy targets (those that are misclassified after just one or two perturbations), the degradation of performance is much slower, assigning much lower probabilities to the incorrect classes. In addition, we demonstrate that this robustness either persists when recently proposed defenses are applied, or is competitive with the resulting performance improvement for the defender.
Shortest paths in complex networks play key roles in many applications. Examples include routing packets in a computer network, routing traffic on a transportation network, and inferring semantic distances between …
Shortest paths in complex networks play key roles in many applications. Examples include routing packets in a computer network, routing traffic on a transportation network, and inferring semantic distances between concepts on the World Wide Web. An adversary with the capability to perturb the graph might make the shortest path between two nodes route traffic through advantageous portions of the graph (e.g., a toll road he owns). In this paper, we introduce the Force Path Cut problem, in which there is a specific route the adversary wants to promote by removing a minimum number of edges in the graph. We show that Force Path Cut is NP-complete, but also that it can be recast as an instance of the Weighted Set Cover problem, enabling the use of approximation algorithms. The size of the universe for the set cover problem is potentially factorial in the number of nodes. To overcome this hurdle, we propose the PATHATTACK algorithm, which via constraint generation considers only a small subset of paths -- at most 5% of the number of edges in 99% of our experiments. Across a diverse set of synthetic and real networks, the linear programming formulation of Weighted Set Cover yields the optimal solution in over 98% of cases. We also demonstrate a time/cost tradeoff using two approximation algorithms and greedy baseline methods. This work provides a foundation for addressing similar problems and expands the area of adversarial graph mining beyond recent work on node classification and embedding.
We present RAWLSNET, a system for altering Bayesian Network (BN) models to satisfy the Rawlsian principle of fair equality of opportunity (FEO). RAWLSNET's BN models generate aspirational data distributions: data …
We present RAWLSNET, a system for altering Bayesian Network (BN) models to satisfy the Rawlsian principle of fair equality of opportunity (FEO). RAWLSNET's BN models generate aspirational data distributions: data generated to reflect an ideally fair, FEO-satisfying society. FEO states that everyone with the same talent and willingness to use it should have the same chance of achieving advantageous social positions (e.g., employment), regardless of their background circumstances (e.g., socioeconomic status). Satisfying FEO requires alterations to social structures such as school assignments. Our paper describes RAWLSNET, a method which takes as input a BN representation of an FEO application and alters the BN's parameters so as to satisfy FEO when possible, and minimize deviation from FEO otherwise. We also offer guidance for applying RAWLSNET, including on recognizing proper applications of FEO. We demonstrate the use of our system with publicly available data sets. RAWLSNET's altered BNs offer the novel capability of generating aspirational data for FEO-relevant tasks. Aspirational data are free from the biases of real-world data, and thus are useful for recognizing and detecting sources of unfairness in machine learning algorithms besides biased data.
It is well known that networks generated by common mechanisms such as preferential attachment and homophily can disadvantage the minority group by limiting their ability to establish links with the …
It is well known that networks generated by common mechanisms such as preferential attachment and homophily can disadvantage the minority group by limiting their ability to establish links with the majority group. This has the effect of limiting minority nodes' access to information. In this paper, we present the results of an empirical study on the equality of information access in network models with different growth mechanisms and spreading processes. For growth mechanisms, we focus on the majority/minority dichotomy, homophily, preferential attachment, and diversity. For spreading processes, we investigate simple vs. complex contagions, different transmission rates within and between groups, and various seeding conditions. We observe two phenomena. First, information access equality is a complex interplay between network structures and the spreading processes. Second, there is a trade-off between equality and efficiency of information access under certain circumstances (e.g., when inter-group edges are low and information transmits asymmetrically). Our findings can be used to make recommendations for mechanistic design of social networks.
The COVID-19 pandemic offers an unprecedented natural experiment providing insights into the emergence of collective behavioral changes of both exogenous (government mandated) and endogenous (spontaneous reaction to infection risks) origin. …
The COVID-19 pandemic offers an unprecedented natural experiment providing insights into the emergence of collective behavioral changes of both exogenous (government mandated) and endogenous (spontaneous reaction to infection risks) origin. Here, we characterize collective physical distancing -- mobility reductions, minimization of contacts, shortening of contact duration -- in response to the COVID-19 pandemic in the pre-vaccine era by analyzing de-identified, privacy-preserving location data for a panel of over 5.5 million anonymized, opted-in U.S. devices. We define five indicators of users' mobility and proximity to investigate how the emerging collective behavior deviates from the typical pre-pandemic patterns during the first nine months of the COVID-19 pandemic. We analyze both the dramatic changes due to the government mandated mitigation policies and the more spontaneous societal adaptation into a new (physically distanced) normal in the fall 2020. The indicators defined here allow the quantification of behavior changes across the rural/urban divide and highlight the statistical association of mobility and proximity indicators with metrics characterizing the pandemic's social and public health impact such as unemployment and deaths. This study provides a framework to study massive social distancing phenomena with potential uses in analyzing and monitoring the effects of pandemic mitigation plans at the national and international level.
Suppose there is a spreading process propagating on a weighted graph. Denote the graph's weight matrix as W. How would we reduce the number of nodes affected during the process? …
Suppose there is a spreading process propagating on a weighted graph. Denote the graph's weight matrix as W. How would we reduce the number of nodes affected during the process? This question appears in recent studies about counterfactual outcomes of implementing edge-weight interventions on mobility networks (Chang et al. (2021)). A practical algorithm to reduce infections is by removing edges with the highest edge centrality, defined as the product of two adjacent nodes' eigen- scores (Tong et al. (2012)). In this work, we design edge-weight reduction algorithms on static and time- varying weighted networks with theoretical guarantees. First, we prove that edge centrality equals the gradient of the largest eigenvalue of WW⊤ (over W) and generalize the gradient for the largest r eigenvalues of WW⊤. Second, we design a Frank-Wolfe algorithm for finding the optimal edge-weight reduction to shrink the largest r eigenvalues of WW⊤ under any reduction budget. Third, we extend our algorithm to time-varying networks with guaranteed optimality. We perform a detailed empirical study to validate our approach. Our algorithm significantly reduces the number of infections compared with existing methods on eleven weighted networks. Further, we illustrate several properties of our algorithm: the benefit of choosing r, fast convergence to the optimum, and a linear-scale runtime per iteration.
Identifying shortest paths between nodes in a network is an important task in applications involving routing of resources. Recent work has shown that a malicious actor can manipulate a graph …
Identifying shortest paths between nodes in a network is an important task in applications involving routing of resources. Recent work has shown that a malicious actor can manipulate a graph to make traffic between two nodes of interest follow their target path. In this paper, we develop a defense against such attacks by modifying the weights of the graph that users observe. The defender must balance inhibiting the attacker against any negative effects of the defense on benign users. Specifically, the defender's goals are: (a) to recommend the shortest paths possible to users, (b) for the lengths of the shortest paths in the published graph to be close to those of the same paths in the true graph, and (c) to minimize the probability of an attack. We formulate the defense as a Stackelberg game in which the defender is the leader and the attacker is the follower. In this context, we also consider a zero-sum version of the game, in which the defender's goal is to minimize cost while achieving the minimum possible attack probability. We show that this problem is NP-hard and propose heuristic solutions based on increasing edge weights along target paths in both the zero-sum and non-zero-sum settings. Relaxing some constraints of the original problem, we formulate a linear program for local optimization around a feasible point. We present defense results with both synthetic and real network datasets and show that these methods often reach the lower bound of the defender's cost.
Link prediction is a crucial task in graph machine learning with diverse applications. We explore the interplay between node attributes and graph topology and demonstrate that incorporating pre-trained node attributes …
Link prediction is a crucial task in graph machine learning with diverse applications. We explore the interplay between node attributes and graph topology and demonstrate that incorporating pre-trained node attributes improves the generalization power of link prediction models. Our proposed method, UPNA (Unsupervised Pre-training of Node Attributes), solves the inductive link prediction problem by learning a function that takes a pair of node attributes and predicts the probability of an edge, as opposed to Graph Neural Networks (GNN), which can be prone to topological shortcuts in graphs with power-law degree distribution. In this manner, UPNA learns a significant part of the latent graph generation mechanism since the learned function can be used to add incoming nodes to a growing graph. By leveraging pre-trained node attributes, we overcome observational bias and make meaningful predictions about unobserved nodes, surpassing state-of-the-art performance (3X to 34X improvement on benchmark datasets). UPNA can be applied to various pairwise learning tasks and integrated with existing link prediction models to enhance their generalizability and bolster graph generative models.
Abstract In this work, we explore multiplex graph (networks with different types of edges) generation with deep generative models. We discuss some of the challenges associated with multiplex graph generation …
Abstract In this work, we explore multiplex graph (networks with different types of edges) generation with deep generative models. We discuss some of the challenges associated with multiplex graph generation that make it a more difficult problem than traditional graph generation. We propose T en GAN, the first neural network for multiplex graph generation, which greatly reduces the number of parameters required for multiplex graph generation. We also propose 3 different criteria for evaluating the quality of generated graphs: a graph-attribute-based, a classifier-based, and a tensor-based method. We evaluate its performance on 4 datasets and show that it generally performs better than other existing statistical multiplex graph generative models. We also adapt HGEN, an existing deep generative model for heterogeneous information networks, to work for multiplex graphs and show that our method generally performs better.
Hypergraphs are a natural and powerful choice for modeling group interactions in the real world, which are often referred to as higher-order networks. For example, when modeling collaboration networks, where …
Hypergraphs are a natural and powerful choice for modeling group interactions in the real world, which are often referred to as higher-order networks. For example, when modeling collaboration networks, where collaborations can involve not just two but three or more people, employing hypergraphs allows us to explore beyond pairwise (dyadic) patterns and capture groupwise (polyadic) patterns. The mathematical complexity of hypergraphs offers both opportunities and challenges for learning and mining on hypergraphs, and hypergraph mining, which seeks to enhance our understanding of underlying systems through hypergraph modeling, gained increasing attention in research. Researchers have discovered various structural patterns in real-world hypergraphs, leading to the development of mining tools. Moreover, they have designed generators with the aim of reproducing and thereby shedding light on these patterns. In this survey, we provide a comprehensive overview of the current landscape of hypergraph mining, covering patterns, tools, and generators. We provide comprehensive taxonomies for them, and we also provide in-depth discussions to provide insights into future research on hypergraph mining.
Abstract When dealing with large graphs, community detection is a useful data triage tool that can identify subsets of the network that a data analyst should investigate. In an adversarial …
Abstract When dealing with large graphs, community detection is a useful data triage tool that can identify subsets of the network that a data analyst should investigate. In an adversarial scenario, the graph may be manipulated to avoid scrutiny of certain nodes by the analyst. Robustness to such behaviour is an important consideration for data analysts in high-stakes scenarios such as cyber defense and counterterrorism. In this paper, we evaluate the use of overlapping community detection methods in the presence of adversarial attacks aimed at lowering the priority of a specific vertex. We formulate the data analyst’s choice as a Stackelberg game in which the analyst chooses a community detection method and the attacker chooses an attack strategy in response. Applying various attacks from the literature to nine real network datasets, we find that, when the attacker has a sufficient budget, overlapping community detection methods outperform non-overlapping methods, often overwhelmingly so. This is the case when the attacker can only add edges that connect to the target and when the capability is added to add edges between neighbours of the target. We also analyze the tradeoff between robustness in the presence of an attack and performance when there is no attack. Our extensible analytic framework enables network data analysts to take these considerations into account and incorporate new attacks and community detection methods as they are developed.
Hypergraphs, which belong to the family of higher-order networks, are a natural and powerful choice for modeling group interactions in the real world. For example, when modeling collaboration networks, which …
Hypergraphs, which belong to the family of higher-order networks, are a natural and powerful choice for modeling group interactions in the real world. For example, when modeling collaboration networks, which may involve not just two but three or more people, the use of hypergraphs allows us to explore beyond pairwise (dyadic) patterns and capture groupwise (polyadic) patterns. The mathematical complexity of hypergraphs offers both opportunities and challenges for hypergraph mining. The goal of hypergraph mining is to find structural properties recurring in real-world hypergraphs across different domains, which we call patterns. To find patterns, we need tools. We divide hypergraph mining tools into three categories: (1) null models (which help test the significance of observed patterns), (2) structural elements (i.e., substructures in a hypergraph such as open and closed triangles), and (3) structural quantities (i.e., numerical tools for computing hypergraph patterns such as transitivity). There are also hypergraph generators, whose objective is to produce synthetic hypergraphs that are a faithful representation of real-world hypergraphs. In this survey, we provide a comprehensive overview of the current landscape of hypergraph mining, covering patterns, tools, and generators. We provide comprehensive taxonomies for each and offer in-depth discussions for future research on hypergraph mining.
Analysis of single-cell RNA sequencing data is often conducted through network projections such as coexpression networks, primarily due to the abundant availability of network analysis tools for downstream tasks. However, …
Analysis of single-cell RNA sequencing data is often conducted through network projections such as coexpression networks, primarily due to the abundant availability of network analysis tools for downstream tasks. However, this approach has several limitations: loss of higher-order information, inefficient data representation caused by converting a sparse dataset to a fully connected network, and overestimation of coexpression due to zero-inflation. To address these limitations, we propose conceptualizing scRNA-seq expression data as hypergraphs, which are generalized graphs in which the hyperedges can connect more than two vertices. In the context of scRNA-seq data, the hypergraph nodes represent cells and the edges represent genes. Each hyperedge connects all cells where its corresponding gene is actively expressed and records the expression of the gene across different cells. This hypergraph conceptualization enables us to explore multi-way relationships beyond the pairwise interactions in coexpression networks without loss of information. We propose two novel clustering methods: (1) the Dual-Importance Preference Hypergraph Walk (DIPHW) and (2) the Coexpression and Memory-Integrated Dual-Importance Preference Hypergraph Walk (CoMem-DIPHW). They outperform established methods on both simulated and real scRNA-seq datasets. The improvement brought by our proposed methods is especially significant when data modularity is weak. Furthermore, CoMem-DIPHW incorporates the gene coexpression network, cell coexpression network, and the cell-gene expression hypergraph from the single-cell abundance counts data altogether for embedding computation. This approach accounts for both the local level information from single-cell level gene expression and the global level information from the pairwise similarity in the two coexpression networks.
Classifying genome sequences based on metadata has been an active area of research in comparative genomics for decades with many important applications across the life sciences. Established methods for classifying …
Classifying genome sequences based on metadata has been an active area of research in comparative genomics for decades with many important applications across the life sciences. Established methods for classifying genomes can be broadly grouped into sequence alignment-based and alignment-free models. Conventional alignment-based models rely on genome similarity measures calculated based on local sequence alignments or consistent ordering among sequences. However, such methods are computationally expensive when dealing with large ensembles of even moderately sized genomes. In contrast, alignment-free (AF) approaches measure genome similarity based on summary statistics in an unsupervised setting and are efficient enough to analyze large datasets. However, both alignment-based and AF methods typically assume fixed scoring rubrics that lack the flexibility to assign varying importance to different parts of the sequences based on prior knowledge. In this study, we integrate AI and network science approaches to develop a comparative genomic analysis framework that addresses these limitations. Our approach, termed the Genome Misclassification Network Analysis (GMNA), simultaneously leverages misclassified instances, a learned scoring rubric, and label information to classify genomes based on associated metadata and better understand potential drivers of misclassification. We evaluate the utility of the GMNA using Naive Bayes and convolutional neural network models, supplemented by additional experiments with transformer-based models, to construct SARS-CoV-2 sampling location classifiers using over 500,000 viral genome sequences and study the resulting network of misclassifications. We demonstrate the global health potential of the GMNA by leveraging the SARS-CoV-2 genome misclassification networks to investigate the role human mobility played in structuring geographic clustering of SARS-CoV-2.
Machine learning models for graphs in real-world applications are prone to two primary types of uncertainty: (1) those that arise from incomplete and noisy data and (2) those that arise …
Machine learning models for graphs in real-world applications are prone to two primary types of uncertainty: (1) those that arise from incomplete and noisy data and (2) those that arise from uncertainty of the model in its output. These sources of uncertainty are not mutually exclusive. Additionally, models are susceptible to targeted adversarial attacks, which exacerbate both of these uncertainties. In this work, we introduce Radius Enhanced Graph Embeddings (REGE), an approach that measures and incorporates uncertainty in data to produce graph embeddings with radius values that represent the uncertainty of the model's output. REGE employs curriculum learning to incorporate data uncertainty and conformal learning to address the uncertainty in the model's output. In our experiments, we show that REGE's graph embeddings perform better under adversarial attacks by an average of 1.5% (accuracy) against state-of-the-art methods.
Abstract When dealing with large graphs, community detection is a useful data triage tool that can identify subsets of the network that a data analyst should investigate. In an adversarial …
Abstract When dealing with large graphs, community detection is a useful data triage tool that can identify subsets of the network that a data analyst should investigate. In an adversarial scenario, the graph may be manipulated to avoid scrutiny of certain nodes by the analyst. Robustness to such behaviour is an important consideration for data analysts in high-stakes scenarios such as cyber defense and counterterrorism. In this paper, we evaluate the use of overlapping community detection methods in the presence of adversarial attacks aimed at lowering the priority of a specific vertex. We formulate the data analyst’s choice as a Stackelberg game in which the analyst chooses a community detection method and the attacker chooses an attack strategy in response. Applying various attacks from the literature to nine real network datasets, we find that, when the attacker has a sufficient budget, overlapping community detection methods outperform non-overlapping methods, often overwhelmingly so. This is the case when the attacker can only add edges that connect to the target and when the capability is added to add edges between neighbours of the target. We also analyze the tradeoff between robustness in the presence of an attack and performance when there is no attack. Our extensible analytic framework enables network data analysts to take these considerations into account and incorporate new attacks and community detection methods as they are developed.
Node embedding algorithms produce low-dimensional latent representations of nodes in a graph. These embeddings are often used for downstream tasks, such as node classification and link prediction. In this paper, …
Node embedding algorithms produce low-dimensional latent representations of nodes in a graph. These embeddings are often used for downstream tasks, such as node classification and link prediction. In this paper, we investigate the following two questions: (Q1) Can we explain each embedding dimension with human-understandable graph features (e.g. degree, clustering coefficient and PageRank). (Q2) How can we modify existing node embedding algorithms to produce embeddings that can be easily explained by human-understandable graph features? We find that the answer to Q1 is yes and introduce a new framework called XM (short for eXplain eMbedding) to answer Q2. A key aspect of XM involves minimizing the nuclear norm of the generated explanations. We show that by minimizing the nuclear norm, we minimize the lower bound on the entropy of the generated explanations. We test XM on a variety of real-world graphs and show that XM not only preserves the performance of existing node embedding methods, but also enhances their explainability.
A wide range of graph embedding objectives decompose into two components: one that attracts the embeddings of nodes that are perceived as similar, and another that repels embeddings of nodes …
A wide range of graph embedding objectives decompose into two components: one that attracts the embeddings of nodes that are perceived as similar, and another that repels embeddings of nodes that are perceived as dissimilar. Because real-world graphs are sparse and the number of dissimilar pairs grows quadratically with the number of nodes, Skip-Gram Negative Sampling (SGNS) has emerged as a popular and efficient repulsion approach. SGNS repels each node from a sample of dissimilar nodes, as opposed to all dissimilar nodes. In this work, we show that node-wise repulsion is, in aggregate, an approximate re-centering of the node embedding dimensions. Such dimension operations are much more scalable than node operations. The dimension approach, in addition to being more efficient, yields a simpler geometric interpretation of the repulsion. Our result extends findings from the self-supervised learning literature to the skip-gram model, establishing a connection between skip-gram node contrast and dimension regularization. We show that in the limit of large graphs, under mild regularity conditions, the original node repulsion objective converges to optimization with dimension regularization. We use this observation to propose an algorithm augmentation framework that speeds up any existing algorithm, supervised or unsupervised, using SGNS. The framework prioritizes node attraction and replaces SGNS with dimension regularization. We instantiate this generic framework for LINE and node2vec and show that the augmented algorithms preserve downstream performance while dramatically increasing efficiency.
We propose a team assignment algorithm based on a hypergraph approach focusing on resilience and diffusion optimization. Specifically, our method is based on optimizing the algebraic connectivity of the Laplacian …
We propose a team assignment algorithm based on a hypergraph approach focusing on resilience and diffusion optimization. Specifically, our method is based on optimizing the algebraic connectivity of the Laplacian matrix of an edge-dependent vertex-weighted hypergraph. We used constrained simulated annealing, where we constrained the effort agents can exert to perform a task and the minimum effort a task requires to be completed. We evaluated our methods in terms of the number of unsuccessful patches to drive our solution into the feasible region and the cost of patching. We showed that our formulation provides more robust solutions than the original data and the greedy approach. We hope that our methods motivate further research in applying hypergraphs to similar problems in different research areas and in exploring variations of our methods.
The COVID-19 pandemic offers an unprecedented natural experiment providing insights into the emergence of collective behavioral changes of both exogenous (government mandated) and endogenous (spontaneous reaction to infection risks) origin. …
The COVID-19 pandemic offers an unprecedented natural experiment providing insights into the emergence of collective behavioral changes of both exogenous (government mandated) and endogenous (spontaneous reaction to infection risks) origin. Here, we characterize collective physical distancing—mobility reductions, minimization of contacts, shortening of contact duration—in response to the COVID-19 pandemic in the pre-vaccine era by analyzing de-identified, privacy-preserving location data for a panel of over 5.5 million anonymized, opted-in U.S. devices. We define five indicators of users’ mobility and proximity to investigate how the emerging collective behavior deviates from typical pre-pandemic patterns during the first nine months of the COVID-19 pandemic. We analyze both the dramatic changes due to the government mandated mitigation policies and the more spontaneous societal adaptation into a new (physically distanced) normal in the fall 2020. Using the indicators here defined we show that: a) during the COVID-19 pandemic, collective physical distancing displayed different phases and was heterogeneous across geographies, b) metropolitan areas displayed stronger reductions in mobility and contacts than rural areas; c) stronger reductions in commuting patterns are observed in geographical areas with a higher share of teleworkable jobs; d) commuting volumes during and after the lockdown period negatively correlate with unemployment rates; and e) increases in contact indicators correlate with future values of new deaths at a lag consistent with epidemiological parameters and surveillance reporting delays. In conclusion, this study demonstrates that the framework and indicators here presented can be used to analyze large-scale social distancing phenomena, paving the way for their use in future pandemics to analyze and monitor the effects of pandemic mitigation plans at the national and international levels.
Hypergraphs are a natural and powerful choice for modeling group interactions in the real world, which are often referred to as higher-order networks. For example, when modeling collaboration networks, where …
Hypergraphs are a natural and powerful choice for modeling group interactions in the real world, which are often referred to as higher-order networks. For example, when modeling collaboration networks, where collaborations can involve not just two but three or more people, employing hypergraphs allows us to explore beyond pairwise (dyadic) patterns and capture groupwise (polyadic) patterns. The mathematical complexity of hypergraphs offers both opportunities and challenges for learning and mining on hypergraphs, and hypergraph mining, which seeks to enhance our understanding of underlying systems through hypergraph modeling, gained increasing attention in research. Researchers have discovered various structural patterns in real-world hypergraphs, leading to the development of mining tools. Moreover, they have designed generators with the aim of reproducing and thereby shedding light on these patterns. In this survey, we provide a comprehensive overview of the current landscape of hypergraph mining, covering patterns, tools, and generators. We provide comprehensive taxonomies for them, and we also provide in-depth discussions to provide insights into future research on hypergraph mining.
Identifying shortest paths between nodes in a network is a common graph analysis problem that is important for many applications involving routing of resources. An adversary that can manipulate the …
Identifying shortest paths between nodes in a network is a common graph analysis problem that is important for many applications involving routing of resources. An adversary that can manipulate the graph structure could alter traffic patterns to gain some benefit (e.g., make more money by directing traffic to a toll road). This article presents the Force Path Cut problem, in which an adversary removes edges from a graph to make a particular path the shortest between its terminal nodes. We prove that the optimization version of this problem is APX-hard but introduce PATHATTACK , a polynomial-time approximation algorithm that guarantees a solution within a logarithmic factor of the optimal value. In addition, we introduce the Force Edge Cut and Force Node Cut problems, in which the adversary targets a particular edge or node, respectively, rather than an entire path. We derive a nonconvex optimization formulation for these problems and derive a heuristic algorithm that uses PATHATTACK as a subroutine. We demonstrate all of these algorithms on a diverse set of real and synthetic networks, illustrating where the proposed algorithms provide the greatest improvement over baseline methods.
The brain is a complex system comprising a myriad of interacting elements, posing significant challenges in understanding its structure, function, and dynamics. Network science has emerged as a powerful tool …
The brain is a complex system comprising a myriad of interacting elements, posing significant challenges in understanding its structure, function, and dynamics. Network science has emerged as a powerful tool for studying such intricate systems, offering a framework for integrating multiscale data and complexity. Here, we discuss the application of network science in the study of the brain, addressing topics such as network models and metrics, the connectome, and the role of dynamics in neural networks. We explore the challenges and opportunities in integrating multiple data streams for understanding the neural transitions from development to healthy function to disease, and discuss the potential for collaboration between network science and neuroscience communities. We underscore the importance of fostering interdisciplinary opportunities through funding initiatives, workshops, and conferences, as well as supporting students and postdoctoral fellows with interests in both disciplines. By uniting the network science and neuroscience communities, we can develop novel network-based methods tailored to neural circuits, paving the way towards a deeper understanding of the brain and its functions.
Abstract In this work, we explore multiplex graph (networks with different types of edges) generation with deep generative models. We discuss some of the challenges associated with multiplex graph generation …
Abstract In this work, we explore multiplex graph (networks with different types of edges) generation with deep generative models. We discuss some of the challenges associated with multiplex graph generation that make it a more difficult problem than traditional graph generation. We propose T en GAN, the first neural network for multiplex graph generation, which greatly reduces the number of parameters required for multiplex graph generation. We also propose 3 different criteria for evaluating the quality of generated graphs: a graph-attribute-based, a classifier-based, and a tensor-based method. We evaluate its performance on 4 datasets and show that it generally performs better than other existing statistical multiplex graph generative models. We also adapt HGEN, an existing deep generative model for heterogeneous information networks, to work for multiplex graphs and show that our method generally performs better.
Over the past decade, machine learning has revolutionized computers' ability to analyze text through flexible computational models. Due to their structural similarity to written language, transformer-based architectures have also shown …
Over the past decade, machine learning has revolutionized computers' ability to analyze text through flexible computational models. Due to their structural similarity to written language, transformer-based architectures have also shown promise as tools to make sense of a range of multi-variate sequences from protein-structures, music, electronic health records to weather-forecasts. We can also represent human lives in a way that shares this structural similarity to language. From one perspective, lives are simply sequences of events: People are born, visit the pediatrician, start school, move to a new location, get married, and so on. Here, we exploit this similarity to adapt innovations from natural language processing to examine the evolution and predictability of human lives based on detailed event sequences. We do this by drawing on arguably the most comprehensive registry data in existence, available for an entire nation of more than six million individuals across decades. Our data include information about life-events related to health, education, occupation, income, address, and working hours, recorded with day-to-day resolution. We create embeddings of life-events in a single vector space showing that this embedding space is robust and highly structured. Our models allow us to predict diverse outcomes ranging from early mortality to personality nuances, outperforming state-of-the-art models by a wide margin. Using methods for interpreting deep learning models, we probe the algorithm to understand the factors that enable our predictions. Our framework allows researchers to identify new potential mechanisms that impact life outcomes and associated possibilities for personalized interventions.
Suppose there is a spreading process such as an infectious disease propagating on a graph. How would we reduce the number of affected nodes in the spreading process? This question …
Suppose there is a spreading process such as an infectious disease propagating on a graph. How would we reduce the number of affected nodes in the spreading process? This question appears in recent studies about implementing mobility interventions on mobility networks (Chang et al. (2021)). A practical algorithm to reduce infections on unweighted graphs is to remove edges with the highest edge centrality score (Tong et al. (2012)), which is the product of two adjacent nodes' eigenscores. However, mobility networks have weighted edges; Thus, an intervention measure would involve edge-weight reduction besides edge removal. Motivated by this example, we revisit the problem of minimizing top eigenvalue(s) on weighted graphs by decreasing edge weights up to a fixed budget. We observe that the edge centrality score of Tong et al. (2012) is equal to the gradient of the largest eigenvalue of $WW^{\top}$, where $W$ denotes the weight matrix of the graph. We then present generalized edge centrality scores as the gradient of the sum of the largest $r$ eigenvalues of $WW^{\top}$. With this generalization, we design an iterative algorithm to find the optimal edge-weight reduction to shrink the largest $r$ eigenvalues of $WW^{\top}$ under a given edge-weight reduction budget. We also extend our algorithm and its guarantee to time-varying graphs, whose weights evolve over time. We perform a detailed empirical study to validate our approach. Our algorithm significantly reduces the number of infections compared with existing methods on eleven weighted networks. Further, we illustrate several properties of our algorithm, including the benefit of choosing the rank $r$, fast convergence to global optimum, and an almost linear runtime per iteration.
Suppose there is a spreading process propagating on a weighted graph. Denote the graph's weight matrix as W. How would we reduce the number of nodes affected during the process? …
Suppose there is a spreading process propagating on a weighted graph. Denote the graph's weight matrix as W. How would we reduce the number of nodes affected during the process? This question appears in recent studies about counterfactual outcomes of implementing edge-weight interventions on mobility networks (Chang et al. (2021)). A practical algorithm to reduce infections is by removing edges with the highest edge centrality, defined as the product of two adjacent nodes' eigen- scores (Tong et al. (2012)). In this work, we design edge-weight reduction algorithms on static and time- varying weighted networks with theoretical guarantees. First, we prove that edge centrality equals the gradient of the largest eigenvalue of WW⊤ (over W) and generalize the gradient for the largest r eigenvalues of WW⊤. Second, we design a Frank-Wolfe algorithm for finding the optimal edge-weight reduction to shrink the largest r eigenvalues of WW⊤ under any reduction budget. Third, we extend our algorithm to time-varying networks with guaranteed optimality. We perform a detailed empirical study to validate our approach. Our algorithm significantly reduces the number of infections compared with existing methods on eleven weighted networks. Further, we illustrate several properties of our algorithm: the benefit of choosing r, fast convergence to the optimum, and a linear-scale runtime per iteration.
The brain is a complex system comprising a myriad of interacting elements, posing significant challenges in understanding its structure, function, and dynamics. Network science has emerged as a powerful tool …
The brain is a complex system comprising a myriad of interacting elements, posing significant challenges in understanding its structure, function, and dynamics. Network science has emerged as a powerful tool for studying such intricate systems, offering a framework for integrating multiscale data and complexity. Here, we discuss the application of network science in the study of the brain, addressing topics such as network models and metrics, the connectome, and the role of dynamics in neural networks. We explore the challenges and opportunities in integrating multiple data streams for understanding the neural transitions from development to healthy function to disease, and discuss the potential for collaboration between network science and neuroscience communities. We underscore the importance of fostering interdisciplinary opportunities through funding initiatives, workshops, and conferences, as well as supporting students and postdoctoral fellows with interests in both disciplines. By uniting the network science and neuroscience communities, we can develop novel network-based methods tailored to neural circuits, paving the way towards a deeper understanding of the brain and its functions.
Identifying shortest paths between nodes in a network is an important task in applications involving routing of resources. Recent work has shown that a malicious actor can manipulate a graph …
Identifying shortest paths between nodes in a network is an important task in applications involving routing of resources. Recent work has shown that a malicious actor can manipulate a graph to make traffic between two nodes of interest follow their target path. In this paper, we develop a defense against such attacks by modifying the weights of the graph that users observe. The defender must balance inhibiting the attacker against any negative effects of the defense on benign users. Specifically, the defender's goals are: (a) to recommend the shortest paths possible to users, (b) for the lengths of the shortest paths in the published graph to be close to those of the same paths in the true graph, and (c) to minimize the probability of an attack. We formulate the defense as a Stackelberg game in which the defender is the leader and the attacker is the follower. In this context, we also consider a zero-sum version of the game, in which the defender's goal is to minimize cost while achieving the minimum possible attack probability. We show that this problem is NP-hard and propose heuristic solutions based on increasing edge weights along target paths in both the zero-sum and non-zero-sum settings. Relaxing some constraints of the original problem, we formulate a linear program for local optimization around a feasible point. We present defense results with both synthetic and real network datasets and show that these methods often reach the lower bound of the defender's cost.
The rise of large-scale socio-technical systems in which humans interact with artificial intelligence (AI) systems (including assistants and recommenders, in short AIs) multiplies the opportunity for the emergence of collective …
The rise of large-scale socio-technical systems in which humans interact with artificial intelligence (AI) systems (including assistants and recommenders, in short AIs) multiplies the opportunity for the emergence of collective phenomena and tipping points, with unexpected, possibly unintended, consequences. For example, navigation systems' suggestions may create chaos if too many drivers are directed on the same route, and personalised recommendations on social media may amplify polarisation, filter bubbles, and radicalisation. On the other hand, we may learn how to foster the "wisdom of crowds" and collective action effects to face social and environmental challenges. In order to understand the impact of AI on socio-technical systems and design next-generation AIs that team with humans to help overcome societal problems rather than exacerbate them, we propose to build the foundations of Social AI at the intersection of Complex Systems, Network Science and AI. In this perspective paper, we discuss the main open questions in Social AI, outlining possible technical and scientific challenges and suggesting research avenues.
Link prediction is a crucial task in graph machine learning with diverse applications. We explore the interplay between node attributes and graph topology and demonstrate that incorporating pre-trained node attributes …
Link prediction is a crucial task in graph machine learning with diverse applications. We explore the interplay between node attributes and graph topology and demonstrate that incorporating pre-trained node attributes improves the generalization power of link prediction models. Our proposed method, UPNA (Unsupervised Pre-training of Node Attributes), solves the inductive link prediction problem by learning a function that takes a pair of node attributes and predicts the probability of an edge, as opposed to Graph Neural Networks (GNN), which can be prone to topological shortcuts in graphs with power-law degree distribution. In this manner, UPNA learns a significant part of the latent graph generation mechanism since the learned function can be used to add incoming nodes to a growing graph. By leveraging pre-trained node attributes, we overcome observational bias and make meaningful predictions about unobserved nodes, surpassing state-of-the-art performance (3X to 34X improvement on benchmark datasets). UPNA can be applied to various pairwise learning tasks and integrated with existing link prediction models to enhance their generalizability and bolster graph generative models.
When dealing with large graphs, community detection is a useful data triage tool that can identify subsets of the network that a data analyst should investigate. In an adversarial scenario, …
When dealing with large graphs, community detection is a useful data triage tool that can identify subsets of the network that a data analyst should investigate. In an adversarial scenario, the graph may be manipulated to avoid scrutiny of certain nodes by the analyst. Robustness to such behavior is an important consideration for data analysts in high-stakes scenarios such as cyber defense and counterterrorism. In this paper, we evaluate the use of overlapping community detection methods in the presence of adversarial attacks aimed at lowering the priority of a specific vertex. We formulate the data analyst's choice as a Stackelberg game in which the analyst chooses a community detection method and the attacker chooses an attack strategy in response. Applying various attacks from the literature to seven real network datasets, we find that, when the attacker has a sufficient budget, overlapping community detection methods outperform non-overlapping methods, often overwhelmingly so. This is the case when the attacker can only add edges that connect to the target and when the capability is added to add edges between neighbors of the target. We also analyze the tradeoff between robustness in the presence of an attack and performance when there is no attack. Our extensible analytic framework enables network data analysts to take these considerations into account and incorporate new attacks and community detection methods as they are developed.
Vertex classification -- the problem of identifying the class labels of nodes in a graph -- has applicability in a wide variety of domains. Examples include classifying subject areas of …
Vertex classification -- the problem of identifying the class labels of nodes in a graph -- has applicability in a wide variety of domains. Examples include classifying subject areas of papers in citation networks or roles of machines in a computer network. Vertex classification using graph convolutional networks is susceptible to targeted poisoning attacks, in which both graph structure and node attributes can be changed in an attempt to misclassify a target node. This vulnerability decreases users' confidence in the learning method and can prevent adoption in high-stakes contexts. Defenses have also been proposed, focused on filtering edges before creating the model or aggregating information from neighbors more robustly. This paper considers an alternative: we leverage network characteristics in the training data selection process to improve robustness of vertex classifiers. We propose two alternative methods of selecting training data: (1) to select the highest-degree nodes and (2) to iteratively select the node with the most neighbors minimally connected to the training set. In the datasets on which the original attack was demonstrated, we show that changing the training set can make the network much harder to attack. To maintain a given probability of attack success, the adversary must use far more perturbations; often a factor of 2--4 over the random training baseline. These training set selection methods often work in conjunction with the best recently published defenses to provide even greater robustness. While increasing the amount of randomly selected training data sometimes results in a more robust classifier, the proposed methods increase robustness substantially more. We also run a simulation study in which we demonstrate conditions under which each of the two methods outperforms the other, controlling for the graph topology, homophily of the labels, and node attributes.
Machine learning (ML) approaches are increasingly being used to accelerate combinatorial optimization (CO) problems. We look specifically at the Set Cover Problem (SCP) and propose Graph-SCP, a graph neural network …
Machine learning (ML) approaches are increasingly being used to accelerate combinatorial optimization (CO) problems. We look specifically at the Set Cover Problem (SCP) and propose Graph-SCP, a graph neural network method that can augment existing optimization solvers by learning to identify a much smaller sub-problem that contains the solution space. We evaluate the performance of Graph-SCP on synthetic weighted and unweighted SCP instances with diverse problem characteristics and complexities, and on instances from the OR Library, a canonical benchmark for SCP. We show that Graph-SCP reduces the problem size by 30-70% and achieves run time speedups up to~25x when compared to commercial solvers (Gurobi). Given a desired optimality threshold, Graph-SCP will improve upon it or even achieve 100% optimality. This is in contrast to fast greedy solutions that significantly compromise solution quality to achieve guaranteed polynomial run time. Graph-SCP can generalize to larger problem sizes and can be used with other conventional or ML-augmented CO solvers to lead to potential additional run time improvement.
Recent advances in machine learning (ML) have shown promise in aiding and accelerating classical combinatorial optimization algorithms. ML-based speed ups that aim to learn in an end to end manner …
Recent advances in machine learning (ML) have shown promise in aiding and accelerating classical combinatorial optimization algorithms. ML-based speed ups that aim to learn in an end to end manner (i.e., directly output the solution) tend to trade off run time with solution quality. Therefore, solutions that are able to accelerate existing solvers while maintaining their performance guarantees, are of great interest. We consider an APX-hard problem, where an adversary aims to attack shortest paths in a graph by removing the minimum number of edges. We propose the GRASP algorithm: Graph Attention Accelerated Shortest Path Attack, an ML aided optimization algorithm that achieves run times up to 10x faster, while maintaining the quality of solution generated. GRASP uses a graph attention network to identify a smaller subgraph containing the combinatorial solution, thus effectively reducing the input problem size. Additionally, we demonstrate how careful representation of the input graph, including node features that correlate well with the optimization task, can highlight important structure in the optimization solution.
We study the fairness of dimensionality reduction methods for recommendations. We focus on the established method of principal component analysis (PCA), which identifies latent components and produces a low-rank approximation …
We study the fairness of dimensionality reduction methods for recommendations. We focus on the established method of principal component analysis (PCA), which identifies latent components and produces a low-rank approximation via the leading components while discarding the trailing components. Prior works have defined notions of "fair PCA"; however, these definitions do not answer the following question: what makes PCA unfair? We identify two underlying mechanisms of PCA that induce unfairness at the item level. The first negatively impacts less popular items, due to the fact that less popular items rely on trailing latent components to recover their values. The second negatively impacts the highly popular items, since the leading PCA components specialize in individual popular items instead of capturing similarities between items. To address these issues, we develop a polynomial-time algorithm, Item-Weighted PCA, a modification of PCA that uses item-specific weights in the objective. On a stylized class of matrices, we prove that Item-Weighted PCA using a specific set of weights minimizes a popularity-normalized error metric. Our evaluations on real-world datasets show that Item-Weighted PCA not only improves overall recommendation quality by up to $0.1$ item-level AUC-ROC but also improves on both popular and less popular items.
Scalable addressing of high dimensional constrained combinatorial optimization problems is a challenge that arises in several science and engineering disciplines. Recent work introduced novel application of graph neural networks for …
Scalable addressing of high dimensional constrained combinatorial optimization problems is a challenge that arises in several science and engineering disciplines. Recent work introduced novel application of graph neural networks for solving polynomial-cost unconstrained combinatorial optimization problems. This paper proposes a new framework, called HypOp, which greatly advances the state of the art for solving combinatorial optimization problems in several aspects: (i) it generalizes the prior results to constrained optimization problems with an arbitrary cost function; (ii) it broadens the application to higher dimensional problems by leveraging a hypergraph neural network structure; (iii) it enables scalability to much larger problems by introducing a new distributed and parallel architecture for hypergraph neural network training; (iv) it demonstrates generalizability to other problem formulations by knowledge transfer from the learned experience of addressing one set of cost/constraints to another set for the same hypergraph; (v) it significantly boosts the solution accuracy compared with the prior art by suggesting a fine-tuning step using simulated annealing; (vi) HypOp shows a remarkable progress on benchmark examples, with run times improved by up to fivefold using a combination of fine-tuning and distributed training techniques. The framework allows addressing a novel set of scientific problems including hypergraph MaxCut problem, satisfiability problems (3SAT), and resource allocation. We showcase the application of HypOp in scientific discovery by solving a hypergraph MaxCut problem on the NDC drug-substance hypergraph. Through extensive experimentation on a variety of combinatorial optimization problems, HypOp demonstrates superiority over existing unsupervised learning-based solvers and generic optimization methods.
The issue of bias (i.e., systematic unfairness) in machine learning models has recently attracted the attention of both researchers and practitioners. For the graph mining community in particular, an important …
The issue of bias (i.e., systematic unfairness) in machine learning models has recently attracted the attention of both researchers and practitioners. For the graph mining community in particular, an important goal toward algorithmic fairness is to detect and mitigate bias incorporated into graph embeddings since they are commonly used in human-centered applications, e.g., social-media recommendations. However, simple analytical methods for detecting bias typically involve aggregate statistics which do not reveal the sources of unfairness. Instead, visual methods can provide a holistic fairness characterization of graph embeddings and help uncover the causes of observed bias. In this work, we present BIAS <inf xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">COPE</inf> , an interactive visualization tool that supports end-to-end visual unfairness diagnosis for graph embeddings. The tool is the product of a design study in collaboration with domain experts. It allows the user to (i) visually compare two embeddings with respect to fairness, (ii) locate nodes or graph communities that are unfairly embedded, and (iii) understand the source of bias by interactively linking the relevant embedding subspace with the corresponding graph topology. Experts' feedback confirms that our tool is effective at detecting and diagnosing unfairness. Thus, we envision our tool both as a companion for researchers in designing their algorithms as well as a guide for practitioners who use off-the-shelf graph embeddings.
Abstract The structure of complex networks can be characterized by counting and analysing network motifs. Motifs are small graph structures that occur repeatedly in a network, such as triangles or …
Abstract The structure of complex networks can be characterized by counting and analysing network motifs. Motifs are small graph structures that occur repeatedly in a network, such as triangles or chains. Recent work has generalized motifs to temporal and dynamic network data. However, existing techniques do not generalize to sequential or trajectory data, which represent entities moving through the nodes of a network, such as passengers moving through transportation networks. The unit of observation in these data is fundamentally different since we analyse observations of trajectories (e.g. a trip from airport A to airport C through airport B), rather than independent observations of edges or snapshots of graphs over time. In this work, we define sequential motifs in trajectory data, which are small, directed and sequence-ordered graphs corresponding to patterns in observed sequences. We draw a connection between the counting and analysis of sequential motifs and Higher-Order Network (HON) models. We show that by mapping edges of a HON, specifically a $k$th-order DeBruijn graph, to sequential motifs, we can count and evaluate their importance in observed data. We test our methodology with two datasets: (1) passengers navigating an airport network and (2) people navigating the Wikipedia article network. We find that the most prevalent and important sequential motifs correspond to intuitive patterns of traversal in the real systems and show empirically that the heterogeneity of edge weights in an observed higher-order DeBruijn graph has implications for the distributions of sequential motifs we expect to see across our null models.
Are the embeddings of a graph's degenerate core stable? What happens to the embeddings of nodes in the degenerate core as we systematically remove periphery nodes (by repeated peeling off …
Are the embeddings of a graph's degenerate core stable? What happens to the embeddings of nodes in the degenerate core as we systematically remove periphery nodes (by repeated peeling off $k$-cores)? We discover three patterns w.r.t. instability in degenerate-core embeddings across a variety of popular graph embedding algorithms and datasets. We use regression to quantify the change point in graph embedding stability. Furthermore, we present the STABLE algorithm, which takes an existing graph embedding algorithm and makes it stable. We show the effectiveness of STABLE in terms of making the degenerate-core embedding stable and still producing state-of-the-art link prediction performance.
The cyber-threat landscape has evolved tremendously in recent years, with new threat variants emerging daily, and large-scale coordinated campaigns becoming more prevalent. In this study, we propose CELEST (CollaborativE LEarning …
The cyber-threat landscape has evolved tremendously in recent years, with new threat variants emerging daily, and large-scale coordinated campaigns becoming more prevalent. In this study, we propose CELEST (CollaborativE LEarning for Scalable Threat detection, a federated machine learning framework for global threat detection over HTTP, which is one of the most commonly used protocols for malware dissemination and communication. CELEST leverages federated learning in order to collaboratively train a global model across multiple clients who keep their data locally, thus providing increased privacy and confidentiality assurances. Through a novel active learning component integrated with the federated learning technique, our system continuously discovers and learns the behavior of new, evolving, and globally-coordinated cyber threats. We show that CELEST is able to expose attacks that are largely invisible to individual organizations. For instance, in one challenging attack scenario with data exfiltration malware, the global model achieves a three-fold increase in Precision-Recall AUC compared to the local model. We also design a poisoning detection and mitigation method, DTrust, specifically designed for federated learning in the collaborative threat detection domain. DTrust successfully detects poisoning clients using the feedback from participating clients to investigate and remove them from the training process. We deploy CELEST on two university networks and show that it is able to detect the malicious HTTP communication with high precision and low false positive rates. Furthermore, during its deployment, CELEST detected a set of previously unknown 42 malicious URLs and 20 malicious domains in one day, which were confirmed to be malicious by VirusTotal.
Self-propagating malware (SPM) has recently resulted in large financial losses and high social impact, with well-known campaigns such as WannaCry and Colonial Pipeline being able to propagate rapidly on the …
Self-propagating malware (SPM) has recently resulted in large financial losses and high social impact, with well-known campaigns such as WannaCry and Colonial Pipeline being able to propagate rapidly on the Internet and cause service disruptions. To date, the propagation behavior of SPM is still not well understood, resulting in the difficulty of defending against these cyber threats. To address this gap, in this paper we perform a comprehensive analysis of a newly proposed epidemiological model for SPM propagation, Susceptible-Infected-Infected Dormant-Recovered (SIIDR). We perform a theoretical analysis of the stability of the SIIDR model and derive its basic reproduction number by representing it as a system of Ordinary Differential Equations with continuous time. We obtain access to 15 WananCry attack traces generated under various conditions, derive the model's transition rates, and show that SIIDR fits best the real data. We find that the SIIDR model outperforms more established compartmental models from epidemiology, such as SI, SIS, and SIR, at modeling SPM propagation.
Identifying shortest paths between nodes in a network is a common graph analysis problem that is important for many applications involving routing of resources. An adversary that can manipulate the …
Identifying shortest paths between nodes in a network is a common graph analysis problem that is important for many applications involving routing of resources. An adversary that can manipulate the graph structure could alter traffic patterns to gain some benefit (e.g., make more money by directing traffic to a toll road). This paper presents the Force Path Cut problem, in which an adversary removes edges from a graph to make a particular path the shortest between its terminal nodes. We prove that this problem is APX-hard, but introduce PATHATTACK, a polynomial-time approximation algorithm that guarantees a solution within a logarithmic factor of the optimal value. In addition, we introduce the Force Edge Cut and Force Node Cut problems, in which the adversary targets a particular edge or node, respectively, rather than an entire path. We derive a nonconvex optimization formulation for these problems, and derive a heuristic algorithm that uses PATHATTACK as a subroutine. We demonstrate all of these algorithms on a diverse set of real and synthetic networks, illustrating the network types that benefit most from the proposed algorithms.
The COVID-19 pandemic offers an unprecedented natural experiment providing insights into the emergence of collective behavioral changes of both exogenous (government mandated) and endogenous (spontaneous reaction to infection risks) origin. …
The COVID-19 pandemic offers an unprecedented natural experiment providing insights into the emergence of collective behavioral changes of both exogenous (government mandated) and endogenous (spontaneous reaction to infection risks) origin. Here, we characterize collective physical distancing -- mobility reductions, minimization of contacts, shortening of contact duration -- in response to the COVID-19 pandemic in the pre-vaccine era by analyzing de-identified, privacy-preserving location data for a panel of over 5.5 million anonymized, opted-in U.S. devices. We define five indicators of users' mobility and proximity to investigate how the emerging collective behavior deviates from the typical pre-pandemic patterns during the first nine months of the COVID-19 pandemic. We analyze both the dramatic changes due to the government mandated mitigation policies and the more spontaneous societal adaptation into a new (physically distanced) normal in the fall 2020. The indicators defined here allow the quantification of behavior changes across the rural/urban divide and highlight the statistical association of mobility and proximity indicators with metrics characterizing the pandemic's social and public health impact such as unemployment and deaths. This study provides a framework to study massive social distancing phenomena with potential uses in analyzing and monitoring the effects of pandemic mitigation plans at the national and international level.
The issue of bias (i.e., systematic unfairness) in machine learning models has recently attracted the attention of both researchers and practitioners. For the graph mining community in particular, an important …
The issue of bias (i.e., systematic unfairness) in machine learning models has recently attracted the attention of both researchers and practitioners. For the graph mining community in particular, an important goal toward algorithmic fairness is to detect and mitigate bias incorporated into graph embeddings since they are commonly used in human-centered applications, e.g., social-media recommendations. However, simple analytical methods for detecting bias typically involve aggregate statistics which do not reveal the sources of unfairness. Instead, visual methods can provide a holistic fairness characterization of graph embeddings and help uncover the causes of observed bias. In this work, we present BiaScope, an interactive visualization tool that supports end-to-end visual unfairness diagnosis for graph embeddings. The tool is the product of a design study in collaboration with domain experts. It allows the user to (i) visually compare two embeddings with respect to fairness, (ii) locate nodes or graph communities that are unfairly embedded, and (iii) understand the source of bias by interactively linking the relevant embedding subspace with the corresponding graph topology. Experts' feedback confirms that our tool is effective at detecting and diagnosing unfairness. Thus, we envision our tool both as a companion for researchers in designing their algorithms as well as a guide for practitioners who use off-the-shelf graph embeddings.
Recent self-propagating malware (SPM) campaigns compromised hundred of thousands of victim machines on the Internet. It is challenging to detect these attacks in their early stages, as adversaries utilize common …
Recent self-propagating malware (SPM) campaigns compromised hundred of thousands of victim machines on the Internet. It is challenging to detect these attacks in their early stages, as adversaries utilize common network services, use novel techniques, and can evade existing detection mechanisms. We propose PORTFILER (PORT-Level Network Traffic ProFILER), a new machine learning system applied to network traffic for detecting SPM attacks. PORTFILER extracts port-level features from the Zeek connection logs collected at a border of a monitored network, applies anomaly detection techniques to identify suspicious events, and ranks the alerts across ports for investigation by the Security Operations Center (SOC). We propose a novel ensemble methodology for aggregating individual models in PORTFILER that increases resilience against several evasion strategies compared to standard ML baselines. We extensively evaluate PORTFILER on traffic collected from two university networks, and show that it can detect SPM attacks with different patterns, such as WannaCry and Mirai, and performs well under evasion. Ranking across ports achieves precision over 0.94 with low false positive rates in the top ranked alerts. When deployed on the university networks, PORTFILER detected anomalous SPM-like activity on one of the campus networks, confirmed by the university SOC as malicious. PORTFILER also detected a Mirai attack recreated on the two university networks with higher precision and recall than deep-learning-based autoencoder methods.
The structure of complex networks can be characterized by counting and analyzing network motifs. Motifs are small subgraphs that occur repeatedly in a network, such as triangles or chains. Recent …
The structure of complex networks can be characterized by counting and analyzing network motifs. Motifs are small subgraphs that occur repeatedly in a network, such as triangles or chains. Recent work has generalized motifs to temporal and dynamic network data. However, existing techniques do not generalize to sequential or trajectory data, which represents entities moving through the nodes of a network, such as passengers moving through transportation networks. The unit of observation in these data is fundamentally different, since we analyze full observations of trajectories (e.g., a trip from airport A to airport C through airport B), rather than independent observations of edges or snapshots of graphs over time. In this work, we define sequential motifs in trajectory data, which are small, directed, and edge-weighted subgraphs corresponding to patterns in observed sequences. We draw a connection between counting and analysis of sequential motifs and Higher-Order Network (HON) models. We show that by mapping edges of a HON, specifically a $k$th-order DeBruijn graph, to sequential motifs, we can count and evaluate their importance in observed data. We test our methodology with two datasets: (1) passengers navigating an airport network and (2) people navigating the Wikipedia article network. We find that the most prevalent and important sequential motifs correspond to intuitive patterns of traversal in the real systems, and show empirically that the heterogeneity of edge weights in an observed higher-order DeBruijn graph has implications for the distributions of sequential motifs we expect to see across our null models.
Recent self-propagating malware (SPM) campaigns compromised hundred of thousands of victim machines on the Internet. It is challenging to detect these attacks in their early stages, as adversaries utilize common …
Recent self-propagating malware (SPM) campaigns compromised hundred of thousands of victim machines on the Internet. It is challenging to detect these attacks in their early stages, as adversaries utilize common network services, use novel techniques, and can evade existing detection mechanisms. We propose PorTFILER (PORT-Level Network Traffic ProFILER), a new machine learning system applied to network traffic for detecting SPM attacks. PORTFILER extracts port-level features from the Zeek connection logs collected at a border of a monitored network, applies anomaly detection techniques to identify suspicious events, and ranks the alerts across ports for investigation by the Security Operations Center (SOC). We propose a novel ensemble methodology for aggregating individual models in PORTFILER that increases resilience against several evasion strategies compared to standard ML baselines. We extensively evaluate PorTFILER on traffic collected from two university networks, and show that it can detect SPM attacks with different patterns, such as WannaCry and Mirai, and performs well under evasion. Ranking across ports achieves precision over 0.94 and false positive rates below $8 \times 10^{-4}$ in the top 100 highly ranked alerts. When deployed on the university networks, PorTFILER detected anomalous SPM-like activity on one of the campus networks, confirmed by the university SOC as malicious. PortFILER also detected a Mirai attack recreated on the two university networks with higher precision and recall than deep-learning based autoencoder methods.
This is the fifth in a series of white papers providing a summary of the discussions and future directions that are derived from these topical meetings. This paper focuses on …
This is the fifth in a series of white papers providing a summary of the discussions and future directions that are derived from these topical meetings. This paper focuses on issues related to analysis and visual analytics. While these two topics are distinct, there are clear overlaps between the two. It is common to use different visualizations during analysis and given the sheer volume of social media data, visual analytic tools can be important during analysis, as well as during other parts of the research lifecycle. Choices about analysis may be informed by visualization plans and vice versa - both are key in communicating about a data set and what it means. We also recognized that each field of research has different analysis techniques and different levels of familiarity with visual analytics. Putting these two topics into the same meeting provided us with the opportunity to think about analysis and visual analytics/visualization in new, synergistic ways.
We present RAWLSNET, a system for altering Bayesian Network (BN) models to satisfy the Rawlsian principle of fair equality of opportunity (FEO). RAWLSNET's BN models generate aspirational data distributions: data …
We present RAWLSNET, a system for altering Bayesian Network (BN) models to satisfy the Rawlsian principle of fair equality of opportunity (FEO). RAWLSNET's BN models generate aspirational data distributions: data generated to reflect an ideally fair, FEO-satisfying society. FEO states that everyone with the same talent and willingness to use it should have the same chance of achieving advantageous social positions (e.g., employment), regardless of their background circumstances (e.g., socioeconomic status). Satisfying FEO requires alterations to social structures such as school assignments. Our paper describes RAWLSNET, a method which takes as input a BN representation of an FEO application and alters the BN's parameters so as to satisfy FEO when possible, and minimize deviation from FEO otherwise. We also offer guidance for applying RAWLSNET, including on recognizing proper applications of FEO. We demonstrate the use of RAWLSNET with publicly available data sets. RAWLSNET's altered BNs offer the novel capability of generating aspirational data for FEO-relevant tasks. Aspirational data are free from biases of real-world data, and thus are useful for recognizing and detecting sources of unfairness in machine learning algorithms besides biased data.
Shortest paths in complex networks play key roles in many applications. Examples include routing packets in a computer network, routing traffic on a transportation network, and inferring semantic distances between …
Shortest paths in complex networks play key roles in many applications. Examples include routing packets in a computer network, routing traffic on a transportation network, and inferring semantic distances between concepts on the World Wide Web. An adversary with the capability to perturb the graph might make the shortest path between two nodes route traffic through advantageous portions of the graph (e.g., a toll road he owns). In this paper, we introduce the Force Path Cut problem, in which there is a specific route the adversary wants to promote by removing a minimum number of edges in the graph. We show that Force Path Cut is NP-complete, but also that it can be recast as an instance of the Weighted Set Cover problem, enabling the use of approximation algorithms. The size of the universe for the set cover problem is potentially factorial in the number of nodes. To overcome this hurdle, we propose the PATHATTACK algorithm, which via constraint generation considers only a small subset of paths -- at most 5% of the number of edges in 99% of our experiments. Across a diverse set of synthetic and real networks, the linear programming formulation of Weighted Set Cover yields the optimal solution in over 98% of cases. We also demonstrate a time/cost tradeoff using two approximation algorithms and greedy baseline methods. This work provides a foundation for addressing similar problems and expands the area of adversarial graph mining beyond recent work on node classification and embedding.
Abstract Complex networks are often either too large for full exploration, partially accessible, or partially observed. Downstream learning tasks on these incomplete networks can produce low quality results. In addition, …
Abstract Complex networks are often either too large for full exploration, partially accessible, or partially observed. Downstream learning tasks on these incomplete networks can produce low quality results. In addition, reducing the incompleteness of the network can be costly and nontrivial. As a result, network discovery algorithms optimized for specific downstream learning tasks given resource collection constraints are of great interest. In this paper, we formulate the task-specific network discovery problem as a sequential decision-making problem. Our downstream task is selective harvesting, the optimal collection of vertices with a particular attribute. We propose a framework, called network actor critic (NAC), which learns a policy and notion of future reward in an offline setting via a deep reinforcement learning algorithm. The NAC paradigm utilizes a task-specific network embedding to reduce the state space complexity. A detailed comparative analysis of popular network embeddings is presented with respect to their role in supporting offline planning. Furthermore, a quantitative study is presented on various synthetic and real benchmarks using NAC and several baselines. We show that offline models of reward and network discovery policies lead to significantly improved performance when compared to competitive online discovery algorithms. Finally, we outline learning regimes where planning is critical in addressing sparse and changing reward signals.
We present RAWLSNET, a system for altering Bayesian Network (BN) models to satisfy the Rawlsian principle of fair equality of opportunity (FEO). RAWLSNET's BN models generate aspirational data distributions: data …
We present RAWLSNET, a system for altering Bayesian Network (BN) models to satisfy the Rawlsian principle of fair equality of opportunity (FEO). RAWLSNET's BN models generate aspirational data distributions: data generated to reflect an ideally fair, FEO-satisfying society. FEO states that everyone with the same talent and willingness to use it should have the same chance of achieving advantageous social positions (e.g., employment), regardless of their background circumstances (e.g., socioeconomic status). Satisfying FEO requires alterations to social structures such as school assignments. Our paper describes RAWLSNET, a method which takes as input a BN representation of an FEO application and alters the BN's parameters so as to satisfy FEO when possible, and minimize deviation from FEO otherwise. We also offer guidance for applying RAWLSNET, including on recognizing proper applications of FEO. We demonstrate the use of our system with publicly available data sets. RAWLSNET's altered BNs offer the novel capability of generating aspirational data for FEO-relevant tasks. Aspirational data are free from the biases of real-world data, and thus are useful for recognizing and detecting sources of unfairness in machine learning algorithms besides biased data.
Recently, coordinated attack campaigns started to become more widespread on the Internet. In May 2017, WannaCry infected more than 300,000 machines in 150 countries in a few days and had …
Recently, coordinated attack campaigns started to become more widespread on the Internet. In May 2017, WannaCry infected more than 300,000 machines in 150 countries in a few days and had a large impact on critical infrastructure. Existing threat sharing platforms cannot easily adapt to emerging attack patterns. At the same time, enterprises started to adopt machine learning-based threat detection tools in their local networks. In this paper, we pose the question: \emph{What information can defenders share across multiple networks to help machine learning-based threat detection adapt to new coordinated attacks?} We propose three information sharing methods across two networks, and show how the shared information can be used in a machine-learning network-traffic model to significantly improve its ability of detecting evasive self-propagating malware.
Related DatabasesWeb of Science You must be logged in with an active subscription to view this.Article DataHistorySubmitted: 13 July 2020Accepted: 12 January 2021Published online: 13 May 2021Keywordsnonbacktracking, node immunization, centrality …
Related DatabasesWeb of Science You must be logged in with an active subscription to view this.Article DataHistorySubmitted: 13 July 2020Accepted: 12 January 2021Published online: 13 May 2021Keywordsnonbacktracking, node immunization, centrality measure, complex networksAMS Subject Headings05C50, 05C82, 05C85, 68R10Publication DataISSN (online): 2577-0187Publisher: Society for Industrial and Applied MathematicsCODEN: sjmdaq
We present RAWLSNET, a system for altering Bayesian Network (BN) models to satisfy the Rawlsian principle of fair equality of opportunity (FEO). RAWLSNET's BN models generate aspirational data distributions: data …
We present RAWLSNET, a system for altering Bayesian Network (BN) models to satisfy the Rawlsian principle of fair equality of opportunity (FEO). RAWLSNET's BN models generate aspirational data distributions: data generated to reflect an ideally fair, FEO-satisfying society. FEO states that everyone with the same talent and willingness to use it should have the same chance of achieving advantageous social positions (e.g., employment), regardless of their background circumstances (e.g., socioeconomic status). Satisfying FEO requires alterations to social structures such as school assignments. Our paper describes RAWLSNET, a method which takes as input a BN representation of an FEO application and alters the BN's parameters so as to satisfy FEO when possible, and minimize deviation from FEO otherwise. We also offer guidance for applying RAWLSNET, including on recognizing proper applications of FEO. We demonstrate the use of our system with publicly available data sets. RAWLSNET's altered BNs offer the novel capability of generating aspirational data for FEO-relevant tasks. Aspirational data are free from the biases of real-world data, and thus are useful for recognizing and detecting sources of unfairness in machine learning algorithms besides biased data.
Finding shortest paths in a given network (e.g., a computer network or a road network) is a well-studied task with many applications. We consider this task under the presence of …
Finding shortest paths in a given network (e.g., a computer network or a road network) is a well-studied task with many applications. We consider this task under the presence of an adversary, who can manipulate the network by perturbing its edge weights to gain an advantage over others. Specifically, we introduce the Force Path Problem as follows. Given a network, the adversary's goal is to make a specific path the shortest by adding weights to edges in the network. The version of this problem in which the adversary can cut edges is NP-complete. However, we show that Force Path can be solved to within arbitrary numerical precision in polynomial time. We propose the PATHPERTURB algorithm, which uses constraint generation to build a set of constraints that require paths other than the adversary's target to be sufficiently long. Across a highly varied set of synthetic and real networks, we show that the optimal solution often reduces the required perturbation budget by about half when compared to a greedy baseline method.