In this paper we propose to exploit multiple related tasks for accurate multi-sensor 3D object detection. Towards this goal we present an end-to-end learnable architecture that reasons about 2D and 3D object detection as well as ground estimation and depth completion. Our experiments show that all these tasks are complementary and help the network learn better representations by fusing information at various levels. Importantly, our approach leads the KITTI benchmark on 2D, 3D and bird's eye view object detection, while being real-time.
In this paper, we propose a unified panoptic segmentation network (UPSNet) for tackling the newly proposed panoptic segmentation task. On top of a single backbone residual network, we first design a deformable convolution based semantic segmentation head and a Mask R-CNN style instance segmentation head which solve these two subtasks simultaneously. More importantly, we introduce a parameter-free panoptic head which solves the panoptic segmentation via pixel-wise classification. It first leverages the logits from the previous two heads and then innovatively expands the representation to enable prediction of an extra unknown class, which helps better resolve the conflicts between semantic and instance segmentation. Additionally, it handles the challenge caused by the varying number of instances and permits backpropagation to the bottom modules in an end-to-end manner. Extensive experimental results on Cityscapes, COCO and our internal dataset demonstrate that UPSNet achieves state-of-the-art performance with much faster inference. Code has been made available at https://github.com/uber-research/UPSNet.
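For intuition, here is a minimal sketch of how a parameter-free panoptic head can fuse the two heads' logits into one pixel-wise classification. The tensor names, label layout, and the "unknown" rule below are illustrative assumptions, not the exact UPSNet implementation.

```python
import torch

def panoptic_logits(stuff_logits, thing_logits, mask_logits, inst_cls):
    # stuff_logits: (C_st, H, W) semantic logits for stuff classes
    # thing_logits: (C_th, H, W) semantic logits for thing classes
    # mask_logits:  (N, H, W)   per-instance mask logits at image resolution
    # inst_cls:     (N,)        predicted thing class of each instance
    # Each instance channel adds its mask logit to the semantic logit of its
    # class, so the head itself introduces no new parameters.
    inst_channels = mask_logits + thing_logits[inst_cls]            # (N, H, W)
    # One simple "unknown" rule: thing evidence that no instance explains.
    unknown = thing_logits.max(0).values - inst_channels.max(0).values
    return torch.cat([stuff_logits, inst_channels, unknown[None]], 0)

# Pixel-wise panoptic prediction over {stuff classes, instances, unknown}:
# pred = panoptic_logits(stuff, thing, masks, cls).argmax(dim=0)
```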
Our goal is to significantly speed up the runtime of current state-of-the-art stereo algorithms to enable real-time inference. Towards this goal, we developed a differentiable PatchMatch module that allows us to discard most disparities without requiring full cost volume evaluation. We then exploit this representation to learn which range to prune for each pixel. By progressively reducing the search space and effectively propagating such information, we are able to efficiently compute the cost volume for high-likelihood hypotheses and achieve savings in both memory and computation. Finally, an image-guided refinement module is exploited to further improve the performance. Since all our components are differentiable, the full network can be trained end-to-end. Our experiments show that our method achieves competitive results on the KITTI and SceneFlow datasets while running in real time at 62 ms.
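To make the pruning idea concrete, here is a two-stage sketch: score a sparse set of disparity candidates per pixel, then keep only a narrow band around the best ones instead of evaluating the full cost volume. Names, shapes, and the candidate scheme are assumptions, not the paper's modules.

```python
import numpy as np

def pruned_disparity_range(feat_l, feat_r, d_max, n_candidates=16, k=3, width=4):
    # feat_l, feat_r: (C, H, W) unary features of the left/right image.
    C, H, W = feat_l.shape
    cands = np.linspace(0, d_max - 1, n_candidates).astype(int)
    scores = np.full((n_candidates, H, W), -np.inf, dtype=np.float32)
    for i, d in enumerate(cands):
        # correlation score for disparity d (left pixel x matches right x-d)
        scores[i, :, d:] = (feat_l[:, :, d:] * feat_r[:, :, :W - d or None]).sum(0)
    # keep the top-k candidates per pixel; they define the pruned search range
    topk = cands[np.argsort(-scores, axis=0)[:k]]        # (k, H, W)
    lo = np.clip(topk.min(0) - width, 0, d_max - 1)      # (H, W)
    hi = np.clip(topk.max(0) + width, 0, d_max - 1)
    return lo, hi   # only this band is evaluated densely downstream
```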
In this paper, we propose PolyTransform, a novel instance segmentation algorithm that produces precise, geometry-preserving masks by combining the strengths of prevailing segmentation approaches and modern polygon-based methods. In particular, we first exploit a segmentation network to generate instance masks. We then convert the masks into a set of polygons that are fed to a deforming network that transforms the polygons such that they better fit the object boundaries. Our experiments on the challenging Cityscapes dataset show that PolyTransform significantly improves the performance of the backbone instance segmentation network and ranks 1st on the Cityscapes test-set leaderboard. We also show impressive gains in the interactive annotation setting. We release the code at https://github.com/uber-research/PolyTransform.
We tackle the problem of joint perception and motion forecasting in the context of self-driving vehicles. Towards this goal we propose PnPNet, an end-to-end model that takes as input sequential … We tackle the problem of joint perception and motion forecasting in the context of self-driving vehicles. Towards this goal we propose PnPNet, an end-to-end model that takes as input sequential sensor data, and outputs at each time step object tracks and their future trajectories. The key component is a novel tracking module that generates object tracks online from detections and exploits trajectory level features for motion forecasting. Specifically, the object tracks get updated at each time step by solving both the data association problem and the trajectory estimation problem. Importantly, the whole model is end-to-end trainable and benefits from joint optimization of all tasks. We validate PnPNet on two large-scale driving datasets, and show significant improvements over the state-of-the-art with better occlusion recovery and more accurate future prediction.
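The data association subproblem mentioned above can be illustrated with a standard Hungarian matching step. PnPNet learns its affinities from trajectory-level features; in this sketch the affinity matrix is simply an input.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(affinity, gate=0.3):
    # affinity: (n_tracks, n_dets) similarity matrix, higher = better match.
    rows, cols = linear_sum_assignment(-affinity)   # maximize total affinity
    matches = [(r, c) for r, c in zip(rows, cols) if affinity[r, c] >= gate]
    matched = {c for _, c in matches}
    new_tracks = [c for c in range(affinity.shape[1]) if c not in matched]
    return matches, new_tracks   # unmatched detections start new tracks
```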
In this paper we tackle the problem of scene flow estimation in the context of self-driving. We leverage deep learning techniques as well as strong priors, since in our application domain the motion of the scene can be decomposed into the motion of the robot and the 3D motion of the actors in the scene. We formulate the problem as energy minimization in a deep structured model, which can be solved efficiently on the GPU by unrolling a Gauss-Newton solver. Our experiments on the challenging KITTI scene flow dataset show that we outperform the state-of-the-art by a very large margin, while being 800 times faster.
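For intuition, here is a minimal unrolled Gauss-Newton loop on a generic least-squares energy (NumPy, not the paper's GPU implementation): a fixed number of damped normal-equation steps, which is what makes the solver embeddable in a network.

```python
import numpy as np

def unrolled_gauss_newton(x0, residual, jacobian, n_steps=10, damping=1e-4):
    # Minimizes E(x) = ||r(x)||^2 with a fixed, unrolled number of steps.
    x = x0
    for _ in range(n_steps):
        r = residual(x)                               # (m,)
        J = jacobian(x)                               # (m, n)
        H = J.T @ J + damping * np.eye(J.shape[1])    # damped normal equations
        x = x - np.linalg.solve(H, J.T @ r)           # Gauss-Newton update
    return x

# Toy example: fit y = a*exp(b*t) by least squares.
t = np.linspace(0, 1, 20); y = 2.0 * np.exp(1.5 * t)
res = lambda p: p[0] * np.exp(p[1] * t) - y
jac = lambda p: np.stack([np.exp(p[1] * t), p[0] * t * np.exp(p[1] * t)], axis=1)
print(unrolled_gauss_newton(np.array([1.0, 1.0]), res, jac))  # approx [2.0, 1.5]
```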
Alternating Direction Method of Multipliers (ADMM) is a widely used tool for machine learning in distributed settings, where a machine learning model is trained over distributed data sources through an interactive process of local computation and message passing. Such an iterative process can raise privacy concerns for data owners. The goal of this paper is to provide differential privacy for ADMM-based distributed machine learning. Prior approaches to differentially private ADMM exhibit low utility under strong privacy guarantees and often assume the objective functions of the learning problems to be smooth and strongly convex. To address these concerns, we propose a novel differentially private ADMM-based distributed learning algorithm called DP-ADMM, which combines an approximate augmented Lagrangian function with time-varying Gaussian noise addition in the iterative process to achieve higher utility for general objective functions under the same differential privacy guarantee. We also apply the moments accountant method to bound the end-to-end privacy loss. The theoretical analysis shows that DP-ADMM can be applied to a wider class of distributed learning problems, is provably convergent, and offers an explicit utility-privacy tradeoff. To our knowledge, this is the first paper to provide explicit convergence and utility properties for differentially private ADMM-based distributed learning algorithms. The evaluation results demonstrate that our approach achieves good convergence and model accuracy under a strong end-to-end differential privacy guarantee.
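A rough sketch of the kind of noisy first-order ADMM update involved, with a time-decaying noise scale. The function names and the exact update rule are illustrative assumptions; DP-ADMM's actual update is derived from an approximate augmented Lagrangian.

```python
import numpy as np

def dp_admm_local_step(theta, z, lam, grad_loss, rho, t, sigma0, rng):
    # theta: local model; z: global consensus variable; lam: dual variable.
    sigma_t = sigma0 / np.sqrt(t + 1)             # time-varying noise scale
    noise = rng.normal(0.0, sigma_t, size=theta.shape)
    grad = grad_loss(theta) + lam + rho * (theta - z)
    theta_new = theta - (grad + noise) / rho      # noisy linearized primal step
    lam_new = lam + rho * (theta_new - z)         # dual ascent step
    return theta_new, lam_new
```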
Federated Learning rests on the notion of training a global model distributedly on various devices. Under this setting, users' devices perform computations on their own data and then share the results with the cloud server to update the global model. A fundamental issue in such systems is to effectively incentivize user participation. Users suffer from privacy leakage of their local data during the federated model training process, and without well-designed incentives, self-interested users will be unwilling to participate in federated learning tasks and contribute their private data. To bridge this gap, in this paper we adopt game theory to design an effective incentive mechanism, which selects users that are most likely to provide reliable data and compensates them for the costs of their privacy leakage. We formulate our problem as a two-stage Stackelberg game and solve for the game's equilibrium. The effectiveness of the proposed mechanism is demonstrated by extensive simulations.
We study the growth rate of L^p norms of eigenfunctions of the Laplace-Beltrami operator restricted to submanifolds of compact C∞ Riemannian manifolds. The spectral projection operators can be expressed as oscillatory integral operators, so the question reduces to oscillatory integral operator norm estimates. We study the relationship between the extrinsic geometry of these submanifolds and the canonical relations associated to these oscillatory integral operators.
This paper addresses the aerodynamic performance of a Busemann-type supersonic biplane at off-design conditions. An adjoint-based optimization technique is used to optimize the aerodynamic shape of the biplane to reduce the wave drag at a series of Mach numbers ranging from 1.1 to 1.7, under both acceleration and deceleration conditions. The optimized biplane airfoils dramatically reduce the effects of the choked-flow and flow-hysteresis phenomena, while maintaining a certain degree of favorable shockwave interaction effects at the design Mach number. Compared to a diamond-shaped single airfoil of the same total thickness, the wave drag of our optimized biplane is lower at almost all Mach numbers, and is significantly lower at the design Mach number.
Federated learning (FL) enables distributed agents to collaboratively learn a centralized model without sharing their raw data with each other. However, data locality does not provide sufficient privacy protection, and it is desirable to facilitate FL with a rigorous differential privacy (DP) guarantee. Existing DP mechanisms would introduce random noise with magnitude proportional to the model size, which can be quite large in deep neural networks. In this paper, we propose a new FL framework with sparsification-amplified privacy. Our approach integrates random sparsification with gradient perturbation on each agent to amplify the privacy guarantee. Since sparsification increases the number of communication rounds required to achieve a certain target accuracy, which is unfavorable for the DP guarantee, we further introduce acceleration techniques to help reduce the privacy cost. We rigorously analyze the convergence of our approach and utilize Rényi DP to tightly account for the end-to-end DP guarantee. Extensive experiments on benchmark datasets validate that our approach outperforms previous differentially private FL approaches in both privacy guarantee and communication efficiency.
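A minimal sketch of the core primitive, assuming L2 clipping and a uniformly random sparsity mask; the paper's exact scaling and Rényi-DP accounting are omitted.

```python
import numpy as np

def sparsified_private_grad(grad, k, clip, sigma, rng):
    g = grad * min(1.0, clip / (np.linalg.norm(grad) + 1e-12))  # L2 clipping
    idx = rng.choice(g.size, size=k, replace=False)             # random support
    out = np.zeros_like(g)
    # Noise is added only in the k-dimensional subspace; randomizing the
    # support is what amplifies the privacy guarantee.
    out[idx] = g[idx] + rng.normal(0.0, sigma * clip, size=k)
    return out
```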
One of the critical pieces of the self-driving puzzle is understanding the surroundings of a self-driving vehicle (SDV) and predicting how these surroundings will change in the near future. To address this task we propose MultiXNet, an end-to-end approach for detection and motion prediction based directly on lidar sensor data. This approach builds on prior work by handling multiple classes of traffic actors, adding a jointly trained second-stage trajectory refinement step, and producing a multimodal probability distribution over future actor motion that includes both multiple discrete traffic behaviors and calibrated continuous position uncertainties. The method was evaluated on large-scale, real-world data collected by a fleet of SDVs in several cities, with the results indicating that it outperforms existing state-of-the-art approaches.
3D object detection is a key component of many robotic applications such as self-driving vehicles. While many approaches rely on expensive 3D sensors such as LiDAR to produce accurate 3D estimates, methods that exploit stereo cameras have recently shown promising results at a lower cost. Existing approaches tackle this problem in two steps: first, depth estimation from stereo images is performed to produce a pseudo-LiDAR point cloud, which is then used as input to a 3D object detector. However, this approach is suboptimal due to the representation mismatch, as the two tasks are optimized in two different metric spaces. In this paper we propose a model that unifies these two tasks and performs them in the same metric space. Specifically, we directly construct a pseudo-LiDAR feature volume (PLUME) in 3D space, which is then used to solve both the depth estimation and object detection tasks. Our approach achieves state-of-the-art performance with much faster inference times when compared to existing methods on the challenging KITTI benchmark.
We consider a diffusive Nicholson's blowflies equation with non-local delay and study the stability of the uniform steady states and the possible Hopf bifurcation. By using the upper- and lower solutions method, the global stability of constant steady states is obtained. We also discuss the local stability via analysis of the characteristic equation. Moreover, for a special kernel, the occurrence of Hopf bifurcation near the steady state solution and the stability of bifurcated periodic solutions are given via the centre manifold theory. Based on laboratory data and our theoretical results, we address the influence of various types of vaccinations in controlling the outbreak of blowflies.
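For reference, the diffusive Nicholson's blowflies equation with non-local delay is commonly written in a form like the following (notation varies across papers):

\[
\frac{\partial u(t,x)}{\partial t} \;=\; d\,\Delta u(t,x) \;-\; \delta\, u(t,x) \;+\; p \int_{\Omega} G(x,y)\, u(t-\tau,y)\, e^{-a\,u(t-\tau,y)}\, \mathrm{d}y,
\]

with diffusion rate d, death rate δ, maximal reproduction rate p, maturation delay τ, and a kernel G encoding the non-local delay. Its constant steady states are u ≡ 0 and, when p > δ, u* = (1/a) ln(p/δ), which are the states whose stability the analysis concerns.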
With the proliferation of smart devices having built-in sensors, Internet connectivity, and programmable computation capability in the era of the Internet of Things (IoT), a tremendous amount of data is being generated at the network edge. Federated learning is capable of analyzing the large amount of data from a distributed set of smart devices without requiring them to upload their data to a central place. However, the commonly used federated learning algorithm is based on stochastic gradient descent (SGD) and is not suitable for resource-constrained IoT environments due to its high communication resource requirement. Moreover, the privacy of sensitive data on smart devices has become a key concern and needs to be protected rigorously. This paper proposes a novel federated learning framework called DP-PASGD for efficiently training a machine learning model from the data stored across resource-constrained smart devices in IoT while guaranteeing differential privacy. The optimal design of DP-PASGD that maximizes the learning performance while satisfying the limits on resource cost and privacy loss is formulated as an optimization problem, and an approximate solution method based on the convergence analysis of DP-PASGD is developed to solve the optimization problem efficiently. Numerical results based on real-world datasets verify the effectiveness of the proposed DP-PASGD scheme.
Electronic tongues based on potentiometry offer the prospect of rapid and continuous chemical fingerprinting for portable and remote systems. This contribution presents a technology platform comprising a miniaturized electronic tongue based on electropolymerized ion-sensitive films, microcontroller-based data acquisition, a smartphone interface, and a cloud computing back-end for data storage and deployment of machine learning models. The sensor array records a series of differential voltages without use of a true reference electrode, and the resulting time-series potentiometry data is used to train supervised machine learning algorithms. For trained systems, inferencing tasks such as the classification of liquids are completed within less than one minute, including data acquisition at the edge and inference using the cloud-deployed machine learning model. A preliminary demonstration of the complete electronic tongue technology stack is reported for the classification of beverages and mineral water.
We propose a motion forecasting model that exploits a novel structured map representation as well as actor-map interactions. Instead of encoding vectorized maps as raster images, we construct a lane graph from raw map data to explicitly preserve the map structure. To capture the complex topology and long-range dependencies of the lane graph, we propose LaneGCN, which extends graph convolutions with multiple adjacency matrices and along-lane dilation. To capture the complex interactions between actors and maps, we exploit a fusion network consisting of four types of interactions: actor-to-lane, lane-to-lane, lane-to-actor and actor-to-actor. Powered by LaneGCN and actor-map interactions, our model is able to predict accurate and realistic multi-modal trajectories. Our approach significantly outperforms the state-of-the-art on the large-scale Argoverse motion forecasting benchmark.
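A sketch of the layer structure this describes, assuming dense adjacency matrices and illustrative dilations; LaneGCN's exact parameterization differs.

```python
import torch
import torch.nn as nn

class LaneGraphConv(nn.Module):
    """One weight matrix per connection type, plus dilated propagation along
    the lane direction (a simplified sketch of the LaneGCN idea)."""
    def __init__(self, dim, dilations=(1, 2, 4)):
        super().__init__()
        self.rel = nn.ModuleDict({r: nn.Linear(dim, dim, bias=False)
                                  for r in ("pre", "suc", "left", "right")})
        self.dil = nn.ModuleList(nn.Linear(dim, dim, bias=False)
                                 for _ in dilations)
        self.dilations = dilations
        self.self_lin = nn.Linear(dim, dim)

    def forward(self, x, A):
        # x: (N, dim) lane-node features; A[r]: (N, N) adjacency per relation.
        out = self.self_lin(x)
        for r, lin in self.rel.items():
            out = out + A[r] @ lin(x)
        # Powers of the successor adjacency reach k nodes down the lane,
        # capturing long-range dependency along lanes cheaply.
        for k, lin in zip(self.dilations, self.dil):
            out = out + torch.linalg.matrix_power(A["suc"], k) @ lin(x)
        return torch.relu(out)
```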
Many modern robotics systems employ LiDAR as their main sensing modality due to its geometrical richness. Rolling shutter LiDARs are particularly common, in which an array of lasers scans the scene from a rotating base. Points are emitted as a stream of packets, each covering a sector of the 360° coverage. Modern perception algorithms wait for the full sweep to be built before processing the data, which introduces additional latency; for a typical 10 Hz LiDAR this is 100 ms. As a consequence, by the time an output is produced, it no longer accurately reflects the state of the world. This poses a challenge, as robotics applications require minimal reaction times, such that maneuvers can be quickly planned in the event of a safety-critical situation. In this paper we propose StrObe, a novel approach that minimizes latency by ingesting LiDAR packets and emitting a stream of detections without waiting for the full sweep to be built. StrObe reuses computations from previous packets and iteratively updates a latent spatial representation of the scene, which acts as a memory, as new evidence comes in, resulting in accurate low-latency perception. We demonstrate the effectiveness of our approach on a large-scale real-world dataset, showing that StrObe far outperforms the state-of-the-art when latency is taken into account, and matches the performance in the traditional setting.
Federated learning is a machine learning setting where a set of edge devices collaboratively train a model under the orchestration of a central server without sharing their local data. At each communication round of federated learning, edge devices perform multiple steps of stochastic gradient descent with their local data and then upload the computation results to the server for the model update. During this process, the challenge of privacy leakage arises from the information exchange between edge devices and the server when the server is not fully trusted. While some previous privacy-preserving mechanisms could readily be used for federated learning, they usually come at a high cost to the convergence of the algorithm and the utility of the learned model. In this paper, we develop a federated learning approach that addresses the privacy challenge without much degradation in model utility through a combination of local gradient perturbation, secure aggregation, and zero-concentrated differential privacy (zCDP). We provide a tight end-to-end privacy guarantee of our approach and analyze its theoretical convergence rates. Through extensive numerical experiments on real-world datasets, we demonstrate the effectiveness of our proposed method and show its superior trade-off between privacy and model utility.
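The zCDP bookkeeping that enables a tight end-to-end guarantee is compact enough to state. These are the standard Gaussian-mechanism and conversion formulas; the numbers below are illustrative only.

```python
import math

def zcdp_of_gaussian(sensitivity, sigma):
    # rho-zCDP of one Gaussian mechanism release: rho = sens^2 / (2 sigma^2)
    return sensitivity ** 2 / (2 * sigma ** 2)

def zcdp_to_eps(rho, delta):
    # rho-zCDP implies (rho + 2*sqrt(rho * ln(1/delta)), delta)-DP
    return rho + 2 * math.sqrt(rho * math.log(1 / delta))

# zCDP composes additively across rounds, which is what gives the tight
# end-to-end bound over T training rounds (illustrative numbers):
T, sens, sigma, delta = 100, 1.0, 4.0, 1e-5
rho_total = T * zcdp_of_gaussian(sens, sigma)
print("end-to-end epsilon:", zcdp_to_eps(rho_total, delta))
```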
Reconstructing objects from real-world data and rendering them at novel views is critical to bringing realism, diversity and scale to simulation for robotics training and testing. In this work, we present NeuSim, a novel approach that estimates accurate geometry and realistic appearance from sparse in-the-wild data captured at distance and from limited viewpoints. Towards this goal, we represent the object surface as a neural signed distance function and leverage both LiDAR and camera sensor data to reconstruct smooth and accurate geometry and normals. We model the object appearance with a robust physics-inspired reflectance representation effective for in-the-wild data. Our experiments show that NeuSim has strong view synthesis performance on challenging scenarios with sparse training views. Furthermore, we showcase composing NeuSim assets into a virtual world and generating realistic multi-sensor data for evaluating self-driving perception models. The supplementary material can be found at the project website: https://waabi.ai/research/neusim/
Federated learning (FL) is designed to preserve data privacy during model training, where the data remains on the client side (i.e., IoT devices), and only model updates of clients are shared iteratively for collaborative learning. However, this process is vulnerable to privacy attacks and Byzantine attacks: the local model updates shared throughout the FL network will leak private information about the local training data, and they can also be maliciously crafted by Byzantine attackers to disrupt the learning. In this paper, we propose a new FL scheme that guarantees rigorous privacy and simultaneously enhances system robustness against Byzantine attacks. Our approach introduces sparsification- and momentum-driven variance reduction into the client-level differential privacy (DP) mechanism to defend against Byzantine attackers. The security design does not violate the privacy guarantee of the client-level DP mechanism; hence, our approach achieves the same client-level DP guarantee as the state-of-the-art. We conduct extensive experiments on both IID and non-IID datasets and different tasks, and evaluate the performance of our approach against different Byzantine attacks by comparing it with state-of-the-art defense methods. The results of our experiments show the efficacy of our framework and demonstrate its ability to improve system robustness against Byzantine attacks while achieving a strong privacy guarantee.
Graph classification has practical applications in diverse fields. Recent studies show that graph-based machine learning models are especially vulnerable to adversarial perturbations due to the non-i.i.d. nature of graph data. By adding or deleting a small number of edges in the graph, adversaries could greatly change the graph label predicted by a graph classification model. In this work, we propose to build a smoothed graph classification model with a certified robustness guarantee. We prove that the resulting graph classification model outputs the same prediction for a graph under $l_0$ bounded adversarial perturbation. We also evaluate the effectiveness of our approach on a graph convolutional network (GCN) based multi-class graph classification model.
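A generic randomized-smoothing sketch over graphs: classify many randomly perturbed copies and take a majority vote. The perturbation distribution and the certificate computation (a confidence bound over the vote margin) are simplified assumptions, not the paper's exact construction.

```python
import numpy as np

def smoothed_predict(adj, base_classifier, n_samples=1000, beta=0.9, rng=None):
    # adj: (n, n) binary adjacency matrix; each entry kept with prob beta,
    # flipped with prob 1 - beta.
    rng = rng or np.random.default_rng()
    votes = {}
    for _ in range(n_samples):
        flip = rng.random(adj.shape) > beta
        noisy = np.where(flip, 1 - adj, adj)
        noisy = np.triu(noisy, 1)
        noisy = noisy + noisy.T                 # keep the graph undirected
        c = base_classifier(noisy)
        votes[c] = votes.get(c, 0) + 1
    # The margin between the top two vote counts determines the certified
    # l_0 radius (via a binomial confidence bound, omitted here).
    return max(votes, key=votes.get)
```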
Learning world models can teach an agent how the world works in an unsupervised manner. Even though it can be viewed as a special case of sequence modeling, progress in scaling world models for robotic applications such as autonomous driving has been somewhat less rapid than in scaling language models with Generative Pre-trained Transformers (GPT). We identify two major bottlenecks: dealing with a complex and unstructured observation space, and having a scalable generative model. Consequently, we propose Copilot4D, a novel world modeling approach that first tokenizes sensor observations with VQVAE, then predicts the future via discrete diffusion. To efficiently decode and denoise tokens in parallel, we recast the Masked Generative Image Transformer as discrete diffusion and enhance it with a few simple changes, resulting in notable improvement. When applied to learning world models on point cloud observations, Copilot4D reduces the prior state-of-the-art Chamfer distance by more than 65% for 1s prediction, and more than 50% for 3s prediction, across the NuScenes, KITTI Odometry, and Argoverse2 datasets. Our results demonstrate that discrete diffusion on tokenized agent experience can unlock the power of GPT-like unsupervised learning for robotics.
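A generic sketch of the MaskGIT-style parallel decoding that the discrete-diffusion view builds on: start fully masked, and each step commit the most confident token predictions according to a cosine schedule. The model interface and schedule are assumptions, not Copilot4D's implementation.

```python
import math
import torch

@torch.no_grad()
def parallel_denoise(model, n_tokens, mask_id, steps=8, device="cpu"):
    tokens = torch.full((1, n_tokens), mask_id, dtype=torch.long, device=device)
    for step in range(1, steps + 1):
        probs = model(tokens).softmax(-1)              # (1, n_tokens, vocab)
        conf, pred = probs.max(-1)                     # per-token confidence
        # Never rewrite already-committed tokens.
        keep_tok = torch.where(tokens == mask_id, pred, tokens)
        conf = torch.where(tokens == mask_id, conf,
                           torch.full_like(conf, float("inf")))
        # Cosine schedule: cumulative committed fraction reaches 1 at the end.
        n_commit = n_tokens if step == steps else max(
            1, int(n_tokens * (1 - math.cos(math.pi / 2 * step / steps))))
        idx = conf.topk(n_commit, dim=-1).indices
        tokens.scatter_(1, idx, keep_tok.gather(1, idx))
    return tokens
```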
Federated learning has received significant interest recently due to its capability of learning a shared machine learning model across smart devices without accessing their private data in the era of the Internet of Things. This paper jointly considers two critical issues of federated learning -- data privacy and communication efficiency -- and develops a communication-efficient and differentially private federated learning scheme called CPFed. The main challenge in addressing both issues together lies in the fact that data compression techniques often lead to an increased number of training iterations required to achieve some desired training loss due to the compression errors, while the differential privacy guarantee usually deteriorates with respect to the number of training iterations. To reconcile this dilemma, we propose sparsified privacy-masking, which first adds random noise to the model update and then applies an unbiased random sparsifier before uploading the model update at each device in federated learning. By using sparsified privacy-masking, our proposed CPFed scheme can achieve high communication efficiency and a strong data privacy guarantee at the same time while preserving model accuracy. We provide an explicit end-to-end privacy guarantee of CPFed using zero-concentrated differential privacy and give its theoretical convergence rates for both convex and non-convex models. Through extensive numerical experiments on real-world datasets, we demonstrate the effectiveness and efficiency of our proposed method.
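The order of operations matters here: noise first, then sparsification. A minimal sketch, with clipping and privacy accounting omitted; the d/k rescaling is what keeps the sparsifier unbiased.

```python
import numpy as np

def sparsified_privacy_masking(update, k, sigma, rng):
    d = update.size
    noisy = update + rng.normal(0.0, sigma, size=d)   # DP noise added first
    idx = rng.choice(d, size=k, replace=False)        # uniform random support
    out = np.zeros_like(noisy)
    out[idx] = noisy[idx] * (d / k)                   # unbiased: E[out] = noisy
    return out   # only k coordinates need to be communicated
```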
The System Analysis Module (SAM), developed and maintained by Argonne National Laboratory, is designed to provide whole-plant transient safety analysis capabilities for a number of advanced non–light water reactors, including sodium-cooled fast reactor (SFR), lead-cooled fast reactor (LFR), and molten salt reactor (MSR)/fluoride-salt-cooled high-temperature reactor (FHR) designs. SAM is primarily constructed as a systems-level analysis tool, with the potential to incorporate reduced order models from three-dimensional computational fluid dynamics (CFD) simulations to improve characterization of complex, multidimensional physics. It is recognized that the computational expense associated with CFD can be intractable for various engineering analyses, such as uncertainty quantification, inference, and design optimization. This paper explores the reducibility of a SAM model using recent advances in randomized linear algebra techniques, which attempt to find recurring patterns in the various realizations generated by a model after randomly perturbing all its input parameters. The reduction is described in terms of fewer degrees of freedom (DOFs), referred to as the active DOFs, for the model variables such as input model parameters and model responses. The results indicate that there is significant room for additional reduction that may be leveraged for additional computational gains when employing SAM for engineering-intensive analyses that require repeated model executions. Different from physics-based reduction approaches, the proposed approach allows one to estimate upper bounds on the reduction errors, which are rigorously developed in this work. Finally, different methods for surrogate model construction, such as regression and neural network–based training, are employed to correlate the input and output active DOFs, which are related back to the original variables using matrix-based linear transformations.
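A generic randomized range-finder sketch (in the spirit of Halko et al., not SAM-specific code) showing how active DOFs and an error estimate fall out of the randomly perturbed realizations: the discarded singular values bound the reduction error.

```python
import numpy as np

def active_dofs(snapshots, tol=1e-6, rng=None):
    # snapshots: (n_dofs, n_samples); column j is the model response for the
    # j-th random perturbation of the input parameters.
    rng = rng or np.random.default_rng()
    n_dofs, n_samples = snapshots.shape
    Omega = rng.standard_normal((n_samples, n_samples))   # random test matrix
    Q, _ = np.linalg.qr(snapshots @ Omega)                # basis for the range
    s = np.linalg.svd(Q.T @ snapshots, compute_uv=False)  # small SVD
    energy = np.cumsum(s**2) / np.sum(s**2)
    r = int(np.searchsorted(energy, 1.0 - tol)) + 1       # modes within tol
    return r, Q[:, :r]   # number of active DOFs and the reduced basis
```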
Deep network models perform excellently on In-Distribution (ID) data, but can fail significantly on Out-Of-Distribution (OOD) data. While most methods under development focus on improving OOD generalization, little attention has been paid to evaluating the capability of models to handle OOD data. This study is devoted to analyzing the problems of the conventional ID test and designing an OOD test paradigm that accurately evaluates practical performance. Our analysis is based on an introduced categorization of three types of distribution shifts used to generate OOD data. Our main observations are: (1) the ID test fails both to reflect the actual performance of a single model and to compare different models under OOD data; (2) the ID test failure can be ascribed to the learned marginal and conditional spurious correlations resulting from the corresponding distribution shifts. Based on this, we propose novel OOD test paradigms to evaluate the generalization capacity of models to unseen data, and discuss how to use OOD test results to find model bugs and guide model debugging.
The system thermal-hydraulic (STH) code SAM has been coupled to the computational fluid dynamics (CFD) code Simcenter STARCCM+ utilizing a hybrid domain overlapping coupling method in space and an explicit coupling in time. The coupling aims to extend the STH code's applicability to scenarios where local momentum and energy transfers are important yet difficult for STH standalone models to capture, such as three-dimensional (3D) mixing and thermal stratification. The coupling method's numerical stability was previously verified against multiple test cases including two closed-loop configurations, and it was validated against a double T-junction experiment with 3D scalar mixing.
In this paper, an adjoint-based automatic design methodology has been applied to redesign the wing of the P-51 "Dago Red", an aircraft competing in the Reno Air Races. The aircraft reaches speeds above 500 MPH and encounters compressibility drag due to the appearance of shock waves. The objective is to delay drag rise without altering the wing structure. Hence the shape modifications are restricted to adding a bump on the wing surface, allowing only outward movement. Moreover, the changes are restricted to a limited chord-wise range. Results indicate that the perturbations created by this bump propagate along the characteristics and are reflected back from the sonic line to weaken the shock. With this bump on the wing, it is expected that Dago Red can reach a speed of 550 MPH, which would set a new world speed record for a propeller-driven airplane.
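For reference, what makes the adjoint approach tractable with many bump shape parameters is that the gradient cost is independent of the number of design variables. In the standard discrete form, with cost J, flow residual R(w, α) = 0, flow state w, and design variables α:

\[
\Big(\frac{\partial R}{\partial w}\Big)^{T}\psi=\Big(\frac{\partial J}{\partial w}\Big)^{T},
\qquad
\frac{\mathrm{d}J}{\mathrm{d}\alpha}=\frac{\partial J}{\partial \alpha}-\psi^{T}\frac{\partial R}{\partial \alpha},
\]

so one flow solve plus one adjoint solve yields the sensitivity of the drag with respect to every bump parameter at once.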
SAM, a plant-level system analysis tool for advanced reactors (SFR, LFR, MSR/FHR), is under development at Argonne. As a modern system code, SAM aims to improve the prediction of 3D flows relevant to reactor safety during transient conditions. One approach towards this goal is to implement turbulent flow modeling in SAM through an embedded surrogate model for the Reynolds stress/turbulent viscosity based on machine learning techniques. The proposed approach rests on the assumption that there exists a functional dependency between local flow features and the local turbulent viscosity or Reynolds stress. Very limited studies have been performed to validate this assumption. This paper documents a case study that examines the assumption in a scenario of potential reactor applications. The work does not aim to validate the assumption theoretically, but to validate it practically within the limited application domain. From a methodological point of view, the approach used in this paper can be classified as a so-called Type I machine learning (ML) approach, in which a scale separation assumption is proposed, claiming that conservation equations and closure relations are scale separable, so that the turbulence models are local rather than global. The CFD case studied in this work is a 3D transient thermal stratification tank flow problem simulated using a Reynolds-averaged Navier-Stokes turbulence model in the STARCCM+ code. Flow information at all geometric points and all timesteps is collected as training and test data.
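The locality assumption can be tested in a few lines: if local flow features determine local turbulent viscosity, a pointwise regressor trained on some cells and timesteps should predict held-out ones well. The feature set and file names below are placeholders, not the paper's exact inputs.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# X: (n_points, n_features) local features (e.g. velocity/temperature
# gradients, wall distance); y: (n_points,) turbulent viscosity from CFD.
# File names are placeholders for an export from the CFD run.
X, y = np.load("features.npy"), np.load("nu_t.npy")
n_train = int(0.8 * len(y))
model = RandomForestRegressor(n_estimators=200, n_jobs=-1)
model.fit(X[:n_train], y[:n_train])
# High held-out R^2 supports the local functional-dependency assumption
# within this application domain; low R^2 would falsify it there.
print("held-out R^2:", r2_score(y[n_train:], model.predict(X[n_train:])))
```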
In this paper, we develop a machine learning-based Bayesian approach to inversely quantify and reduce the uncertainties of the two-fluid model-based multiphase computational fluid dynamics (MCFD) for bubbly flow simulations. The proposed approach is supported by high-resolution two-phase flow measurement techniques, including double-sensor conductivity probes, high-speed imaging, and particle image velocimetry. Local distributions of key physical quantities of interest (QoIs), including void fraction and phasic velocities, are obtained to support the modular Bayesian inference. In the process, the epistemic uncertainties of the closure relations are inversely quantified, and the aleatory uncertainties arising from stochastic fluctuations of the system are evaluated based on experimental uncertainty analysis. The combined uncertainties are then propagated through the MCFD solver to obtain uncertainties of the QoIs, based on which probability boxes are constructed for validation. The proposed approach relies on three machine learning methods: feedforward neural networks and principal component analysis for surrogate modeling, and Gaussian processes for model-form uncertainty modeling. The whole process is implemented within the open-source deep learning library PyTorch with graphics processing unit (GPU) acceleration, ensuring the efficiency of the computation. The results demonstrate that with the support of high-resolution data, the uncertainty of MCFD simulations can be significantly reduced.
The conservation equations for heat conduction are established based on the concept of thermal mass. We obtain a general heat conduction law which takes into account the spatial and temporal inertia of thermal mass. The general law introduces a damped thermal wave equation. It reduces to the well-known CV model when the spatial inertia of the heat flux and temperature and the temporal inertia of temperature are neglected, which indicates that the CV model only considers the temporal inertia of the heat flux. Numerical simulations on the propagation and superposition of thermal waves show that for small thermal perturbations the CV model agrees with the thermal wave equation based on the thermal mass theory. For larger thermal perturbations, however, the physically impossible phenomenon predicted by the CV model, i.e. the negative temperature induced by thermal wave superposition, is eliminated by the general heat conduction law, which demonstrates that the present heat conduction law based on the thermal mass theory is more reasonable.
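For reference, the CV (Cattaneo-Vernotte) model referred to above is usually written as a relaxation law for the heat flux, which yields a damped thermal wave equation:

\[
\tau \frac{\partial \mathbf{q}}{\partial t} + \mathbf{q} = -k\,\nabla T
\quad\Longrightarrow\quad
\tau \frac{\partial^{2} T}{\partial t^{2}} + \frac{\partial T}{\partial t} = \alpha\,\nabla^{2} T,
\qquad \alpha = \frac{k}{\rho c},
\]

where τ is the relaxation time and α the thermal diffusivity; the thermal-mass law discussed here adds the spatial-inertia terms that this model neglects.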
Accurate predictions and uncertainty quantification (UQ) are essential for decision-making in risk-sensitive fields such as system safety modeling. Deep ensembles (DEs) are efficient and scalable methods for UQ in Deep Neural Networks (DNNs); however, their performance is limited when constructed by simply retraining the same DNN multiple times with randomly sampled initializations. To overcome this limitation, we propose a novel method that combines Bayesian optimization (BO) with DE, referred to as BODE, to enhance both predictive accuracy and UQ. We apply BODE to a case study involving a Densely connected Convolutional Neural Network (DCNN) trained on computational fluid dynamics (CFD) data to predict eddy viscosity in sodium fast reactor thermal stratification modeling. Compared to a manually tuned baseline ensemble, BODE estimates total uncertainty approximately four times lower in a noise-free environment, primarily due to the baseline's overestimation of aleatoric uncertainty. Specifically, BODE estimates aleatoric uncertainty close to zero, while aleatoric uncertainty dominates the total uncertainty in the baseline ensemble. We also observe a reduction of more than 30% in epistemic uncertainty. When Gaussian noise with standard deviations of 5% and 10% is introduced into the data, BODE accurately fits the data and estimates uncertainty that aligns with the data noise. These results demonstrate that BODE effectively reduces uncertainty and enhances predictions in data-driven models, making it a flexible approach for various applications requiring accurate predictions and robust UQ.
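For context, the aleatoric/epistemic split being compared is the standard deep-ensemble decomposition for members that each predict a mean and a variance; this bookkeeping (not BODE's BO loop, which tunes each member's hyperparameters rather than retraining identical networks) is sketched below.

```python
import numpy as np

def ensemble_uncertainty(means, variances):
    # means, variances: (n_members, n_points) per-member predictions of the
    # mean and the aleatoric (data-noise) variance.
    aleatoric = variances.mean(axis=0)    # average predicted data noise
    epistemic = means.var(axis=0)         # disagreement between members
    return means.mean(axis=0), aleatoric, epistemic, aleatoric + epistemic
```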
Backdoor attacks present a significant threat to the robustness of Federated Learning (FL) due to their stealth and effectiveness. They maintain both the main task of the FL system and the backdoor task simultaneously, causing malicious models to appear statistically similar to benign ones, which enables them to evade detection by existing defense methods. We find that malicious parameters in backdoored models are inactive on the main task, resulting in a significantly large empirical loss during the machine unlearning process on clean inputs. Inspired by this, we propose MASA, a method that utilizes individual unlearning on local models to identify malicious models in FL. To improve the performance of MASA in challenging non-independent and identically distributed (non-IID) settings, we design pre-unlearning model fusion that integrates local models with knowledge learned from other datasets to mitigate the divergence in their unlearning behaviors caused by the non-IID data distributions of clients. Additionally, we propose a new anomaly detection metric with minimal hyperparameters to filter out malicious models efficiently. Extensive experiments on IID and non-IID datasets across six different attacks validate the effectiveness of MASA. To the best of our knowledge, this is the first work to leverage machine unlearning to identify malicious models in FL. Code is available at \url{https://github.com/JiiahaoXU/MASA}.
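A toy version of the scoring idea, assuming a small clean batch on the server: run a few gradient-ascent (unlearning) steps and measure how fast the clean-data loss grows; per the abstract, backdoored models show a much larger gap. The pre-unlearning fusion and the anomaly-detection metric are omitted.

```python
import torch

def unlearning_loss_gap(model, clean_loader, loss_fn, lr=0.01, steps=5):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    x, y = next(iter(clean_loader))
    with torch.no_grad():
        base = loss_fn(model(x), y).item()
    for _ in range(steps):
        opt.zero_grad()
        (-loss_fn(model(x), y)).backward()   # ascend the loss = unlearn
        opt.step()
    with torch.no_grad():
        return loss_fn(model(x), y).item() - base   # anomaly score
```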
Vector data is one of the two core data structures in geographic information science (GIS), essential for accurately storing and representing geospatial information. Shapefile, the most widely used vector data format, has become the industry standard supported by all major geographic information systems. However, processing this data typically requires specialized GIS knowledge and skills, creating a barrier for researchers from other fields and impeding interdisciplinary research in spatial data analysis. Moreover, while large language models (LLMs) have made significant advancements in natural language processing and task automation, they still face challenges in handling the complex spatial and topological relationships inherent in GIS vector data. To address these challenges, we propose ShapefileGPT, an innovative framework powered by LLMs, specifically designed to automate Shapefile tasks. ShapefileGPT utilizes a multi-agent architecture, in which the planner agent is responsible for task decomposition and supervision, while the worker agent executes the tasks. We developed a specialized function library for handling Shapefiles and provided comprehensive API documentation, enabling the worker agent to operate Shapefiles efficiently through function calling. For evaluation, we developed a benchmark dataset based on authoritative textbooks, encompassing tasks in categories such as geometric operations and spatial queries. ShapefileGPT achieved a task success rate of 95.24%, outperforming the GPT series models. In comparison to traditional LLMs, ShapefileGPT effectively handles complex vector data analysis tasks, overcoming the limitations of traditional LLMs in spatial analysis. This breakthrough opens new pathways for advancing automation and intelligence in the GIS field, with significant potential in interdisciplinary data analysis and application contexts.
Federated Learning (FL) enables multiple clients to collaboratively train a model without sharing their local data. Yet the FL system is vulnerable to well-designed Byzantine attacks, which aim to disrupt the model training process by uploading malicious model updates. Existing robust aggregation rule-based defense methods overlook the diversity of magnitude and direction across different layers of the model updates, resulting in limited robustness performance, particularly in non-IID settings. To address these challenges, we propose the Layer-Adaptive Sparsified Model Aggregation (LASA) approach, which combines pre-aggregation sparsification with layer-wise adaptive aggregation to improve robustness. Specifically, LASA includes a pre-aggregation sparsification module that sparsifies updates from each client before aggregation, reducing the impact of malicious parameters and minimizing the interference from less important parameters for the subsequent filtering process. Based on the sparsified updates, a layer-wise adaptive filter then adaptively selects benign layers using both magnitude and direction metrics across all clients for aggregation. We provide a detailed theoretical robustness analysis of LASA and a resilience analysis for FL integrated with LASA. Extensive experiments are conducted on various IID and non-IID datasets. The numerical results demonstrate the effectiveness of LASA. Code is available at \url{https://github.com/JiiahaoXU/LASA}.
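A rough sketch of sparsify-then-filter aggregation in this spirit; the top-k fraction, the robust reference, and the thresholds are illustrative choices, not LASA's exact metrics.

```python
import numpy as np

def layer_adaptive_filter(updates, k_frac=0.1, n_sigma=1.5):
    # updates: list over clients of dicts {layer_name: flat np.ndarray}.
    agg = {}
    for name in updates[0]:
        vecs = []
        for u in updates:
            v = u[name].copy()
            k = max(1, int(k_frac * v.size))        # pre-aggregation top-k
            v[np.argsort(np.abs(v))[:-k]] = 0.0     # keep largest-magnitude entries
            vecs.append(v)
        vecs = np.stack(vecs)
        ref = np.median(vecs, axis=0)               # robust per-layer reference
        cos = vecs @ ref / (np.linalg.norm(vecs, axis=1)
                            * np.linalg.norm(ref) + 1e-12)
        mag = np.linalg.norm(vecs, axis=1)
        # Keep layers whose direction and magnitude look benign.
        ok = (cos > cos.mean() - n_sigma * cos.std()) & \
             (np.abs(mag - np.median(mag)) < n_sigma * mag.std() + 1e-12)
        agg[name] = vecs[ok].mean(axis=0)            # aggregate benign layers only
    return agg
```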
The system thermal-hydraulic (STH) code SAM has been coupled to the computational fluid dynamics (CFD) code Simcenter STARCCM+ utilizing a hybrid domain overlapping coupling method in space and an explicit … The system thermal-hydraulic (STH) code SAM has been coupled to the computational fluid dynamics (CFD) code Simcenter STARCCM+ utilizing a hybrid domain overlapping coupling method in space and an explicit coupling in time. The coupling aims to extend the STH code's applicability to scenarios where local momentum and energy transfers are important yet difficult for STH standalone models to capture, such as three-dimensional (3D) mixing and thermal stratification. The coupling method's numerical stability was previously verified against multiple test cases including two closed-loop configurations, and it was validated against a double T-junction experiment with 3D scalar mixing.
Federated learning (FL) is designed to preserve data privacy during model training, where the data remains on the client side (i.e., IoT devices), and only model updates of clients are shared iteratively for collaborative learning. However, this process is vulnerable to privacy attacks and Byzantine attacks: the local model updates shared throughout the FL network will leak private information about the local training data, and they can also be maliciously crafted by Byzantine attackers to disturb the learning. In this paper, we propose a new FL scheme that guarantees rigorous privacy and simultaneously enhances system robustness against Byzantine attacks. Our approach introduces sparsification- and momentum-driven variance reduction into the client-level differential privacy (DP) mechanism, to defend against Byzantine attackers. The security design does not violate the privacy guarantee of the client-level DP mechanism; hence, our approach achieves the same client-level DP guarantee as the state-of-the-art. We conduct extensive experiments on both IID and non-IID datasets and different tasks and evaluate the performance of our approach against different Byzantine attacks by comparing it with state-of-the-art defense methods. The results of our experiments show the efficacy of our framework and demonstrate its ability to improve system robustness against Byzantine attacks while achieving a strong privacy guarantee.
Reconstructing objects from real world data and rendering them at novel views is critical to bringing realism, diversity and scale to simulation for robotics training and testing. In this work, we present NeuSim, a novel approach that estimates accurate geometry and realistic appearance from sparse in-the-wild data captured at distance and at limited viewpoints. Towards this goal, we represent the object surface as a neural signed distance function and leverage both LiDAR and camera sensor data to reconstruct smooth and accurate geometry and normals. We model the object appearance with a robust physics-inspired reflectance representation effective for in-the-wild data. Our experiments show that NeuSim has strong view synthesis performance on challenging scenarios with sparse training views. Furthermore, we showcase composing NeuSim assets into a virtual world and generating realistic multi-sensor data for evaluating self-driving perception models. The supplementary material can be found at the project website: https://waabi.ai/research/neusim/
This paper is concerned with a class of generalized quasilinear Schrödinger equations arising in plasma physics and in the study of high-power ultrashort lasers in matter. Combining a change of variables with the Schauder-Tychonoff fixed point theorem, we establish the nonexistence and existence of positive radial ground state solutions for this problem.
Learning world models can teach an agent how the world works in an unsupervised manner. Even though it can be viewed as a special case of sequence modeling, progress for scaling world models on robotic applications such as autonomous driving has been somewhat less rapid than scaling language models with Generative Pre-trained Transformers (GPT). We identify two major bottlenecks: dealing with a complex and unstructured observation space, and having a scalable generative model. Consequently, we propose Copilot4D, a novel world modeling approach that first tokenizes sensor observations with VQVAE, then predicts the future via discrete diffusion. To efficiently decode and denoise tokens in parallel, we recast Masked Generative Image Transformer as discrete diffusion and enhance it with a few simple changes, resulting in notable improvement. When applied to learning world models on point cloud observations, Copilot4D reduces the prior SOTA's Chamfer distance by more than 65% for 1s prediction, and more than 50% for 3s prediction, across NuScenes, KITTI Odometry, and Argoverse2 datasets. Our results demonstrate that discrete diffusion on tokenized agent experience can unlock the power of GPT-like unsupervised learning for robotics.
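The loop below is a toy numpy sketch of the MaskGIT-style parallel decoding that the abstract recasts as reverse discrete diffusion: start from fully masked tokens and commit the most confident predictions at each step. The dummy model, vocabulary size, and linear unmasking schedule are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, SEQ, MASK = 512, 64, -1

def dummy_model(tokens):
    """Stand-in for the learned world-model transformer: per-position logits."""
    return rng.normal(size=(SEQ, VOCAB))

def parallel_denoise(steps=8):
    tokens = np.full(SEQ, MASK)
    for t in range(1, steps + 1):
        logits = dummy_model(tokens)
        probs = np.exp(logits - logits.max(-1, keepdims=True))
        probs /= probs.sum(-1, keepdims=True)
        pred, conf = probs.argmax(-1), probs.max(-1)
        conf[tokens != MASK] = np.inf          # committed tokens stay committed
        n_keep = SEQ * t // steps              # linear schedule (cosine is common)
        keep = np.argsort(-conf)[:n_keep]      # commit the most confident slots
        new_tokens = np.full(SEQ, MASK)
        new_tokens[keep] = np.where(tokens[keep] != MASK,
                                    tokens[keep], pred[keep])
        tokens = new_tokens
    return tokens

print(parallel_denoise())                      # 64 committed token ids
```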
Vision models with high overall accuracy often exhibit systematic errors in specific scenarios, posing potentially serious safety concerns. Diagnosing bugs in vision models is gaining increased attention; however, traditional diagnostic approaches require annotation effort (e.g., the rich metadata accompanying each sample of CelebA). To address this issue, we propose a language-assisted diagnostic method that uses text instead of images to diagnose bugs in vision models, building on multi-modal models (e.g., CLIP). Our approach connects the embedding space of CLIP with the buggy vision model to be diagnosed; by utilizing a shared classifier and the cross-modal transferability of CLIP's embedding space, the text branch of CLIP becomes a proxy model for finding bugs in the buggy model. The proxy model can classify the texts paired with images. During diagnosis, a large language model (LLM) is employed to obtain task-relevant corpora, from which keywords are extracted. Descriptions constructed from templates containing these keywords serve as input text to probe errors in the proxy model. Finally, we validate the ability to diagnose existing vision models using language on the Waterbirds and CelebA datasets, identifying bugs comprehensible to human experts and uncovering not only known bugs but also previously unknown ones.
Deep network models perform excellently on In-Distribution (ID) data but can fail significantly on Out-Of-Distribution (OOD) data. While many methods focus on improving OOD generalization, little attention has been paid to evaluating the capability of models to handle OOD data. This study analyzes the problems of the conventional ID test and designs OOD test paradigms that accurately evaluate practical performance. Our analysis is based on an introduced categorization of three types of distribution shifts used to generate OOD data. Our main observations are: (1) the ID test fails both to reflect the actual performance of a single model and to compare different models under OOD data; (2) the ID test failure can be ascribed to the learned marginal and conditional spurious correlations resulting from the corresponding distribution shifts. Based on this, we propose novel OOD test paradigms to evaluate the generalization capacity of models to unseen data, and discuss how to use OOD test results to find model bugs and guide model debugging.
3D object detection is a key component of many robotic applications such as self-driving vehicles. While many approaches rely on expensive 3D sensors such as LiDAR to produce accurate 3D estimates, methods that exploit stereo cameras have recently shown promising results at a lower cost. Existing approaches tackle this problem in two steps: first depth estimation from stereo images is performed to produce a pseudo LiDAR point cloud, which is then used as input to a 3D object detector. However, this approach is suboptimal due to the representation mismatch, as the two tasks are optimized in two different metric spaces. In this paper we propose a model that unifies these two tasks and performs them in the same metric space. Specifically, we directly construct a pseudo LiDAR feature volume (PLUME) in 3D space, which is then used to solve both depth estimation and object detection tasks. Our approach achieves state-of-the-art performance with much faster inference times when compared to existing methods on the challenging KITTI benchmark [1].
Federated learning (FL) enables distributed agents to collaboratively learn a centralized model without sharing their raw data with each other. However, data locality does not provide sufficient privacy protection, and it is desirable to facilitate FL with rigorous differential privacy (DP) guarantees. Existing DP mechanisms would introduce random noise with magnitude proportional to the model size, which can be quite large in deep neural networks. In this paper, we propose a new FL framework with sparsification-amplified privacy. Our approach integrates random sparsification with gradient perturbation on each agent to amplify the privacy guarantee. Since sparsification would increase the number of communication rounds required to achieve a certain target accuracy, which is unfavorable for the DP guarantee, we further introduce acceleration techniques to help reduce the privacy cost. We rigorously analyze the convergence of our approach and utilize Rényi DP to tightly account for the end-to-end DP guarantee. Extensive experiments on benchmark datasets validate that our approach outperforms previous differentially-private FL approaches in both privacy guarantee and communication efficiency.
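A minimal numpy sketch of the per-agent mechanism, assuming a flattened gradient: random-k sparsification with unbiased rescaling, clipping to bound sensitivity, then Gaussian noise. The paper's exact estimator and acceleration techniques are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_sparsified_update(grad, k, clip, sigma):
    """One agent's private update: random sparsification, clipping, Gaussian
    noise. An illustrative sketch, not the paper's exact mechanism."""
    mask = np.zeros_like(grad)
    idx = rng.choice(grad.size, size=k, replace=False)  # random-k sparsifier
    mask[idx] = grad.size / k                           # unbiased rescaling
    g = grad * mask
    g *= min(1.0, clip / (np.linalg.norm(g) + 1e-12))   # clip: bound sensitivity
    return g + rng.normal(scale=sigma * clip, size=g.shape)  # Gaussian mechanism

g = rng.normal(size=1000)
print(dp_sparsified_update(g, k=100, clip=1.0, sigma=1.1)[:5])
```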
One of the critical pieces of the self-driving puzzle is understanding the surroundings of a self-driving vehicle (SDV) and predicting how these surroundings will change in the near future. To address this task we propose MultiXNet, an end-to-end approach for detection and motion prediction based directly on lidar sensor data. This approach builds on prior work by handling multiple classes of traffic actors, adding a jointly trained second-stage trajectory refinement step, and producing a multimodal probability distribution over future actor motion that includes both multiple discrete traffic behaviors and calibrated continuous position uncertainties. The method was evaluated on large-scale, real-world data collected by a fleet of SDVs in several cities, with the results indicating that it outperforms existing state-of-the-art approaches.
Advanced nuclear reactors often exhibit complex thermal-fluid phenomena during transients. To accurately capture such phenomena, a coarse-mesh three-dimensional (3-D) modeling capability is desired for modern nuclear system codes. In the coarse-mesh 3-D modeling of advanced-reactor transients that involve flow and heat transfer, accurately predicting the turbulent viscosity is a challenging task that requires an accurate and computationally efficient model to capture the unresolved fine-scale turbulence. In this paper, we propose a data-driven coarse-mesh turbulence model based on local flow features for the transient analysis of thermal mixing and stratification in a sodium-cooled fast reactor. The model has a coarse-mesh setup to ensure computational efficiency, while it is trained by fine-mesh computational fluid dynamics (CFD) data to ensure accuracy. A novel neural network architecture, combining a densely connected convolutional network and a long short-term memory (LSTM) network, is developed that can efficiently learn from the spatiotemporal CFD transient simulation results. The neural network model was trained and optimized on a loss-of-flow transient and demonstrated high accuracy in predicting the turbulent viscosity field during the whole transient. The trained model's generalization capability was also investigated on two other transients with different inlet conditions. The study demonstrates the potential of applying the proposed data-driven approach to support the coarse-mesh multi-dimensional modeling of advanced reactors.
The System Analysis Module (SAM), developed and maintained by Argonne National Laboratory, is designed to provide whole-plant transient safety analysis capabilities for a number of advanced non–light water reactors, including sodium-cooled fast reactor (SFR), lead-cooled fast reactor (LFR), and molten salt reactor (MSR)/fluoride-salt-cooled high-temperature reactor (FHR) designs. SAM is primarily constructed as a systems-level analysis tool, with the potential to incorporate reduced order models from three-dimensional computational fluid dynamics (CFD) simulations to improve characterization of complex, multidimensional physics. It is recognized that the computational expense associated with CFD can be intractable for various engineering analyses, such as uncertainty quantification, inference, and design optimization. This paper explores the reducibility of a SAM model using recent advances in randomized linear algebra techniques, which attempt to find recurring patterns in the various realizations generated by a model after randomly perturbing all its input parameters. The reduction is described in terms of fewer degrees of freedom (DOFs), referred to as the active DOFs, for the model variables such as input model parameters and model responses. The results indicate that there is significant room for additional reduction that may be leveraged for additional computational gains when employing SAM for engineering-intensive analyses that require repeated model executions. Different from physics-based reduction approaches, the proposed approach allows one to estimate upper bounds on the reduction errors, which are rigorously developed in this work. Finally, different methods for surrogate model construction, such as regression and neural network–based training, are employed to correlate the input and output active DOFs, which are related back to the original variables using matrix-based linear transformations.
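A small numpy sketch of the randomized range-finder idea behind this kind of reduction: sketch the snapshot matrix of randomly perturbed model realizations with a Gaussian test matrix and orthonormalize to expose the active DOFs. The synthetic low-rank data is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def randomized_basis(snapshots, r, p=10):
    """Randomized range finder: sketch the snapshot matrix with a Gaussian
    test matrix and orthonormalize to obtain ~r active degrees of freedom."""
    omega = rng.normal(size=(snapshots.shape[1], r + p))
    Q, _ = np.linalg.qr(snapshots @ omega)
    return Q                            # orthonormal basis with r + p columns

# Synthetic low-rank "model realizations": 1000 outputs x 200 perturbations.
X = rng.normal(size=(1000, 30)) @ rng.normal(size=(30, 200))
Q = randomized_basis(X, r=30)
err = np.linalg.norm(X - Q @ (Q.T @ X)) / np.linalg.norm(X)
print(Q.shape, err)                     # tiny residual: ~30 active DOFs suffice
```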
Federated Learning rests on the notion of training a global model distributedly on various devices. Under this setting, users' devices perform computations on their own data and then share the results with the cloud server to update the global model. A fundamental issue in such systems is to effectively incentivize user participation. The users suffer from privacy leakage of their local data during the federated model training process. Without well-designed incentives, self-interested users will be unwilling to participate in federated learning tasks and contribute their private data. To bridge this gap, in this paper, we adopt game theory to design an effective incentive mechanism, which selects users that are most likely to provide reliable data and compensates them for the costs of privacy leakage. We formulate our problem as a two-stage Stackelberg game and solve the game's equilibrium. The effectiveness of the proposed mechanism is demonstrated by extensive simulations.
Federated learning has received significant interest recently due to its capability of learning a shared machine learning model across smart devices without accessing their private data in the era of the Internet of things. This paper jointly considers two critical issues of federated learning -- data privacy and communication efficiency -- and develops a communication-efficient and differentially-private federated learning scheme called CPFed. The main challenge in addressing both issues together lies in the fact that data compression techniques often lead to an increased number of training iterations required to achieve a desired training loss due to compression errors, while the differential privacy guarantee usually deteriorates with respect to the number of training iterations. To reconcile this dilemma, we propose to use sparsified privacy-masking, which first adds random noise to the model update and then applies an unbiased random sparsifier before uploading the model update at each device. By using sparsified privacy-masking, our proposed CPFed scheme can achieve high communication efficiency and a strong data privacy guarantee at the same time while preserving model accuracy. We provide an explicit end-to-end privacy guarantee of CPFed using zero-concentrated differential privacy and give its theoretical convergence rates for both convex and non-convex models. Through extensive numerical experiments on real-world datasets, we demonstrate the effectiveness and efficiency of our proposed method.
We tackle the problem of joint perception and motion forecasting in the context of self-driving vehicles. Towards this goal we propose PnPNet, an end-to-end model that takes as input sequential sensor data, and outputs at each time step object tracks and their future trajectories. The key component is a novel tracking module that generates object tracks online from detections and exploits trajectory level features for motion forecasting. Specifically, the object tracks get updated at each time step by solving both the data association problem and the trajectory estimation problem. Importantly, the whole model is end-to-end trainable and benefits from joint optimization of all tasks. We validate PnPNet on two large-scale driving datasets, and show significant improvements over the state-of-the-art with better occlusion recovery and more accurate future prediction.
With the proliferation of smart devices having built-in sensors, Internet connectivity, and programmable computation capability in the era of Internet of things (IoT), tremendous data is being generated at the network edge. Federated learning is capable of analyzing the large amount of data from a distributed set of smart devices without requiring them to upload their data to a central place. However, the commonly-used federated learning algorithm is based on stochastic gradient descent (SGD) and not suitable for resource-constrained IoT environments due to its high communication resource requirement. Moreover, the privacy of sensitive data on smart devices has become a key concern and needs to be protected rigorously. This paper proposes a novel federated learning framework called DP-PASGD for training a machine learning model efficiently from the data stored across resource-constrained smart devices in IoT while guaranteeing differential privacy. The optimal schematic design of DP-PASGD that maximizes the learning performance while satisfying the limits on resource cost and privacy loss is formulated as an optimization problem, and an approximate solution method based on the convergence analysis of DP-PASGD is developed to solve the optimization problem efficiently. Numerical results based on real-world datasets verify the effectiveness of the proposed DP-PASGD scheme.
SAM, a plant-level system analysis tool for advanced reactors (SFR, LFR, MSR/FHR), is under development at Argonne. As a modern system code, SAM aims to improve the prediction of 3D flows relevant to reactor safety during transient conditions. One approach to fulfilling this goal is to implement turbulent flow modeling in SAM by establishing an embedded surrogate model for the Reynolds stress/turbulent viscosity based on machine learning techniques. The proposed approach rests on the assumption that there exists a functional dependency between local flow features and the local turbulent viscosity or Reynolds stress. Very few studies have been performed to validate this assumption. This paper documents a case study that examines the assumption in a scenario representative of potential reactor applications. The work does not aim to validate the assumption theoretically, but rather practically, within the limited application domain. From a methodological point of view, the approach used in this paper can be classified as a so-called Type I machine learning (ML) approach, in which a scale-separation assumption is made, claiming that conservation equations and closure relations are scale-separable, so that the turbulence models are local rather than global. The CFD case studied in this work is a 3D transient thermal-stratification tank flow problem simulated using a Reynolds-averaged Navier-Stokes turbulence model in the STARCCM+ code. Flow information at all geometric points and all timesteps is collected as training and test data.
We propose a motion forecasting model that exploits a novel structured map representation as well as actor-map interactions. Instead of encoding vectorized maps as raster images, we construct a lane graph from raw map data to explicitly preserve the map structure. To capture the complex topology and long range dependencies of the lane graph, we propose LaneGCN which extends graph convolutions with multiple adjacency matrices and along-lane dilation. To capture the complex interactions between actors and maps, we exploit a fusion network consisting of four types of interactions: actor-to-lane, lane-to-lane, lane-to-actor and actor-to-actor. Powered by LaneGCN and actor-map interactions, our model is able to predict accurate and realistic multi-modal trajectories. Our approach significantly outperforms the state-of-the-art on the large-scale Argoverse motion forecasting benchmark.
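A toy numpy sketch of a graph convolution with multiple adjacency matrices, the core idea behind LaneGCN's along-lane dilation; the chain-graph adjacencies, weight shapes, and activation are illustrative assumptions rather than the paper's exact layer.

```python
import numpy as np

rng = np.random.default_rng(0)

def multi_adj_gcn(X, adjs, weights):
    """One LaneGCN-style layer: aggregate node features over several relation-
    specific adjacency matrices (e.g., successor lanes at multiple dilations),
    each with its own weight matrix, plus a self term."""
    out = X @ weights[0]                   # self connection
    for A, W in zip(adjs, weights[1:]):
        out = out + A @ X @ W              # relation-specific aggregation
    return np.maximum(out, 0)              # ReLU

n, d = 6, 8
X = rng.normal(size=(n, d))
suc = np.roll(np.eye(n), 1, axis=1)        # toy successor relation on a chain
adjs = [suc, suc @ suc]                    # along-lane dilation: 1-hop and 2-hop
weights = [rng.normal(size=(d, d)) for _ in range(len(adjs) + 1)]
print(multi_adj_gcn(X, adjs, weights).shape)   # (6, 8)
```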
Graph classification has practical applications in diverse fields. Recent studies show that graph-based machine learning models are especially vulnerable to adversarial perturbations due to the non-i.i.d. nature of graph data. By adding or deleting a small number of edges in the graph, adversaries could greatly change the graph label predicted by a graph classification model. In this work, we propose to build a smoothed graph classification model with a certified robustness guarantee. We prove that the resulting graph classification model outputs the same prediction for a graph under $l_0$-bounded adversarial perturbation. We also evaluate the effectiveness of our approach on a graph convolutional network (GCN) based multi-class graph classification model.
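A minimal numpy sketch of the smoothing construction: classify many randomly perturbed copies of the graph and return the majority vote. The edge-flip smoothing distribution and the toy base classifier are illustrative, and the certification step that bounds the vote gap is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def smoothed_predict(classify, adj, n_samples=500, flip_p=0.02):
    """Majority vote of a base graph classifier over randomly perturbed copies
    of the graph; `classify` maps a 0/1 adjacency matrix to a label."""
    rows, cols = np.triu_indices_from(adj, k=1)
    votes = {}
    for _ in range(n_samples):
        noisy = adj.copy()
        flips = rng.random(rows.size) < flip_p       # sample edges to toggle
        noisy[rows[flips], cols[flips]] ^= 1
        noisy[cols[flips], rows[flips]] ^= 1         # keep the graph undirected
        y = classify(noisy)
        votes[y] = votes.get(y, 0) + 1
    return max(votes, key=votes.get), votes

adj = (rng.random((12, 12)) < 0.3).astype(int)
adj = np.triu(adj, 1); adj = adj + adj.T             # toy undirected graph
label, votes = smoothed_predict(lambda a: int(a.sum() // 2 > 18), adj)
print(label, votes)
```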
Federated learning is a machine learning setting where a set of edge devices collaboratively train a model under the orchestration of a central server without sharing their local data. At each communication round of federated learning, edge devices perform multiple steps of stochastic gradient descent with their local data and then upload the computation results to the server for model update. During this process, the challenge of privacy leakage arises due to the information exchange between edge devices and the server when the server is not fully trusted. While some previous privacy-preserving mechanisms could readily be used for federated learning, they usually come at a high cost to the convergence of the algorithm and the utility of the learned model. In this paper, we develop a federated learning approach that addresses the privacy challenge without much degradation on model utility through a combination of local gradient perturbation, secure aggregation, and zero-concentrated differential privacy (zCDP). We provide a tight end-to-end privacy guarantee of our approach and analyze its theoretical convergence rates. Through extensive numerical experiments on real-world datasets, we demonstrate the effectiveness of our proposed method and show its superior trade-off between privacy and model utility.
In this paper, we develop a machine learning-based Bayesian approach to inversely quantify and reduce the uncertainties of the two-fluid model-based multiphase computational fluid dynamics (MCFD) for bubbly flow simulations. The proposed approach is supported by high-resolution two-phase flow measurement techniques, including double-sensor conductivity probes, high-speed imaging, and particle image velocimetry. Local distributions of key physical quantities of interest (QoIs), including void fraction and phasic velocities, are obtained to support the modular Bayesian inference. In the process, the epistemic uncertainties of the closure relations are inversely quantified, while the aleatory uncertainties from stochastic fluctuations of the system are evaluated based on experimental uncertainty analysis. The combined uncertainties are then propagated through the MCFD solver to obtain uncertainties of the QoIs, based on which probability boxes are constructed for validation. The proposed approach relies on three machine learning methods: feedforward neural networks and principal component analysis for surrogate modeling, and Gaussian processes for model-form uncertainty modeling. The whole process is implemented within the framework of the open-source deep learning library PyTorch with graphics processing unit (GPU) acceleration, thus ensuring the efficiency of the computation. The results demonstrate that, with the support of high-resolution data, the uncertainty of MCFD simulations can be significantly reduced.
Many modern robotics systems employ LiDAR as their main sensing modality due to its geometrical richness. Rolling shutter LiDARs are particularly common, in which an array of lasers scans the scene from a rotating base. Points are emitted as a stream of packets, each covering a sector of the 360° coverage. Modern perception algorithms wait for the full sweep to be built before processing the data, which introduces an additional latency. For typical 10Hz LiDARs this will be 100ms. As a consequence, by the time an output is produced, it no longer accurately reflects the state of the world. This poses a challenge, as robotics applications require minimal reaction times, such that maneuvers can be quickly planned in the event of a safety-critical situation. In this paper we propose StrObe, a novel approach that minimizes latency by ingesting LiDAR packets and emitting a stream of detections without waiting for the full sweep to be built. StrObe reuses computations from previous packets and iteratively updates a latent spatial representation of the scene, which acts as a memory, as new evidence comes in, resulting in accurate low-latency perception. We demonstrate the effectiveness of our approach on a large scale real-world dataset, showing that StrObe far outperforms the state-of-the-art when latency is taken into account, and matches the performance in the traditional setting.
In this paper we propose to exploit multiple related tasks for accurate multi-sensor 3D object detection. Towards this goal we present an end-to-end learnable architecture that reasons about 2D and 3D object detection as well as ground estimation and depth completion. Our experiments show that all these tasks are complementary and help the network learn better representations by fusing information at various levels. Importantly, our approach leads the KITTI benchmark on 2D, 3D and bird's eye view (BEV) object detection, while running in real time.
Alternating Direction Method of Multipliers (ADMM) is a widely used tool for machine learning in distributed settings, where a machine learning model is trained over distributed data sources through an interactive process of local computation and message passing. Such an iterative process could cause privacy concerns of data owners. The goal of this paper is to provide differential privacy for ADMM-based distributed machine learning. Prior approaches on differentially private ADMM exhibit low utility under high privacy guarantee and often assume the objective functions of the learning problems to be smooth and strongly convex. To address these concerns, we propose a novel differentially private ADMM-based distributed learning algorithm called DP-ADMM, which combines an approximate augmented Lagrangian function with time-varying Gaussian noise addition in the iterative process to achieve higher utility for general objective functions under the same differential privacy guarantee. We also apply the moments accountant method to bound the end-to-end privacy loss. The theoretical analysis shows that DP-ADMM can be applied to a wider class of distributed learning problems, is provably convergent, and offers an explicit utility-privacy tradeoff. To our knowledge, this is the first paper to provide explicit convergence and utility properties for differentially private ADMM-based distributed learning algorithms. The evaluation results demonstrate that our approach can achieve good convergence and model accuracy under high end-to-end differential privacy guarantee.
In this paper we tackle the problem of scene flow estimation in the context of self-driving. We leverage deep learning techniques as well as strong priors, since in our application domain the motion of the scene can be decomposed into the motion of the robot and the 3D motion of the actors in the scene. We formulate the problem as energy minimization in a deep structured model, which can be solved efficiently on the GPU by unrolling a Gauss-Newton solver. Our experiments on the challenging KITTI scene flow dataset show that we outperform the state-of-the-art by a very large margin, while being 800 times faster.
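A compact numpy sketch of an unrolled Gauss-Newton solver on a toy nonlinear least-squares energy; the fixed, small number of linearize-and-solve iterations is what makes this kind of inference fast and GPU-friendly. The damping term and the toy problem are illustrative.

```python
import numpy as np

def gauss_newton(residual, jacobian, x0, steps=10):
    """Unrolled Gauss-Newton: a fixed number of linearize-and-solve steps."""
    x = x0.astype(float)
    for _ in range(steps):
        r, J = residual(x), jacobian(x)
        # Solve the normal equations of the linearized problem (small damping).
        x = x - np.linalg.solve(J.T @ J + 1e-6 * np.eye(x.size), J.T @ r)
    return x

# Toy example: fit y = exp(a*t) + b by minimizing squared residuals.
t = np.linspace(0, 1, 50)
y = np.exp(0.7 * t) + 0.3
res = lambda x: np.exp(x[0] * t) + x[1] - y
jac = lambda x: np.stack([t * np.exp(x[0] * t), np.ones_like(t)], axis=1)
print(gauss_newton(res, jac, np.array([0.0, 0.0])))   # approx [0.7, 0.3]
```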
Electronic tongues based on potentiometry offer the prospect of rapid and continuous chemical fingerprinting for portable and remote systems. This contribution presents a technology platform including a miniaturized electronic tongue based on electropolymerized ion-sensitive films, microcontroller-based data acquisition, a smartphone interface and a cloud computing back-end for data storage and deployment of machine learning models. The sensor array records a series of differential voltages without the use of a true reference electrode, and the resulting time-series potentiometry data is used to train supervised machine learning algorithms. For trained systems, inferencing tasks such as the classification of liquids are realized in less than one minute, including data acquisition at the edge and inference using the cloud-deployed machine learning model. A preliminary demonstration of the complete electronic tongue technology stack is reported for the classification of beverages and mineral water.
In this paper, we propose a unified panoptic segmentation network (UPSNet) for tackling the newly proposed panoptic segmentation task. On top of a single backbone residual network, we first design a deformable convolution based semantic segmentation head and a Mask R-CNN style instance segmentation head which solve these two subtasks simultaneously. More importantly, we introduce a parameter-free panoptic head which solves the panoptic segmentation via pixel-wise classification. It first leverages the logits from the previous two heads and then innovatively expands the representation for enabling prediction of an extra unknown class which helps better resolve the conflicts between semantic and instance segmentation. Additionally, it handles the challenge caused by the varying number of instances and permits back propagation to the bottom modules in an end-to-end manner. Extensive experimental results on Cityscapes, COCO and our internal dataset demonstrate that our UPSNet achieves state-of-the-art performance with much faster inference. Code has been made available at: https://github.com/uber-research/UPSNet
Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers - 8× deeper than VGG nets [40] but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to the ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
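A minimal PyTorch sketch of the basic residual block: the stacked layers learn a residual function F(x) and the block outputs F(x) + x through an identity shortcut (the equal-channel, stride-1 case; projection shortcuts for dimension changes are omitted).

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block: layers learn F(x), the output is F(x) + x, so
    very deep stacks only need to learn corrections to the identity."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)          # identity shortcut

x = torch.randn(2, 64, 32, 32)
print(ResidualBlock(64)(x).shape)          # torch.Size([2, 64, 32, 32])
```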
In this paper we show that High-Definition (HD) maps provide strong priors that can boost the performance and robustness of modern 3D object detectors. Towards this goal, we design a single stage detector that extracts geometric and semantic features from the HD maps. As maps might not be available everywhere, we also propose a map prediction module that estimates the map on the fly from raw LiDAR data. We conduct extensive experiments on KITTI as well as a large-scale 3D detection benchmark containing 1 million frames, and show that the proposed map-aware detector consistently outperforms the state-of-the-art in both mapped and un-mapped scenarios. Importantly, the whole framework runs at 20 frames per second.
We quantitatively investigate how machine learning models leak information about the individual data records on which they were trained. We focus on the basic membership inference attack: given a data record and black-box access to a model, determine if the record was in the model's training dataset. To perform membership inference against a target model, we make adversarial use of machine learning and train our own inference model to recognize differences in the target model's predictions on the inputs that it trained on versus the inputs that it did not train on. We empirically evaluate our inference techniques on classification models trained by commercial "machine learning as a service" providers such as Google and Amazon. Using realistic datasets and classification tasks, including a hospital discharge dataset whose membership is sensitive from the privacy perspective, we show that these models can be vulnerable to membership inference attacks. We then investigate the factors that influence this leakage and evaluate mitigation strategies.
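A toy sketch of the attack-model training step, using synthetic stand-ins for shadow-model outputs; in the real attack the confidence vectors come from shadow models trained to mimic the target, and the Dirichlet parameters below are purely illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-ins for shadow-model outputs: confidence vectors for records
# the shadow models trained on (members) tend to be peakier than for held-out
# records. The Dirichlet parameters are illustrative, not from any real model.
n, k = 2000, 10
member = rng.dirichlet(np.full(k, 0.3), n)       # peaky: seen during training
non_member = rng.dirichlet(np.full(k, 1.0), n)   # flatter: unseen data
X = np.vstack([member, non_member])
X = np.sort(X, axis=1)[:, ::-1]                  # sort: label-order invariant
y = np.repeat([1, 0], n)

attack = LogisticRegression(max_iter=1000).fit(X, y)

# Attack time: query the target model for a record, feed its sorted confidence
# vector to the attack model, and read off the membership probability.
record = np.sort(rng.dirichlet(np.full(k, 0.3)))[::-1]
print(attack.predict_proba(record.reshape(1, -1))[0, 1])
```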
Feature pyramids are a basic component in recognition systems for detecting objects at different scales. But pyramid representations have been avoided in recent object detectors that are based on deep convolutional networks, partially because they are slow to compute and memory intensive. In this paper, we exploit the inherent multi-scale, pyramidal hierarchy of deep convolutional networks to construct feature pyramids with marginal extra cost. A top-down architecture with lateral connections is developed for building high-level semantic feature maps at all scales. This architecture, called a Feature Pyramid Network (FPN), shows significant improvement as a generic feature extractor in several applications. Using a basic Faster R-CNN system, our method achieves state-of-the-art single-model results on the COCO detection benchmark without bells and whistles, surpassing all existing single-model entries including those from the COCO 2016 challenge winners. In addition, our method can run at 5 FPS on a GPU and thus is a practical and accurate solution to multi-scale object detection. Code will be made publicly available.
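A minimal PyTorch sketch of the top-down pathway with lateral connections: each backbone stage is projected to a common width by a 1x1 convolution, merged with the upsampled coarser map, and smoothed by a 3x3 convolution. The channel counts and stage sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPNTopDown(nn.Module):
    """Top-down pathway with lateral connections over backbone stages."""
    def __init__(self, in_channels, width=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, width, 1) for c in in_channels)
        self.smooth = nn.ModuleList(nn.Conv2d(width, width, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, feats):                  # feats ordered fine -> coarse
        laterals = [l(f) for l, f in zip(self.lateral, feats)]
        for i in range(len(laterals) - 2, -1, -1):
            # Merge: upsample the coarser map and add it to the lateral map.
            laterals[i] = laterals[i] + F.interpolate(
                laterals[i + 1], size=laterals[i].shape[-2:], mode="nearest")
        return [s(x) for s, x in zip(self.smooth, laterals)]

c2, c3, c4 = (torch.randn(1, c, s, s) for c, s in [(256, 64), (512, 32), (1024, 16)])
for p in FPNTopDown([256, 512, 1024])((c2, c3, c4)):
    print(p.shape)                             # all maps share 256 channels
```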
Machine learning techniques based on neural networks are achieving remarkable results in a wide variety of domains. Often, the training of models requires large, representative datasets, which may be crowdsourced and contain sensitive information. The models should not expose private information in these datasets. Addressing this goal, we develop new algorithmic techniques for learning and a refined analysis of privacy costs within the framework of differential privacy. Our implementation and experiments demonstrate that we can train deep neural networks with non-convex objectives, under a modest privacy budget, and at a manageable cost in software complexity, training efficiency, and model quality.
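A numpy sketch of the core differentially private SGD step for logistic regression: clip each per-example gradient, sum, add calibrated Gaussian noise, and average; the privacy-cost accounting that tracks the cumulative budget is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def dpsgd_step(w, X, y, lr=0.1, clip=1.0, sigma=1.0):
    """One DP-SGD step: per-example gradients are clipped to L2 norm `clip`
    (bounding each example's influence), summed, and perturbed with noise."""
    summed = np.zeros_like(w)
    for xi, yi in zip(X, y):
        p = 1 / (1 + np.exp(-xi @ w))
        g = (p - yi) * xi                                   # per-example gradient
        summed += g * min(1.0, clip / (np.linalg.norm(g) + 1e-12))
    noisy = summed + rng.normal(scale=sigma * clip, size=w.shape)
    return w - lr * noisy / len(X)

w = np.zeros(5)
X = rng.normal(size=(32, 5))
y = (rng.random(32) < 0.5).astype(float)
print(dpsgd_step(w, X, y))
```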
We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. The method is straightforward to implement, is computationally efficient, has low memory requirements, is invariant to diagonal rescaling of the gradients, and is well suited for problems that are large in terms of data and/or parameters. The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The hyper-parameters have intuitive interpretations and typically require little tuning. Some connections to related algorithms, on which Adam was inspired, are discussed. We also analyze the theoretical convergence properties of the algorithm and provide a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework. Empirical results demonstrate that Adam works well in practice and compares favorably to other stochastic optimization methods. Finally, we discuss AdaMax, a variant of Adam based on the infinity norm.
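The update rule is compact enough to state directly; below is a numpy sketch of the algorithm as described: exponential moving averages of the gradient and its elementwise square, bias-corrected, driving a per-coordinate rescaled step.

```python
import numpy as np

def adam(grad_fn, x, steps=300, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    """Adam: moving averages of the gradient (m) and its square (v),
    bias-corrected, give a per-coordinate adaptive step size."""
    m = np.zeros_like(x)
    v = np.zeros_like(x)
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        m_hat = m / (1 - b1 ** t)              # bias correction
        v_hat = v / (1 - b2 ** t)
        x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
    return x

# Minimize the quadratic f(x) = ||x||^2: iterates converge toward the origin.
print(adam(lambda x: 2 * x, np.array([3.0, -2.0])))
```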
We address the problem of real-time 3D object detection from point clouds in the context of autonomous driving. Speed is critical as detection is a necessary component for safety. Existing approaches are, however, expensive in computation due to high dimensionality of point clouds. We utilize the 3D data more efficiently by representing the scene from the Bird's Eye View (BEV), and propose PIXOR, a proposal-free, single-stage detector that outputs oriented 3D object estimates decoded from pixel-wise neural network predictions. The input representation, network architecture, and model optimization are specially designed to balance high accuracy and real-time efficiency. We validate PIXOR on two datasets: the KITTI BEV object detection benchmark, and a large-scale 3D vehicle detection benchmark. In both datasets we show that the proposed detector surpasses other state-of-the-art methods notably in terms of Average Precision (AP), while still running at 10 FPS.
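A small numpy sketch of the BEV input representation: rasterize the point cloud into an occupancy grid over a fixed range. The ranges, resolution, and single occupancy channel are illustrative (height and intensity channels are common additions).

```python
import numpy as np

def points_to_bev(points, x_range=(0, 70), y_range=(-40, 40), res=0.1):
    """Rasterize an (N, 3) LiDAR point cloud into a bird's-eye-view occupancy
    grid, the 2D input representation used by single-stage BEV detectors."""
    mask = ((points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) &
            (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1]))
    pts = points[mask]
    ix = ((pts[:, 0] - x_range[0]) / res).astype(int)   # forward -> rows
    iy = ((pts[:, 1] - y_range[0]) / res).astype(int)   # lateral -> cols
    H = int((x_range[1] - x_range[0]) / res)
    W = int((y_range[1] - y_range[0]) / res)
    bev = np.zeros((H, W), dtype=np.float32)
    bev[ix, iy] = 1.0                                   # occupancy channel
    return bev

pts = np.random.default_rng(0).uniform([-10, -50, -2], [80, 50, 1], size=(5000, 3))
print(points_to_bev(pts).shape)                         # (700, 800)
```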
This paper aims at high-accuracy 3D object detection in the autonomous driving scenario. We propose Multi-View 3D networks (MV3D), a sensory-fusion framework that takes both LIDAR point clouds and RGB images as input and predicts oriented 3D bounding boxes. We encode the sparse 3D point cloud with a compact multi-view representation. The network is composed of two subnetworks: one for 3D object proposal generation and another for multi-view feature fusion. The proposal network generates 3D candidate boxes efficiently from the bird's eye view representation of the 3D point cloud. We design a deep fusion scheme to combine region-wise features from multiple views and enable interactions between intermediate layers of different paths. Experiments on the challenging KITTI benchmark show that our approach outperforms the state-of-the-art by around 25% and 30% AP on the tasks of 3D localization and 3D detection. In addition, for 2D detection, our approach obtains 14.9% higher AP than the state-of-the-art on the hard data among the LIDAR-based methods.
Accurate detection of objects in 3D point clouds is a central problem in many applications, such as autonomous navigation, housekeeping robots, and augmented/virtual reality. To interface a highly sparse LiDAR point cloud with a region proposal network (RPN), most existing efforts have focused on hand-crafted feature representations, for example, a bird's eye view projection. In this work, we remove the need for manual feature engineering for 3D point clouds and propose VoxelNet, a generic 3D detection network that unifies feature extraction and bounding box prediction into a single-stage, end-to-end trainable deep network. Specifically, VoxelNet divides a point cloud into equally spaced 3D voxels and transforms a group of points within each voxel into a unified feature representation through the newly introduced voxel feature encoding (VFE) layer. In this way, the point cloud is encoded as a descriptive volumetric representation, which is then connected to an RPN to generate detections. Experiments on the KITTI car detection benchmark show that VoxelNet outperforms the state-of-the-art LiDAR-based 3D detection methods by a large margin. Furthermore, our network learns an effective discriminative representation of objects with various geometries, leading to encouraging results in 3D detection of pedestrians and cyclists, based on only LiDAR.
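The VFE layer described here combines a shared pointwise MLP, a per-voxel max-pool, and concatenation of the pooled aggregate back onto each point. A simplified sketch, assuming points are already grouped per voxel and ignoring the masking of empty point slots; the tensor layout and widths are illustrative assumptions.

```python
import torch
import torch.nn as nn

class VFELayer(nn.Module):
    """Sketch of a voxel feature encoding layer: shared pointwise MLP,
    per-voxel max-pool, and concatenation of the voxel aggregate back
    onto every point. Input is pre-grouped as (num_voxels, max_points, in_dim)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, out_dim // 2), nn.ReLU())

    def forward(self, points):                      # (V, T, in_dim)
        pointwise = self.mlp(points)                # (V, T, out_dim/2)
        aggregated = pointwise.max(dim=1, keepdim=True).values
        tiled = aggregated.expand_as(pointwise)     # broadcast voxel feature
        return torch.cat([pointwise, tiled], dim=-1)   # (V, T, out_dim)

voxels = torch.randn(100, 35, 7)   # e.g. xyz, reflectance, offsets-to-mean
feats = VFELayer(7, 128)(voxels)   # (100, 35, 128)
```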
In this paper, we propose a neural motion planner for learning to drive autonomously in complex urban scenarios that include traffic-light handling, yielding, and interactions with multiple road-users. Towards this goal, we design a holistic model that takes as input raw LIDAR data and an HD map and produces interpretable intermediate representations in the form of 3D detections and their future trajectories, as well as a cost volume defining the goodness of each position that the self-driving car can take within the planning horizon. We then sample a set of diverse, physically possible trajectories and choose the one with the minimum learned cost. Importantly, our cost volume is able to naturally capture multi-modality. We demonstrate the effectiveness of our approach on real-world driving data captured in several cities in North America. Our experiments show that the learned cost volume leads to safer planning than all the baselines.
Object detection in point clouds is an important aspect of many robotics applications such as autonomous driving. In this paper, we consider the problem of encoding a point cloud into a format appropriate for a downstream detection pipeline. Recent literature suggests two types of encoders; fixed encoders tend to be fast but sacrifice accuracy, while encoders that are learned from data are more accurate, but slower. In this work, we propose PointPillars, a novel encoder which utilizes PointNets to learn a representation of point clouds organized in vertical columns (pillars). While the encoded features can be used with any standard 2D convolutional detection architecture, we further propose a lean downstream network. Extensive experimentation shows that PointPillars outperforms previous encoders with respect to both speed and accuracy by a large margin. Despite only using lidar, our full detection pipeline significantly outperforms the state of the art, even among fusion methods, with respect to both the 3D and bird's eye view KITTI benchmarks. This detection performance is achieved while running at 62 Hz: a 2-4 fold runtime improvement. A faster version of our method matches the state of the art at 105 Hz. These benchmarks suggest that PointPillars is an appropriate encoding for object detection in point clouds.
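After the PointNet-style encoder, the learned per-pillar features must be scattered back onto a dense 2D canvas before the convolutional detection head can run. A minimal sketch of that scatter step; the function name, tensor shapes, and the absence of batching and empty-pillar masking are all simplifying assumptions.

```python
import torch

def scatter_pillars(pillar_feats, coords, H, W):
    """Scatter per-pillar feature vectors onto a dense BEV canvas.
    pillar_feats: (P, C) features for P non-empty pillars;
    coords: (P, 2) integer (row, col) grid indices for each pillar."""
    C = pillar_feats.shape[1]
    canvas = pillar_feats.new_zeros(C, H * W)
    flat = coords[:, 0] * W + coords[:, 1]     # flatten 2D index
    canvas[:, flat] = pillar_feats.t()         # write features in place
    return canvas.view(C, H, W)                # ready for a 2D CNN head

feats = torch.randn(1200, 64)                  # 1200 non-empty pillars
coords = torch.randint(0, 400, (1200, 2))      # their grid positions
bev = scatter_pillars(feats, coords, 400, 400) # (64, 400, 400)
```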
In this work, we study 3D object detection from RGB-D data in both indoor and outdoor scenes. While previous methods focus on images or 3D voxels, often obscuring natural 3D patterns and invariances of 3D data, we directly operate on raw point clouds by popping up RGB-D scans. However, a key challenge of this approach is how to efficiently localize objects in point clouds of large-scale scenes (region proposal). Instead of solely relying on 3D proposals, our method leverages both mature 2D object detectors and advanced 3D deep learning for object localization, achieving both efficiency and high recall even for small objects. Benefiting from learning directly in raw point clouds, our method is also able to precisely estimate 3D bounding boxes even under strong occlusion or with very sparse points. Evaluated on the KITTI and SUN RGB-D 3D detection benchmarks, our method outperforms the state of the art by remarkable margins while having real-time capability.
Recent work has shown that depth estimation from a stereo pair of images can be formulated as a supervised learning task to be resolved with convolutional neural networks (CNNs). However, current architectures rely on patch-based Siamese networks, lacking the means to exploit context information for finding correspondence in ill-posed regions. To tackle this problem, we propose PSMNet, a pyramid stereo matching network consisting of two main modules: spatial pyramid pooling and a 3D CNN. The spatial pyramid pooling module exploits global context information by aggregating context at different scales and locations to form a cost volume. The 3D CNN learns to regularize the cost volume using stacked hourglass networks in conjunction with intermediate supervision. The proposed approach was evaluated on several benchmark datasets, and our method ranked first on the KITTI 2012 and 2015 leaderboards before March 18, 2018. The code for PSMNet is available at: https://github.com/JiaRenChang/PSMNet.
We propose a novel deep learning architecture for regressing disparity from a rectified pair of stereo images. We leverage knowledge of the problem's geometry to form a cost volume using deep feature representations. We learn to incorporate contextual information using 3-D convolutions over this volume. Disparity values are regressed from the cost volume using a proposed differentiable soft argmin operation, which allows us to train our method end-to-end to sub-pixel accuracy without any additional post-processing or regularization. We evaluate our method on the Scene Flow and KITTI datasets; on KITTI we set a new state-of-the-art benchmark, while being significantly faster than competing approaches.
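The differentiable soft argmin referred to here is, in one standard formulation, an expectation under a softmax over negated matching costs: each candidate disparity gets a probability, and the prediction is the probability-weighted average, which is both sub-pixel and differentiable. A minimal sketch:

```python
import torch
import torch.nn.functional as F

def soft_argmin(cost_volume):
    """Soft argmin over a disparity cost volume of shape (B, D, H, W):
    softmax over negated costs gives per-pixel disparity probabilities;
    the expected disparity is the (differentiable) prediction."""
    D = cost_volume.shape[1]
    probs = F.softmax(-cost_volume, dim=1)                    # (B, D, H, W)
    disparities = torch.arange(D, dtype=probs.dtype).view(1, D, 1, 1)
    return (probs * disparities).sum(dim=1)                   # (B, H, W)

cv = torch.randn(2, 192, 64, 128)    # 192 disparity hypotheses
disp = soft_argmin(cv)               # sub-pixel disparity map
```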
State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet [1] and Fast R-CNN [2] have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. The RPN is trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. We further merge RPN and Fast R-CNN into a single network by sharing their convolutional features; using the recently popular terminology of neural networks with 'attention' mechanisms, the RPN component tells the unified network where to look. For the very deep VGG-16 model [3], our detection system has a frame rate of 5 fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on the PASCAL VOC 2007, 2012, and MS COCO datasets with only 300 proposals per image. In the ILSVRC and COCO 2015 competitions, Faster R-CNN and RPN are the foundations of the 1st-place winning entries in several tracks. Code has been made publicly available.
Alternating Direction Method of Multipliers (ADMM) is a widely used tool for machine learning in distributed settings, where a machine learning model is trained over distributed data sources through an interactive process of local computation and message passing. Such an iterative process could raise privacy concerns for data owners. The goal of this paper is to provide differential privacy for ADMM-based distributed machine learning. Prior approaches to differentially private ADMM exhibit low utility under strong privacy guarantees and often assume the objective functions of the learning problems to be smooth and strongly convex. To address these concerns, we propose a novel differentially private ADMM-based distributed learning algorithm called DP-ADMM, which combines an approximate augmented Lagrangian function with time-varying Gaussian noise addition in the iterative process to achieve higher utility for general objective functions under the same differential privacy guarantee. We also apply the moments accountant method to bound the end-to-end privacy loss. The theoretical analysis shows that DP-ADMM can be applied to a wider class of distributed learning problems, is provably convergent, and offers an explicit utility-privacy tradeoff. To our knowledge, this is the first paper to provide explicit convergence and utility properties for differentially private ADMM-based distributed learning algorithms. The evaluation results demonstrate that our approach achieves good convergence and model accuracy under a strong end-to-end differential privacy guarantee.
In this paper, we propose PointRCNN for 3D object detection from raw point clouds. The whole framework is composed of two stages: stage-1 for bottom-up 3D proposal generation and stage-2 for refining proposals in canonical coordinates to obtain the final detection results. Instead of generating proposals from RGB images or projecting the point cloud to a bird's eye view or voxels as previous methods do, our stage-1 sub-network directly generates a small number of high-quality 3D proposals from the point cloud in a bottom-up manner, by segmenting the point cloud of the whole scene into foreground and background points. The stage-2 sub-network transforms the pooled points of each proposal to canonical coordinates to learn better local spatial features, which are combined with the global semantic features of each point learned in stage-1 for accurate box refinement and confidence prediction. Extensive experiments on the 3D detection benchmark of the KITTI dataset show that our proposed architecture outperforms state-of-the-art methods with remarkable margins by using only point clouds as input. The code is available at https://github.com/sshaoshuai/PointRCNN.
In this paper we propose a novel deep neural network that is able to jointly reason about 3D detection, tracking and motion forecasting given data captured by a 3D sensor. By jointly reasoning about these tasks, our holistic approach is more robust to occlusion as well as sparse data at range. Our approach performs 3D convolutions across space and time over a bird's eye view representation of the 3D world, which is very efficient in terms of both memory and computation. Our experiments on a new very large-scale dataset captured in several North American cities show that we can outperform the state-of-the-art by a large margin. Importantly, by sharing computation we can perform all tasks in as little as 30 ms.
In this paper, we propose a novel 3D object detector that can exploit both LIDAR as well as cameras to perform very accurate localization. Towards this goal, we design an end-to-end learnable architecture that exploits continuous convolutions to fuse image and LIDAR feature maps at different levels of resolution. Our proposed continuous fusion layer encodes both discrete-state image features and continuous geometric information. This enables us to design a novel, reliable and efficient end-to-end learnable 3D object detector based on multiple sensors. Our experimental evaluation on both KITTI and a large-scale 3D object detection benchmark shows significant improvements over the state of the art.
Scene parsing is challenging for unrestricted open vocabulary and diverse scenes. In this paper, we exploit the capability of global context information by different-region-based context aggregation through our pyramid pooling module together with the proposed pyramid scene parsing network (PSPNet). Our global prior representation is effective for producing good-quality results on the scene parsing task, while PSPNet provides a superior framework for pixel-level prediction. The proposed approach achieves state-of-the-art performance on various datasets. It came first in the ImageNet scene parsing challenge 2016, the PASCAL VOC 2012 benchmark and the Cityscapes benchmark. A single PSPNet yields a new record mIoU of 85.4% on PASCAL VOC 2012 and an accuracy of 80.2% on Cityscapes.
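The pyramid pooling module at the heart of this design pools the feature map to several grid sizes, projects each with a 1x1 convolution, upsamples, and concatenates with the input. A minimal sketch using the paper's bin sizes (1, 2, 3, 6); the channel widths are illustrative and batch norm is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    """Sketch of a pyramid pooling module: multi-scale average pooling,
    1x1 projection, upsampling, and concatenation with the input."""
    def __init__(self, in_ch=2048, bins=(1, 2, 3, 6)):
        super().__init__()
        self.stages = nn.ModuleList(
            nn.Sequential(nn.AdaptiveAvgPool2d(b),
                          nn.Conv2d(in_ch, in_ch // len(bins), 1))
            for b in bins
        )

    def forward(self, x):
        h, w = x.shape[-2:]
        pooled = [F.interpolate(s(x), size=(h, w), mode="bilinear",
                                align_corners=False) for s in self.stages]
        return torch.cat([x] + pooled, dim=1)   # global context appended

x = torch.randn(1, 2048, 60, 60)
out = PyramidPooling()(x)   # (1, 4096, 60, 60)
```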
Our goal is to train a policy for autonomous driving via imitation learning that is robust enough to drive a real vehicle. We find that standard behavior cloning is insufficient for handling complex driving scenarios, even when we leverage a perception system for preprocessing the input and a controller for executing the output on the car: 30 million examples are still not enough. We propose exposing the learner to synthesized data in the form of perturbations to the expert's driving, which creates interesting situations such as collisions and/or going off the road. Rather than purely imitating all data, we augment the imitation loss with additional losses that penalize undesirable events and encourage progress; the perturbations then provide an important signal for these losses and lead to robustness of the learned model. We show that the ChauffeurNet model can handle complex situations in simulation, and present ablation experiments that emphasize the importance of each of our proposed changes and show that the model is responding to the appropriate causal factors. Finally, we demonstrate the model driving a real car at our test facility.
The goal of this paper is to perform 3D object detection in the context of autonomous driving. Our method aims at generating a set of high-quality 3D object proposals by exploiting stereo imagery. We formulate the problem as minimizing an energy function that encodes object size priors, placement of objects on the ground plane, as well as several depth-informed features that reason about free space, point cloud densities and distance to the ground. We then exploit a CNN on top of these proposals to perform object detection. In particular, we employ a convolutional neural network that exploits context and depth information to jointly regress 3D bounding box coordinates and object pose. Our experiments show significant performance gains over existing RGB and RGB-D object proposal methods on the challenging KITTI benchmark. When combined with the CNN, our approach outperforms all existing results in object detection and orientation estimation tasks for all three KITTI object classes. Furthermore, we also experiment with the setting where LIDAR information is available, and show that using both LIDAR and stereo leads to the best results.
State-of-the-art models for semantic segmentation are based on adaptations of convolutional networks that had originally been designed for image classification. However, dense prediction and image classification are structurally different. In this work, we develop a new convolutional network module that is specifically designed for dense prediction. The presented module uses dilated convolutions to systematically aggregate multi-scale contextual information without losing resolution. The architecture is based on the fact that dilated convolutions support exponential expansion of the receptive field without loss of resolution or coverage. We show that the presented context module increases the accuracy of state-of-the-art semantic segmentation systems. In addition, we examine the adaptation of image classification networks to dense prediction and show that simplifying the adapted network can increase accuracy.
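The exponential receptive-field growth this module relies on is easy to see in code: stacking 3x3 convolutions with doubling dilation rates keeps the spatial resolution fixed while the receptive field grows geometrically. An illustrative stack (the dilation schedule resembles the paper's context module, but channel widths and depth here are arbitrary):

```python
import torch
import torch.nn as nn

# Each 3x3 conv with dilation d adds 2*d to the receptive field while
# padding=d preserves the spatial resolution exactly.
layers = []
for d in (1, 2, 4, 8, 16):
    layers += [nn.Conv2d(32, 32, kernel_size=3, padding=d, dilation=d),
               nn.ReLU()]
context = nn.Sequential(*layers)

x = torch.randn(1, 32, 128, 128)
y = context(x)
print(y.shape)   # (1, 32, 128, 128): full resolution preserved
# Receptive field grows to 1 + 2*(1+2+4+8+16) = 63 pixels on a side.
```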
Few prior works study deep learning on point sets. PointNet by Qi et al. is a pioneer in this direction. However, by design PointNet does not capture local structures induced by the metric space the points live in, limiting its ability to recognize fine-grained patterns and its generalizability to complex scenes. In this work, we introduce a hierarchical neural network that applies PointNet recursively on a nested partitioning of the input point set. By exploiting metric space distances, our network is able to learn local features with increasing contextual scales. Further observing that point sets are usually sampled with varying densities, which greatly decreases the performance of networks trained on uniform densities, we propose novel set learning layers to adaptively combine features from multiple scales. Experiments show that our network, called PointNet++, is able to learn deep point set features efficiently and robustly. In particular, results significantly better than state-of-the-art have been obtained on challenging benchmarks of 3D point clouds.
3D object detection is an essential task in autonomous driving. Recent techniques excel with highly accurate detection rates, provided the 3D input data is obtained from precise but expensive LiDAR technology. Approaches based on cheaper monocular or stereo imagery data have, until now, resulted in drastically lower accuracies, a gap that is commonly attributed to poor image-based depth estimation. However, in this paper we argue that it is not the quality of the data but its representation that accounts for the majority of the difference. Taking the inner workings of convolutional neural networks into consideration, we propose to convert image-based depth maps to pseudo-LiDAR representations, essentially mimicking the LiDAR signal. With this representation we can apply different existing LiDAR-based detection algorithms. On the popular KITTI benchmark, our approach achieves impressive improvements over the existing state-of-the-art in image-based performance, raising the detection accuracy of objects within the 30m range from the previous state-of-the-art of 22% to an unprecedented 74%. At the time of submission our algorithm holds the highest entry on the KITTI 3D object detection leaderboard for stereo-image-based approaches.
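The pseudo-LiDAR conversion itself is plain pinhole back-projection of the estimated depth map into 3D points. A sketch with KITTI-like placeholder intrinsics; axis conventions, the transform into the LiDAR frame, and any ground filtering are dataset-specific details omitted here.

```python
import numpy as np

def depth_to_pseudo_lidar(depth, fx, fy, cx, cy):
    """Back-project a depth map into a 3D point cloud: a pixel (u, v) with
    depth z maps to x = (u - cx) * z / fx and y = (v - cy) * z / fy
    under the pinhole camera model. Intrinsics are assumed known."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

depth = np.random.uniform(1.0, 80.0, size=(375, 1242))   # dummy depth map
points = depth_to_pseudo_lidar(depth, fx=721.5, fy=721.5,
                               cx=609.6, cy=172.9)        # (N, 3) cloud
```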
In this paper, we present LaserNet, a computationally efficient method for 3D object detection from LiDAR data for autonomous driving. The efficiency results from processing LiDAR data in the native range view of the sensor, where the input data is naturally compact. Operating in the range view involves well-known challenges for learning, including occlusion and scale variation, but it also provides contextual information based on how the sensor data was captured. Our approach uses a fully convolutional network to predict a multimodal distribution over 3D boxes for each point, and then efficiently fuses these distributions to generate a prediction for each object. Experiments show that modeling each detection as a distribution rather than a single deterministic box leads to better overall detection performance. Benchmark results show that this approach has significantly lower runtime than other recent detectors and that it achieves state-of-the-art performance when compared on a large dataset that has enough data to overcome the challenges of training on the range view.
We present a compact but effective CNN model for optical flow, called PWC-Net. PWC-Net has been designed according to simple and well-established principles: pyramidal processing, warping, and the use of a cost volume. Cast in a learnable feature pyramid, PWC-Net uses the current optical flow estimate to warp the CNN features of the second image. It then uses the warped features and features of the first image to construct a cost volume, which is processed by a CNN to estimate the optical flow. PWC-Net is 17 times smaller in size and easier to train than the recent FlowNet2 model. Moreover, it outperforms all published optical flow methods on the MPI Sintel final pass and KITTI 2015 benchmarks, running at about 35 fps on Sintel-resolution (1024 × 436) images. Our models are available on our project website.
Visual understanding of complex urban street scenes is an enabling factor for a wide range of applications. Object detection has benefited enormously from large-scale datasets, especially in the context of deep learning. For semantic urban scene understanding, however, no current dataset adequately captures the complexity of real-world urban scenes. To address this, we introduce Cityscapes, a benchmark suite and large-scale dataset to train and test approaches for pixel-level and instance-level semantic labeling. Cityscapes is comprised of a large, diverse set of stereo video sequences recorded in streets from 50 different cities. 5000 of these images have high-quality pixel-level annotations; 20 000 additional images have coarse annotations to enable methods that leverage large volumes of weakly-labeled data. Crucially, our effort exceeds previous attempts in terms of dataset size, annotation richness, scene variability, and complexity. Our accompanying empirical study provides an in-depth analysis of the dataset characteristics, as well as a performance evaluation of several state-of-the-art approaches based on our benchmark.
Creating high definition maps that contain precise information about the static elements of the scene is of utmost importance for enabling self-driving cars to drive safely. In this paper, we tackle the problem of drivable road boundary extraction from LiDAR and camera imagery. Towards this goal, we design a structured model where a fully convolutional network obtains deep features encoding the location and direction of road boundaries, and then a convolutional recurrent network outputs a polyline representation for each of them. Importantly, our method is fully automatic and does not require a user in the loop. We showcase the effectiveness of our method on a large North American city, where we obtain perfect topology of road boundaries 99.3% of the time at high precision and recall.
One of the fundamental challenges to scale self-driving is being able to create accurate high definition maps (HD maps) at low cost. Current attempts to automate this process typically focus on simple scenarios, estimate independent maps per frame, or do not have the level of precision required by modern self-driving vehicles. In contrast, in this paper we focus on drawing the lane boundaries of complex highways with many lanes that contain topology changes due to forks and merges. Towards this goal, we formulate the problem as inference in a directed acyclic graphical model (DAG), where the nodes of the graph encode geometric and topological properties of the local regions of the lane boundaries. Since we do not know a priori the topology of the lanes, we also infer the DAG topology (i.e., nodes and edges) for each region. We demonstrate the effectiveness of our approach on two major North American highways in two different states and show high precision and recall as well as 89% correct topology.
Recognition algorithms based on convolutional networks (CNNs) typically use the output of the last layer as a feature representation. However, the information in this layer may be too coarse spatially to allow precise localization. On the contrary, earlier layers may be precise in localization but will not capture semantics. To get the best of both worlds, we define the hypercolumn at a pixel as the vector of activations of all CNN units above that pixel. Using hypercolumns as pixel descriptors, we show results on three fine-grained localization tasks: simultaneous detection and segmentation [22], where we improve the state-of-the-art from 49.7 mean AP^r [22] to 60.0; keypoint localization, where we get a 3.3 point boost over [20]; and part labeling, where we show a 6.6 point gain over a strong baseline.
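Constructing hypercolumns amounts to upsampling activations from several layers to a common resolution and concatenating them per pixel, so each descriptor stacks coarse semantics on top of fine localization. A minimal sketch; which layers to tap and the interpolation mode are design choices, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def hypercolumns(feature_maps, out_size):
    """Upsample each feature map to out_size and concatenate along the
    channel dimension, yielding one long descriptor per pixel."""
    upsampled = [F.interpolate(f, size=out_size, mode="bilinear",
                               align_corners=False) for f in feature_maps]
    return torch.cat(upsampled, dim=1)   # (B, sum of channels, H, W)

# Dummy activations from three layers of decreasing resolution:
maps = [torch.randn(1, c, s, s) for c, s in [(64, 112), (256, 28), (512, 7)]]
hc = hypercolumns(maps, out_size=(224, 224))   # (1, 832, 224, 224)
```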
In this paper, we tackle the problem of online road network extraction from sparse 3D point clouds. Our method is inspired by how an annotator builds a lane graph, by first identifying how many lanes there are and then drawing each one in turn. We develop a hierarchical recurrent network that attends to initial regions of a lane boundary and traces them out completely by outputting a structured polyline. We also propose a novel differentiable loss function that measures the deviation between the edges of the ground-truth polylines and their predictions. This is more suitable than distances on vertices, as there exist many ways to draw equivalent polylines. We demonstrate the effectiveness of our method on a 90 km stretch of highway, and show that we can recover the right topology 92% of the time.
We demonstrate that it is possible to train large recurrent language models with user-level differential privacy guarantees with only a negligible cost in predictive accuracy. Our work builds on recent advances in the training of deep networks on user-partitioned data and privacy accounting for stochastic gradient descent. In particular, we add user-level privacy protection to the federated averaging algorithm, which makes "large step" updates from user-level data. Our work demonstrates that given a dataset with a sufficiently large number of users (a requirement easily met by even small internet-scale datasets), achieving differential privacy comes at the cost of increased computation, rather than in decreased utility as in most prior work. We find that our private LSTM language models are quantitatively and qualitatively similar to un-noised models when trained on a large dataset.
Training modern deep learning models requires large amounts of computation, often provided by GPUs. Scaling computation from one GPU to many can enable much faster training and research progress but entails two complications. First, the training library must support inter-GPU communication. Depending on the particular methods employed, this communication may entail anywhere from negligible to significant overhead. Second, the user must modify his or her training code to take advantage of inter-GPU communication. Depending on the training library's API, the modification required may be either significant or minimal. Existing methods for enabling multi-GPU training under the TensorFlow library entail non-negligible communication overhead and require users to heavily modify their model-building code, leading many researchers to avoid the whole mess and stick with slower single-GPU training. In this paper we introduce Horovod, an open source library that improves on both obstructions to scaling: it employs efficient inter-GPU communication via ring reduction and requires only a few lines of modification to user code, enabling faster, easier distributed training in TensorFlow. Horovod is available under the Apache 2.0 license at this https URL
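The "few lines of modification" look roughly as follows in the TensorFlow v1 style API of the paper's era: pin one GPU per process, wrap the optimizer, and broadcast initial variables from rank 0. Exact entry points may differ across Horovod versions, and the toy model and data here are placeholders; the script would be launched with one process per GPU via mpirun (or the later horovodrun wrapper).

```python
import numpy as np
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()                                         # one process per GPU
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

x = tf.placeholder(tf.float32, [None, 10])         # toy regression model
y = tf.placeholder(tf.float32, [None, 1])
loss = tf.losses.mean_squared_error(y, tf.layers.dense(x, 1))

opt = tf.train.AdamOptimizer(0.001 * hvd.size())   # scale LR with workers
opt = hvd.DistributedOptimizer(opt)                # ring-allreduce gradients
train_op = opt.minimize(loss)

with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(hvd.broadcast_global_variables(0))    # sync initial state
    for _ in range(100):
        sess.run(train_op, feed_dict={x: np.random.randn(32, 10),
                                      y: np.random.randn(32, 1)})
```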
Autonomous driving presents one of the largest problems that the robotics and artificial intelligence communities are facing at the moment, both in terms of difficulty and potential societal impact. Self-driving vehicles (SDVs) are expected to prevent road accidents and save millions of lives while improving the livelihood and life quality of many more. However, despite large interest and a number of industry players working in the autonomous domain, there still remains more to be done in order to develop a system capable of operating at a level comparable to the best human drivers. One reason for this is the high uncertainty of traffic behavior and the large number of situations that an SDV may encounter on the roads, making it very difficult to create a fully generalizable system. To ensure safe and efficient operations, an autonomous vehicle is required to account for this uncertainty and to anticipate a multitude of possible behaviors of traffic actors in its surroundings. We address this critical problem and present a method to predict multiple possible trajectories of actors while also estimating their probabilities. The method encodes each actor's surrounding context into a raster image, used as input by deep convolutional networks to automatically derive relevant features for the task. Following extensive offline evaluation and comparison to state-of-the-art baselines, the method was successfully tested on SDVs in closed-course tests.
In this work, we revisit atrous convolution, a powerful tool to explicitly adjust a filter's field-of-view as well as control the resolution of feature responses computed by Deep Convolutional Neural Networks, in the application of semantic image segmentation. To handle the problem of segmenting objects at multiple scales, we design modules which employ atrous convolution in cascade or in parallel to capture multi-scale context by adopting multiple atrous rates. Furthermore, we propose to augment our previously proposed Atrous Spatial Pyramid Pooling module, which probes convolutional features at multiple scales, with image-level features encoding global context, further boosting performance. We also elaborate on implementation details and share our experience in training our system. The proposed 'DeepLabv3' system significantly improves over our previous DeepLab versions without DenseCRF post-processing and attains comparable performance with other state-of-the-art models on the PASCAL VOC 2012 semantic image segmentation benchmark.
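An ASPP block in the spirit described here runs parallel atrous convolutions at several rates alongside an image-level pooling branch and fuses the results. A sketch using the rates (6, 12, 18) associated with output stride 16; the channel widths and the bare 1x1 fusion are illustrative simplifications (batch norm and dropout are omitted).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """Sketch of atrous spatial pyramid pooling: a 1x1 branch, parallel
    3x3 atrous branches at several rates, and a global image-level
    branch, concatenated and projected back down."""
    def __init__(self, in_ch=2048, ch=256, rates=(6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, ch, 1)] +
            [nn.Conv2d(in_ch, ch, 3, padding=r, dilation=r) for r in rates]
        )
        self.image_pool = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                        nn.Conv2d(in_ch, ch, 1))
        self.project = nn.Conv2d(ch * (len(rates) + 2), ch, 1)

    def forward(self, x):
        h, w = x.shape[-2:]
        outs = [b(x) for b in self.branches]
        outs.append(F.interpolate(self.image_pool(x), size=(h, w),
                                  mode="bilinear", align_corners=False))
        return self.project(torch.cat(outs, dim=1))

y = ASPP()(torch.randn(1, 2048, 33, 33))   # (1, 256, 33, 33)
```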
The topic of semantic segmentation has witnessed considerable progress due to the powerful features learned by convolutional neural networks (CNNs) [13]. The current leading approaches for semantic segmentation exploit shape information by extracting CNN features from masked image regions. This strategy introduces artificial boundaries on the images and may impact the quality of the extracted features. Besides, the operations on the raw image domain require computing thousands of network evaluations for a single image, which is time-consuming. In this paper, we propose to exploit shape information via masking convolutional features. The proposal segments (e.g., super-pixels) are treated as masks on the convolutional feature maps. The CNN features of segments are directly masked out from these maps and used to train classifiers for recognition. We further propose a joint method to handle objects and "stuff" (e.g., grass, sky, water) in the same framework. State-of-the-art results are demonstrated on the PASCAL VOC and new PASCAL-CONTEXT benchmarks, with a compelling computational speed.
In this paper, we tackle the problem of relational behavior forecasting from sensor data. Towards this goal, we propose a novel spatially-aware graph neural network (SpAGNN) that models the interactions between agents in the scene. Specifically, we exploit a convolutional neural network to detect the actors and compute their initial states. A graph neural network then iteratively updates the actor states via a message passing process. Inspired by Gaussian belief propagation, we design the messages to be spatially-transformed parameters of the output distributions from neighboring agents. Our model is fully differentiable, thus enabling end-to-end training. Importantly, our probabilistic predictions can model uncertainty at the trajectory level. We demonstrate the effectiveness of our approach by achieving significant improvements over the state-of-the-art on two real-world self-driving datasets: ATG4D and nuScenes.
Convolutional neural networks (CNNs) are inherently limited in modeling geometric transformations due to the fixed geometric structures in their building modules. In this work, we introduce two new modules to enhance the transformation modeling capability of CNNs, namely deformable convolution and deformable RoI pooling. Both are based on the idea of augmenting the spatial sampling locations in the modules with additional offsets and learning the offsets from the target tasks, without additional supervision. The new modules can readily replace their plain counterparts in existing CNNs and can be easily trained end-to-end by standard back-propagation, giving rise to deformable convolutional networks. Extensive experiments validate the performance of our approach. For the first time, we show that learning dense spatial transformation in deep CNNs is effective for sophisticated vision tasks such as object detection and semantic segmentation. The code is released at https://github.com/msracver/Deformable-ConvNets.
Point cloud is an important type of geometric data structure. Due to its irregular format, most researchers transform such data to regular 3D voxel grids or collections of images. This, however, renders data unnecessarily voluminous and causes issues. In this paper, we design a novel type of neural network that directly consumes point clouds, which well respects the permutation invariance of points in the input. Our network, named PointNet, provides a unified architecture for applications ranging from object classification, part segmentation, to scene semantic parsing. Though simple, PointNet is highly efficient and effective. Empirically, it shows strong performance on par or even better than state of the art. Theoretically, we provide analysis towards understanding of what the network has learnt and why the network is robust with respect to input perturbation and corruption.
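The permutation invariance at the core of this design comes from applying a shared MLP to every point and aggregating with a symmetric function (a max-pool). A stripped-down classification sketch that omits the input and feature transform networks; the class name and layer widths are illustrative.

```python
import torch
import torch.nn as nn

class TinyPointNet(nn.Module):
    """Sketch of the PointNet idea: a per-point shared MLP followed by a
    symmetric max-pool, so the output is invariant to point ordering."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.point_mlp = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, 1024), nn.ReLU(),
        )
        self.head = nn.Linear(1024, num_classes)

    def forward(self, pts):                     # (B, N, 3) unordered points
        feats = self.point_mlp(pts)             # shared weights per point
        global_feat = feats.max(dim=1).values   # symmetric aggregation
        return self.head(global_feat)

net = TinyPointNet()
pts = torch.randn(4, 2048, 3)
perm = pts[:, torch.randperm(2048)]             # shuffle the points
assert torch.allclose(net(pts), net(perm), atol=1e-5)  # same output
```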
Few ideas have enjoyed as large an impact on deep learning as convolution. For any problem involving pixels or spatial representations, common intuition holds that convolutional neural networks may be appropriate. In this paper we show a striking counterexample to this intuition via the seemingly trivial coordinate transform problem, which simply requires learning a mapping between coordinates in (x,y) Cartesian space and one-hot pixel space. Although convolutional networks would seem appropriate for this task, we show that they fail spectacularly. We demonstrate and carefully analyze the failure first on a toy problem, at which point a simple fix becomes obvious. We call this solution CoordConv, which works by giving convolution access to its own input coordinates through the use of extra coordinate channels. Without sacrificing the computational and parametric efficiency of ordinary convolution, CoordConv allows networks to learn either complete translation invariance or varying degrees of translation dependence, as required by the end task. CoordConv solves the coordinate transform problem with perfect generalization, 150 times faster and with 10-100 times fewer parameters than convolution. This stark contrast raises the question: to what extent has this inability of convolution persisted insidiously inside other tasks, subtly hampering performance from within? A complete answer to this question will require further investigation, but we show preliminary evidence that swapping convolution for CoordConv can improve models on a diverse set of tasks. Using CoordConv in a GAN produced less mode collapse as the transform between high-level spatial latents and pixels becomes easier to learn. A Faster R-CNN detection model trained on MNIST showed 24% better IOU when using CoordConv, and in the RL domain agents playing Atari games benefit significantly from the use of CoordConv layers.
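CoordConv itself is a small change: append normalized coordinate channels to the input before an ordinary convolution. A minimal sketch (the paper also describes an optional radius channel, omitted here; the class name is illustrative).

```python
import torch
import torch.nn as nn

class CoordConv2d(nn.Module):
    """Sketch of a CoordConv layer: concatenate normalized y and x
    coordinate channels onto the input, then apply a standard Conv2d,
    giving the filter direct access to where it is in the image."""
    def __init__(self, in_ch, out_ch, **kwargs):
        super().__init__()
        self.conv = nn.Conv2d(in_ch + 2, out_ch, **kwargs)

    def forward(self, x):
        b, _, h, w = x.shape
        ys = torch.linspace(-1, 1, h, device=x.device).view(1, 1, h, 1)
        xs = torch.linspace(-1, 1, w, device=x.device).view(1, 1, 1, w)
        coords = torch.cat([ys.expand(b, 1, h, w),
                            xs.expand(b, 1, h, w)], dim=1)
        return self.conv(torch.cat([x, coords], dim=1))

layer = CoordConv2d(16, 32, kernel_size=3, padding=1)
out = layer(torch.randn(2, 16, 64, 64))   # (2, 32, 64, 64)
```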
We present a method for extracting depth information from a rectified image pair. Our approach focuses on the first stage of many stereo algorithms: the matching cost computation. We approach the problem by learning a similarity measure on small image patches using a convolutional neural network. Training is carried out in a supervised manner by constructing a binary classification data set with examples of similar and dissimilar pairs of patches. We examine two network architectures for this task: one tuned for speed, the other for accuracy. The output of the convolutional neural network is used to initialize the stereo matching cost. A series of post-processing steps follow: cross-based cost aggregation, semiglobal matching, a left-right consistency check, subpixel enhancement, a median filter, and a bilateral filter. We evaluate our method on the KITTI 2012, KITTI 2015, and Middlebury stereo data sets and show that it outperforms other approaches on all three data sets.
The way that information propagates in neural networks is of great importance. In this paper, we propose the Path Aggregation Network (PANet), aiming at boosting information flow in the proposal-based instance segmentation framework. Specifically, we enhance the entire feature hierarchy with accurate localization signals in lower layers by bottom-up path augmentation, which shortens the information path between lower layers and the topmost feature. We present adaptive feature pooling, which links the feature grid and all feature levels to make useful information in each level propagate directly to the following proposal subnetworks. A complementary branch capturing different views for each proposal is created to further improve mask prediction. These improvements are simple to implement, with little extra computational overhead. Yet they are useful and allow our PANet to reach 1st place in the COCO 2017 Challenge Instance Segmentation task and 2nd place in the Object Detection task without large-batch training. PANet is also state-of-the-art on MVD and Cityscapes.
Convolutional networks are powerful visual models that yield hierarchies of features. We show that convolutional networks by themselves, trained end-to-end, pixels-to-pixels, improve on the previous best result in semantic segmentation. Our key insight is to build "fully convolutional" networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning. We define and detail the space of fully convolutional networks, explain their application to spatially dense prediction tasks, and draw connections to prior models. We adapt contemporary classification networks (AlexNet, the VGG net, and GoogLeNet) into fully convolutional networks and transfer their learned representations by fine-tuning to the segmentation task. We then define a skip architecture that combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer to produce accurate and detailed segmentations. Our fully convolutional networks achieve improved segmentation of PASCAL VOC (30% relative improvement to 67.2% mean IU on 2012), NYUDv2, SIFT Flow, and PASCAL-Context, while inference takes one tenth of a second for a typical image.
In this paper we show how to learn directly from image data (i.e., without resorting to manually-designed features) a general similarity function for comparing image patches, which is a task of fundamental importance for many computer vision problems. To encode such a function, we opt for a CNN-based model that is trained to account for a wide variety of changes in image appearance. To that end, we explore and study multiple neural network architectures, which are specifically adapted to this task. We show that such an approach can significantly outperform the state-of-the-art on several problems and benchmark datasets.