
All published works (39)

As deep learning models and input data continue to scale at an unprecedented rate, it has become inevitable to move towards distributed training platforms to fit the models and increase training throughput. State-of-the-art distributed training systems are adopting emerging approaches and techniques such as wafer-scale nodes, multi-dimensional network topologies, disaggregated memory systems, and optimized parallelization strategies. This results in a complex software/hardware co-design stack, necessitating a modeling/simulation infrastructure for design-space exploration. This paper introduces ASTRA-sim 2.0, which extends the open-source ASTRA-sim infrastructure with capabilities to model state-of-the-art and emerging distributed training models and platforms. Specifically, we enable ASTRA-sim to (i) support arbitrary model parallelization strategies via a graph-based training-loop implementation, (ii) implement a parameterizable multi-dimensional heterogeneous topology generation infrastructure with the capability to simulate target systems at scale through analytical performance estimation, and (iii) enhance memory system modeling to support accurate modeling of in-network collective communication and disaggregated memory systems. With these capabilities, we conduct comprehensive case studies targeting emerging distributed models and platforms. ASTRA-sim 2.0 enables system designers to swiftly traverse the complex co-design stack and gain meaningful insights when designing and deploying distributed training platforms at scale.
Building large AI fleets to support the rapidly growing DL workloads is an active research topic for modern cloud providers. Generating accurate benchmarks plays an essential role in designing the fast-paced software and hardware solutions in this space. Two fundamental challenges to make this scalable are (i) workload representativeness and (ii) the ability to quickly incorporate changes to the fleet into the benchmarks. To overcome these issues, we propose Mystique, an accurate and scalable framework for production AI benchmark generation. It leverages the PyTorch execution trace (ET), a new feature that captures the runtime information of AI models at the granularity of operators, in a graph format, together with their metadata. By sourcing fleet ETs, we can build AI benchmarks that are portable and representative. Mystique is scalable, due to its lightweight data collection, in terms of runtime overhead and instrumentation effort. It is also adaptive because ET composability allows flexible control on benchmark creation. We evaluate our methodology on several production AI models, and show that benchmarks generated with Mystique closely resemble original AI models, both in execution time and system-level metrics. We also showcase the portability of the generated benchmarks across platforms, and demonstrate several use cases enabled by the fine-grained composability of the execution trace.
As deep learning models and input data are scaling at an unprecedented rate, it is inevitable to move towards distributed training platforms to fit the model and increase training throughput. State-of-the-art approaches and techniques, such as wafer-scale nodes, multi-dimensional network topologies, disaggregated memory systems, and parallelization strategies, have been actively adopted by emerging distributed training systems. This results in a complex SW/HW co-design stack of distributed training, necessitating a modeling/simulation infrastructure for design-space exploration. In this paper, we extend the open-source ASTRA-sim infrastructure and endow it with the capabilities to model state-of-the-art and emerging distributed training models and platforms. More specifically, (i) we enable ASTRA-sim to support arbitrary model parallelization strategies via a graph-based training-loop implementation, (ii) we implement a parameterizable multi-dimensional heterogeneous topology generation infrastructure with analytical performance estimates enabling simulating target systems at scale, and (iii) we enhance the memory system modeling to support accurate modeling of in-network collective communication and disaggregated memory systems. With such capabilities, we run comprehensive case studies targeting emerging distributed models and platforms. This infrastructure lets system designers swiftly traverse the complex co-design stack and gain meaningful insights when designing and deploying distributed training platforms at scale.
Benchmarking and co-design are essential for driving optimizations and innovation around ML models, ML software, and next-generation hardware. Full workload benchmarks, e.g. MLPerf, play an essential role in enabling fair comparison across different software and hardware stacks, especially once systems are fully designed and deployed. However, the pace of AI innovation demands a more agile methodology to benchmark creation and usage by simulators and emulators for future system co-design. We propose Chakra, an open graph schema for standardizing workload specification capturing key operations and dependencies, also known as Execution Trace (ET). In addition, we propose a complementary set of tools/capabilities to enable collection, generation, and adoption of Chakra ETs by a wide range of simulators, emulators, and benchmarks. For instance, we use generative AI models to learn latent statistical properties across thousands of Chakra ETs and use these models to synthesize Chakra ETs. These synthetic ETs can obfuscate key proprietary information and also target future what-if scenarios. As an example, we demonstrate an end-to-end proof-of-concept that converts PyTorch ETs to Chakra ETs and uses this to drive an open-source training system simulator (ASTRA-sim). Our end-goal is to build a vibrant industry-wide ecosystem of agile benchmarks and tools to drive future AI system co-design.
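To illustrate the kind of operator-and-dependency graph an execution trace captures, here is a minimal sketch in Python. The node fields and the replay routine are hypothetical simplifications for illustration, not the actual Chakra schema:

```python
from dataclasses import dataclass, field

@dataclass
class ETNode:
    """A simplified, hypothetical execution-trace node: an operator
    with metadata and the ids of the nodes it depends on."""
    id: int
    name: str
    op_type: str          # e.g. "COMP" (compute) or "COMM_COLL" (collective)
    duration_us: float
    deps: list = field(default_factory=list)

def replay_order(nodes):
    """Return node ids in a dependency-respecting (topological) order,
    roughly as a simulator replaying the trace would visit them."""
    by_id = {n.id: n for n in nodes}
    done, order = set(), []

    def visit(nid):
        if nid in done:
            return
        for d in by_id[nid].deps:   # parents must be replayed first
            visit(d)
        done.add(nid)
        order.append(nid)

    for n in nodes:
        visit(n.id)
    return order
```

A replay over such a graph (forward op, then an all-reduce and backward op that both depend on it, then an optimizer step) visits every node only after all of its dependencies.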
RDMA over Converged Ethernet (RoCE) has gained significant traction for datacenter networks due to its compatibility with conventional Ethernet-based fabric. However, the RDMA protocol is efficient only on (nearly) lossless networks, emphasizing the vital role of congestion control on RoCE networks. Unfortunately, the native RoCE congestion control scheme, based on Priority Flow Control (PFC), suffers from many drawbacks such as unfairness, head-of-line blocking, and deadlock. Therefore, in recent years many schemes have been proposed to provide additional congestion control for RoCE networks to minimize PFC drawbacks. However, these schemes are proposed for general datacenter environments. In contrast to the general datacenters that are built using commodity hardware and run general-purpose workloads, high-performance distributed training platforms deploy high-end accelerators and network components and exclusively run training workloads using collective (All-Reduce, All-To-All) communication libraries. Furthermore, these platforms usually have a private network, separating their communication traffic from the rest of the datacenter traffic. Scalable topology-aware collective algorithms are inherently designed to avoid incast patterns and balance traffic optimally. These distinct features necessitate revisiting previously proposed congestion control schemes for general-purpose datacenter environments. In this paper, we thoroughly analyze some of the state-of-the-art RoCE congestion control schemes (DCQCN, DCTCP, TIMELY, and HPCC) vs. PFC when running on distributed training platforms. Our results indicate that previously proposed RoCE congestion control schemes have little impact on the end-to-end performance of training workloads, motivating the necessity of designing an optimized, yet low-overhead, congestion control scheme based on the characteristics of distributed training platforms and workloads.
Deep learning recommendation models (DLRMs) have been used across many business-critical services at Meta and are the single largest AI application in terms of infrastructure demand in its data-centers. In this paper, we present Neo, a software-hardware co-designed system for high-performance distributed training of large-scale DLRMs. Neo employs a novel 4D parallelism strategy that combines table-wise, row-wise, column-wise, and data parallelism for training massive embedding operators in DLRMs. In addition, Neo enables extremely high-performance and memory-efficient embedding computations using a variety of critical systems optimizations, including hybrid kernel fusion, software-managed caching, and quality-preserving compression. Finally, Neo is paired with ZionEX, a new hardware platform co-designed with Neo's 4D parallelism for optimizing communications for large-scale DLRM training. Our evaluation on 128 GPUs using 16 ZionEX nodes shows that Neo outperforms existing systems by up to 40× for training 12-trillion-parameter DLRM models deployed in production.
RDMA over Converged Ethernet (RoCE) has gained significant traction for datacenter networks due to its compatibility with conventional Ethernet-based fabric. However, the RDMA protocol is efficient only on (nearly) lossless networks, emphasizing the vital role of congestion control on RoCE networks. Unfortunately, the native RoCE congestion control scheme, based on Priority Flow Control (PFC), suffers from many drawbacks such as unfairness, head-of-line blocking, and deadlock. Therefore, in recent years many schemes have been proposed to provide additional congestion control for RoCE networks to minimize PFC drawbacks. However, these schemes are proposed for general datacenter environments. In contrast to the general datacenters that are built using commodity hardware and run general-purpose workloads, high-performance distributed training platforms deploy high-end accelerators and network components and exclusively run training workloads using collective (All-Reduce, All-To-All) communication libraries. Furthermore, these platforms usually have a private network, separating their communication traffic from the rest of the datacenter traffic. Scalable topology-aware collective algorithms are inherently designed to avoid incast patterns and balance traffic optimally. These distinct features necessitate revisiting previously proposed congestion control schemes for general-purpose datacenter environments. In this paper, we thoroughly analyze some of the state-of-the-art RoCE congestion control schemes vs. PFC when running on distributed training platforms. Our results indicate that previously proposed RoCE congestion control schemes have little impact on the end-to-end performance of training workloads, motivating the necessity of designing an optimized, yet low-overhead, congestion control scheme based on the characteristics of distributed training platforms and workloads.
The continuous growth in both size and training data for modern Deep Neural Networks (DNNs) models has led to training tasks taking days or even months. Distributed training is a solution to reduce training time by splitting the task across multiple NPUs (e.g., GPU/TPU). However, distributed training adds communication overhead between the NPUs in order to synchronize the gradients and/or activation, depending on the parallelization strategy. In today's datacenters, for training at scale, NPUs are connected through multi-dimensional interconnection links with different bandwidth and latency. Hence, keeping all network dimensions busy and maximizing the network BW is a challenging task in such a hybrid network environment, as this work identifies. We propose Themis, a novel collective scheduling scheme that dynamically schedules collectives (divided into chunks) to balance the communication loads across all dimensions, further improving the network BW utilization. Our results show that on average, Themis can improve the network BW utilization of single All-Reduce by 1.88x (2.92x max), and improve the end-to-end training iteration performance of real workloads such as ResNet-50, GNMT, DLRM, and Transformer-1T by 1.49x (1.96x max), 1.41x (1.81x max), 1.42x (1.80x max), and 1.35x (1.78x max), respectively.
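The core idea of chunk-level scheduling — splitting a collective into chunks and steering each chunk to whichever network dimension keeps loads balanced — can be sketched with a greedy scheduler. This is a deliberately simplified illustration (hypothetical function, flat cost model); Themis's actual scheduler is dynamic and topology-aware:

```python
def schedule_chunks(chunk_bytes, dim_bandwidths):
    """Greedily assign each collective chunk to the network dimension
    that would finish it earliest, balancing per-dimension load.
    chunk_bytes: sizes of the chunks a collective was divided into.
    dim_bandwidths: bandwidth of each network dimension (bytes/unit time)."""
    loads = [0.0] * len(dim_bandwidths)   # accumulated time per dimension
    assignment = []
    for size in chunk_bytes:
        # pick the dimension whose queue finishes this chunk earliest
        best = min(range(len(dim_bandwidths)),
                   key=lambda d: loads[d] + size / dim_bandwidths[d])
        loads[best] += size / dim_bandwidths[best]
        assignment.append(best)
    return assignment, loads
```

For example, four equal chunks over one fast and one slow dimension end up split 3:1, keeping the two finish times close instead of leaving the slow dimension idle.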
Deep Learning (DL) training platforms are built by interconnecting multiple DL accelerators (e.g., GPU/TPU) via fast, customized interconnects with 100s of gigabytes (GBs) of bandwidth. However, as we identify in this work, driving this bandwidth is quite challenging. This is because there is a pernicious balance between using the accelerator's compute and memory for both DL computations and communication. This work makes two key contributions. First, via real system measurements and detailed modeling, we provide an understanding of compute and memory bandwidth demands for DL compute and comms. Second, we propose a novel DL collective communication accelerator called Accelerator Collectives Engine (ACE) that sits alongside the compute and networking engines at the accelerator endpoint. ACE frees up the endpoint's compute and memory resources for DL compute, which in turn reduces the required memory BW by 3.5X on average to drive the same network BW compared to state-of-the-art baselines. For modern DL workloads and different network sizes, ACE, on average, increases the effective network bandwidth utilization by 1.44X (up to 2.67X), resulting in an average of 1.41X (up to 1.51X), 1.12X (up to 1.17X), and 1.13X (up to 1.19X) speedup in iteration time for ResNet-50, GNMT and DLRM when compared to the best baseline configuration, respectively.
Deep learning recommendation models (DLRMs) are used across many business-critical services at Facebook and are the single largest AI application in terms of infrastructure demand in its data-centers. In this paper we discuss the SW/HW co-designed solution for high-performance distributed training of large-scale DLRMs. We introduce a high-performance scalable software stack based on PyTorch and pair it with the new evolution of the Zion platform, namely ZionEX. We demonstrate the capability to train very large DLRMs with up to 12 Trillion parameters and show that we can attain 40X speedup in terms of time to solution over previous systems. We achieve this by (i) designing the ZionEX platform with a dedicated scale-out network, provisioned with high bandwidth, optimal topology and efficient transport; (ii) implementing an optimized PyTorch-based training stack supporting both model and data parallelism; (iii) developing sharding algorithms capable of hierarchical partitioning of the embedding tables along row and column dimensions and load balancing them across multiple workers; (iv) adding high-performance core operators while retaining flexibility to support optimizers with fully deterministic updates; and (v) leveraging reduced precision communications, multi-level memory hierarchy (HBM+DDR+SSD) and pipelining. Furthermore, we develop and briefly comment on distributed data ingestion and other supporting services that are required for the robust and efficient end-to-end training in production environments.
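The row-wise vs. column-wise partitioning of an embedding table can be pictured with a small NumPy sketch. The helper names and the flat partitioning are hypothetical illustrations, not the production sharding algorithm (which also hierarchically balances load across workers):

```python
import numpy as np

def shard_table(table, num_workers, mode):
    """Split an embedding table across workers either row-wise
    (by embedding row index) or column-wise (by embedding dimension)."""
    axis = 0 if mode == "row" else 1
    return np.array_split(table, num_workers, axis=axis)

def lookup(shards, mode, idx):
    """Recombine one embedding-row lookup from the shards."""
    if mode == "column":
        # every worker holds a slice of every row: concatenate the pieces
        return np.concatenate([s[idx] for s in shards])
    # row mode: exactly one worker owns the requested row
    sizes = np.cumsum([s.shape[0] for s in shards])
    w = int(np.searchsorted(sizes, idx, side="right"))
    return shards[w][idx - (sizes[w - 1] if w else 0)]
```

The trade-off this exposes: a column-wise lookup touches every worker but moves only a slice from each, while a row-wise lookup touches a single worker; balancing the two is what the 4D sharding strategy automates.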
Deep Learning (DL) training platforms are built by interconnecting multiple DL accelerators (e.g., GPU/TPU) via fast, customized interconnects. As the size of DL models and the compute efficiency of the accelerators have continued to increase, there has also been a corresponding steady increase in the bandwidth of these interconnects. Systems today provide 100s of gigabytes (GBs) of interconnect bandwidth via a mix of solutions such as Multi-Chip packaging Modules (MCM) and proprietary interconnects (e.g., NVLink) that together form the scale-up network of accelerators. However, as we identify in this work, a significant portion of this bandwidth goes under-utilized. This is because (i) using compute cores for executing collective operations such as all-reduce decreases overall compute efficiency, (ii) there is memory bandwidth contention between the accesses for arithmetic operations and those for collectives, and (iii) there are significant internal bus congestions that increase the latency of communication operations. To address this challenge, we propose a novel microarchitecture, called Accelerator Collectives Engine (ACE), for DL collective communication offload. ACE is a smart network interface (NIC) tuned to cope with the high-bandwidth and low-latency requirements of scale-up networks and is able to efficiently drive the various scale-up network systems (e.g., switch-based or point-to-point topologies). We evaluate the benefits of ACE with micro-benchmarks (e.g., single collective performance) and popular DL models using an end-to-end DL training simulator. For modern DL workloads, ACE on average increases the network bandwidth utilization by 1.97X, resulting in 2.71X and 1.44X speedup in iteration time for ResNet-50 and GNMT, respectively.
Large-scale training is important to ensure high performance and accuracy of machine-learning models. At Facebook we use many different models, including computer vision, video and language models. However, in this paper we focus on the deep learning recommendation models (DLRMs), which are responsible for more than 50% of the training demand in our data centers. Recommendation models present unique challenges in training because they exercise not only compute but also memory capacity as well as memory and network bandwidth. As model size and complexity increase, efficiently scaling training becomes a challenge. To address it we design Zion - Facebook's next-generation large-memory training platform that consists of both CPUs and accelerators. Also, we discuss the design requirements of future scale-out training systems.
The deep neural networks (DNNs) have been enormously successful in tasks that were hitherto in the human-only realm such as image recognition and language translation. Owing to their success, the DNNs are being explored for use in ever more sophisticated tasks. One of the ways that the DNNs are made to scale for the complex undertakings is by increasing their size -- deeper and wider networks can model well the additional complexity. Such large models are trained using model parallelism on multiple compute devices such as multi-GPUs and multi-node systems. In this paper, we develop a compiler-driven approach to achieve model parallelism. We model the computation and communication costs of a dataflow graph that embodies the neural network training process and then partition the graph using heuristics in such a manner that the communication between compute devices is minimal and we have a good load balance. Hardware scheduling assistants are proposed to assist the compiler in fine-tuning the distribution of work at runtime.
The exponential growth in use of large deep neural networks has accelerated the need for training these deep neural networks in hours or even minutes. This can only be achieved through scalable and efficient distributed training, since a single node/card cannot satisfy the compute, memory, and I/O requirements of today's state-of-the-art deep neural networks. However, scaling synchronous Stochastic Gradient Descent (SGD) is still a challenging problem and requires continued research/development. This entails innovations spanning algorithms, frameworks, communication libraries, and system design. In this paper, we describe the philosophy, design, and implementation of Intel Machine Learning Scalability Library (MLSL) and present proof-points demonstrating scaling DL training on 100s to 1000s of nodes across Cloud and HPC systems.
The state-of-the-art (SOTA) for mixed precision training is dominated by variants of low precision floating point operations, and in particular, FP16 accumulating into FP32 (Micikevicius et al., 2017). On the other hand, while a lot of research has also happened in the domain of low and mixed-precision Integer training, these works either present results for non-SOTA networks (for instance only AlexNet for ImageNet-1K), or relatively small datasets (like CIFAR-10). In this work, we train state-of-the-art visual understanding neural networks on the ImageNet-1K dataset, with Integer operations on General Purpose (GP) hardware. In particular, we focus on Integer Fused-Multiply-and-Accumulate (FMA) operations which take two pairs of INT16 operands and accumulate results into an INT32 output. We propose a shared exponent representation of tensors and develop a Dynamic Fixed Point (DFP) scheme suitable for common neural network operations. The nuances of developing an efficient integer convolution kernel are examined, including methods to handle overflow of the INT32 accumulator. We implement CNN training for ResNet-50, GoogLeNet-v1, VGG-16 and AlexNet; and these networks achieve or exceed SOTA accuracy within the same number of iterations as their FP32 counterparts without any change in hyper-parameters and with a 1.8X improvement in end-to-end training throughput. To the best of our knowledge these results represent the first INT16 training results on GP hardware for the ImageNet-1K dataset using SOTA CNNs and achieve the highest reported accuracy using half-precision.
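The shared-exponent idea can be sketched in a few lines: each tensor stores INT16 mantissas plus one power-of-two exponent, and dot products accumulate INT16×INT16 products into a wider accumulator while the exponents simply add. This is an illustrative NumPy sketch with hypothetical function names, not the paper's kernel (the INT32 overflow handling it discusses is omitted):

```python
import numpy as np

def quantize_dfp(x, bits=16):
    """Quantize a float tensor to Dynamic Fixed Point: INT16 mantissas
    sharing a single power-of-two exponent, chosen so the largest
    magnitude in the tensor still fits in the integer range."""
    int_max = 2 ** (bits - 1) - 1
    max_abs = float(np.max(np.abs(x)))
    exp = int(np.ceil(np.log2(max_abs / int_max))) if max_abs > 0 else 0
    q = np.clip(np.round(x / 2.0 ** exp), -int_max, int_max).astype(np.int16)
    return q, exp

def dfp_dot(qa, ea, qb, eb):
    """INT16 x INT16 products accumulated into INT32; the shared
    exponents of the two tensors simply add."""
    acc = np.sum(qa.astype(np.int32) * qb.astype(np.int32), dtype=np.int32)
    return float(acc) * 2.0 ** (ea + eb)
```

Because the exponent is shared per tensor rather than per element, the multiply-accumulate loop is pure integer arithmetic, which is what makes the INT16 FMA path profitable.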
This paper presents the first, 15-PetaFLOP Deep Learning system for solving scientific pattern classification problems on contemporary HPC architectures. We develop supervised convolutional architectures for discriminating signals in high-energy physics data as well as semi-supervised architectures for localizing and classifying extreme weather in climate data. Our Intel Caffe-based implementation obtains $\sim$2TFLOP/s on a single Cori Phase-II Xeon-Phi node. We use a hybrid strategy employing synchronous node-groups, while using asynchronous communication across groups. We use this strategy to scale training of a single model to $\sim$9600 Xeon-Phi nodes; obtaining peak performance of 11.73-15.07 PFLOP/s and sustained performance of 11.41-13.27 PFLOP/s. At scale, our HEP architecture produces state-of-the-art classification accuracy on a dataset with 10M images, exceeding that achieved by selections on high-level physics-motivated features. Our semi-supervised architecture successfully extracts weather patterns in a 15TB climate dataset. Our results demonstrate that Deep Learning can be optimized and scaled effectively on many-core, HPC systems.
Production cross-sections of prompt charm mesons are measured with the first data from $pp$ collisions at the LHC at a centre-of-mass energy of $13\,\mathrm{TeV}$. The data sample corresponds to an integrated luminosity of $4.98 \pm 0.19\,\mathrm{pb}^{-1}$ collected by the LHCb experiment. The production cross-sections of $D^{0}$, $D^{+}$, $D_{s}^{+}$, and $D^{*+}$ mesons are measured in bins of charm meson transverse momentum, $p_{\mathrm{T}}$, and rapidity, $y$, and cover the range $0 < p_{\mathrm{T}} < 15\,\mathrm{GeV}/c$ and $2.0 < y < 4.5$. The inclusive cross-sections for the four mesons, including charge conjugation, within the range of $1 < p_{\mathrm{T}} < 8\,\mathrm{GeV}/c$ are found to be \begin{align*} \sigma(pp \to D^{0} X) &= 2072 \pm 2 \pm 124\,\mu\mathrm{b}\\ \sigma(pp \to D^{+} X) &= 834 \pm 2 \pm \phantom{1}78\,\mu\mathrm{b}\\ \sigma(pp \to D_{s}^{+} X) &= 353 \pm 9 \pm \phantom{1}76\,\mu\mathrm{b}\\ \sigma(pp \to D^{*+} X) &= 784 \pm 4 \pm \phantom{1}87\,\mu\mathrm{b} \end{align*} where the uncertainties are statistical and systematic, respectively.
The suppressed decay $Λ^{0}_{b}\rightarrow pπ^{-}μ^{+}μ^{-}$, excluding the $J/ψ$ and $ψ(2S)\rightarrow μ^{+}μ^{-}$ resonances, is observed for the first time with a significance of 5.5 standard deviations. The analysis is performed with proton-proton collision data corresponding to an integrated luminosity of $3\,\mathrm{fb}^{-1}$ collected with the LHCb experiment. The $Λ^{0}_{b}\rightarrow pπ^{-}μ^{+}μ^{-}$ branching fraction is measured relative to the $Λ^{0}_{b}\rightarrow J/ψ(\rightarrow μ^{+}μ^{-})pπ^{-}$ branching fraction, giving \begin{align} \nonumber \frac{\mathcal{B}(Λ^{0}_{b}\rightarrow pπ^{-}μ^{+}μ^{-})}{\mathcal{B}({Λ^{0}_{b}\rightarrow J/ψ(\rightarrow μ^{+}μ^{-})pπ^{-}})} &= 0.044\pm0.012\pm0.007, \end{align} where the first uncertainty is statistical and the second is systematic. This is the first observation of a $b\rightarrow d$ transition in a baryonic decay.
The oscillation frequency, $\Delta m_d$, of $B^0$ mesons is measured using semileptonic decays with a $D^-$ or $D^{*-}$ meson in the final state. The data sample corresponds to 3.0 $\mathrm{fb}^{-1}$ of $pp$ collisions, collected by the LHCb experiment at centre-of-mass energies $\sqrt{s}$ = 7 and 8 $\mathrm{TeV}$. A combination of the two decay modes gives $\Delta m_d = (505.0 \pm 2.1 \pm 1.0)\,\mathrm{ns}^{-1}$, where the first uncertainty is statistical and the second is systematic. This is the most precise single measurement of this parameter. It is consistent with the current world average and has similar precision.
A study of the decay $D^{0}\rightarrow K^{-}π^{+}μ^{+}μ^{-}$ is performed using data collected by the LHCb detector in proton-proton collisions at a centre-of-mass energy of 8 TeV, corresponding to an integrated luminosity of 2.0 fb$^{-1}$. Decay candidates with muon pairs that have an invariant mass in the range 675--875 MeV$/c^2$ are considered. This region is dominated by the $ρ^{0}$ and $ω$ resonances. The branching fraction in this range is measured to be ${\cal B}$($D^{0}\rightarrow K^{-}π^{+}μ^{+}μ^{-}$) = $(4.17 \pm 0.12(stat) \pm 0.40(syst))\times10^{-6}$. This is the first observation of the decay $D^{0}\rightarrow K^{-}π^{+}μ^{+}μ^{-}$. Its branching fraction is consistent with the value expected in the Standard Model.
Measurements are presented of electroweak boson production using data from $pp$ collisions at a centre-of-mass energy of $\sqrt{s} = 8\mathrm{\,Te\kern -0.1em V}$. The analysis is based on an integrated luminosity of $2.0\mathrm{\,fb}^{-1}$ recorded with the LHCb detector. The bosons are identified in the $W\rightarrow\mu\nu$ and $Z\rightarrow\mu^{+}\mu^{-}$ decay channels. The cross-sections are measured for muons in the pseudorapidity range $2.0 < \eta < 4.5$, with transverse momenta $p_{\rm T} > 20{\mathrm{\,Ge\kern -0.1em V\!/}c}$ and, in the case of the $Z$ boson, a dimuon mass within $60 < M_{\mu^{+}\mu^{-}} < 120{\mathrm{\,Ge\kern -0.1em V\!/}c^{2}}$. The results are \begin{align*} \sigma_{W^{+}\rightarrow\mu^{+}\nu} &= 1093.6 \pm 2.1 \pm 7.2 \pm 10.9 \pm 12.7{\rm \,pb} \, , \\ \sigma_{W^{-}\rightarrow\mu^{-}\bar{\nu}} &= \phantom{0}818.4 \pm 1.9 \pm 5.0 \pm \phantom{0}7.0 \pm \phantom{0}9.5{\rm \,pb} \, , \\ \sigma_{Z\rightarrow\mu^{+}\mu^{-}} &= \phantom{00}95.0 \pm 0.3 \pm 0.7 \pm \phantom{0}1.1 \pm \phantom{0}1.1{\rm \,pb} \, , \end{align*} where the first uncertainties are statistical, the second are systematic, the third are due to the knowledge of the LHC beam energy and the fourth are due to the luminosity determination. The evolution of the $W$ and $Z$ boson cross-sections with centre-of-mass energy is studied using previously reported measurements with $1.0\mathrm{\,fb}^{-1}$ of data at $7\mathrm{\,Te\kern -0.1em V}$. Differential distributions are also presented. Results are in good agreement with theoretical predictions at next-to-next-to-leading order in perturbative quantum chromodynamics.
A search for the rare decay of a $B^{0}$ or $B^{0}_{s}$ meson into the final state $J/\psi\gamma$ is performed, using data collected by the LHCb experiment in $pp$ collisions at $\sqrt{s}=7$ and $8$ TeV, corresponding to an integrated luminosity of 3 fb$^{-1}$. The observed number of signal candidates is consistent with a background-only hypothesis. Branching fraction values larger than $1.7\times 10^{-6}$ for the $B^{0}\to J/\psi\gamma$ decay mode are excluded at 90% confidence level. For the $B^{0}_{s}\to J/\psi\gamma$ decay mode, branching fraction values larger than $7.4\times 10^{-6}$ are excluded at 90% confidence level; this is the first branching fraction limit for this decay.
A search for $B_s^0 \to \overline{D}{}^0 f_0(980)$ decays is performed using $3.0\,\mathrm{fb}^{-1}$ of $pp$ collision data recorded by the LHCb experiment during 2011 and 2012. The $f_0(980)$ meson is reconstructed through its decay to the $π^+π^-$ final state in the mass window $900\,\mathrm{MeV}/c^2 < m(π^+π^-) < 1080\,\mathrm{MeV}/c^2$. No significant signal is observed.
The first upper limits on the branching fraction are set: $\mathcal{B}(B_s^0 \to \overline{D}{}^0 f_0(980)) < 3.1\,(3.4)\times 10^{-6}$ at 90% (95%) confidence level.
The decay $\overline{B}{}_s^0 \to ψ(2S)K^+π^-$ is observed using a data set corresponding to an integrated luminosity of $3.0\,\mathrm{fb}^{-1}$ collected by the LHCb experiment in $pp$ collisions at centre-of-mass energies of 7 and 8 TeV. The branching fraction relative to the $B^0 \to ψ(2S)K^+π^-$ decay mode is measured to be $$\frac{\mathcal{B}(\overline{B}{}_s^0 \to ψ(2S)K^+π^-)}{\mathcal{B}(B^0 \to ψ(2S)K^+π^-)} = 5.38 \pm 0.36\,(\mathrm{stat}) \pm 0.22\,(\mathrm{syst}) \pm 0.31\,(f_s/f_d)\,\%,$$ where $f_s/f_d$ indicates the uncertainty due to the ratio of probabilities for a $b$ quark to hadronise into a $B_s^0$ or $B^0$ meson. Using an amplitude analysis, the fraction of decays proceeding via an intermediate $K^*(892)^0$ meson is measured to be $0.645 \pm 0.049\,(\mathrm{stat}) \pm 0.049\,(\mathrm{syst})$ and its longitudinal polarisation fraction is $0.524 \pm 0.056\,(\mathrm{stat}) \pm 0.029\,(\mathrm{syst})$. The relative branching fraction for this component is determined to be $$\frac{\mathcal{B}(\overline{B}{}_s^0 \to ψ(2S)K^*(892)^0)}{\mathcal{B}(B^0 \to ψ(2S)K^*(892)^0)} = 5.58 \pm 0.57\,(\mathrm{stat}) \pm 0.40\,(\mathrm{syst}) \pm 0.32\,(f_s/f_d)\,\%.$$ In addition, the mass splitting between the $B_s^0$ and $B^0$ mesons is measured as $$M(B_s^0) - M(B^0) = 87.45 \pm 0.44\,(\mathrm{stat}) \pm 0.09\,(\mathrm{syst})\,\mathrm{MeV}/c^2.$$
Invariant mass distributions of $B^{+}π^{-}$ and $B^{0}π^{+}$ combinations are investigated in order to study excited $B$ mesons. The analysis is based on a data sample corresponding to 3.0 fb$^{-1}$ of $pp$ collision data, recorded by the LHCb detector at centre-of-mass energies of 7 and 8 TeV. Precise measurements of the masses and widths of the $B_{1}(5721)^{0,+}$ and $B_{2}(5747)^{0,+}$ states are reported. Clear enhancements, particularly prominent at high pion transverse momentum, are seen over background in the mass range 5850--6000 MeV in both $B^{+}π^{-}$ and $B^{0}π^{+}$ combinations. The structures are consistent with the presence of four excited $B$ mesons, labelled $B_{J}(5840)^{0,+}$ and $B_{J}(5960)^{0,+}$, whose masses and widths are obtained under different hypotheses for their quantum numbers.
A search is presented for long-lived particles with a mass between 25 and 50 $\mathrm{GeV}/c^{2}$ and a lifetime between 1 and 200 ps in a sample of proton–proton collisions at a centre-of-mass energy of $\sqrt{s}=7$ TeV, corresponding to an integrated luminosity of 0.62 fb$^{-1}$, collected by the LHCb detector. The particles are assumed to be pair-produced by the decay of a standard model-like Higgs boson. The experimental signature of the long-lived particle is a displaced vertex with two associated jets. No excess above the background is observed and limits are set on the production cross-section as a function of the long-lived particle mass and lifetime.
The resonant substructure of $B_s^0 \to \overline{D}{}^0 K^-π^+$ decays is studied with the Dalitz plot analysis technique. The study is based on a data sample corresponding to an integrated luminosity of 3.0 fb$^{-1}$ of $pp$ collision data recorded by LHCb. A structure at $m(\overline{D}{}^0 K^-) \approx 2.86\,\mathrm{GeV}/c^2$ is found to be an admixture of spin-1 and spin-3 resonances. The masses and widths of these states and of the $D^{*}_{s2}(2573)^-$ meson are measured, as are the complex amplitudes and fit fractions for all the $\overline{D}{}^0 K^-$ and $K^-π^+$ components included in the amplitude model. In addition, the $D^{*}_{s2}(2573)^-$ resonance is confirmed to be spin 2. Received 30 July 2014. DOI: https://doi.org/10.1103/PhysRevD.90.072003. © 2014 CERN, for the LHCb Collaboration.
This paper considers demand side management in smart power grid systems containing significant numbers of thermostatically controlled loads such as air conditioning systems, heat pumps, etc. Recent studies have shown that the overall power consumption of such systems can be regulated up and down centrally by broadcasting small setpoint change commands without significantly impacting consumer comfort. However, sudden simultaneous setpoint changes induce undesirable power consumption oscillations due to sudden synchronization of the on/off cycles of the individual units. In this paper, we present a novel algorithm for counteracting these unwanted oscillations, which requires neither central management of the individual units nor communication between units. We present a formal proof of convergence of homogeneous populations to a desynchronized state, as well as simulations that indicate that the algorithm is able to effectively dampen power consumption oscillations for both homogeneous and heterogeneous populations of thermostatically controlled loads.
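The synchronization problem described above is easy to reproduce in a toy model: hysteretic on/off units cycle between temperature thresholds, and a simultaneous setpoint step phase-locks them. The sketch below is not the paper's model or algorithm; the first-order thermal dynamics, parameter values, and the random threshold jitter (a stand-in for a decentralized desynchronization rule) are all illustrative assumptions:

```python
import random

class TCL:
    """Toy thermostatically controlled (cooling) load with on/off hysteresis.

    First-order thermal model with illustrative parameters; NOT the model
    from the paper, only a sketch of the setting it studies.
    """
    def __init__(self, setpoint=20.0, deadband=1.0, ambient=30.0,
                 leak=0.05, cool=1.0):
        self.setpoint, self.deadband = setpoint, deadband
        self.ambient, self.leak, self.cool = ambient, leak, cool
        self.T = random.uniform(setpoint - deadband, setpoint + deadband)
        self.on = random.random() < 0.5

    def step(self, jitter=0.0):
        # Heat leaks in from the ambient; the unit removes heat while on.
        self.T += self.leak * (self.ambient - self.T)
        if self.on:
            self.T -= self.cool
        # Hysteretic switching. A small random perturbation of the
        # thresholds spreads the on/off phases of a population apart
        # without any central coordination or communication.
        hi = self.setpoint + self.deadband + random.uniform(-jitter, jitter)
        lo = self.setpoint - self.deadband + random.uniform(-jitter, jitter)
        if self.T > hi:
            self.on = True
        elif self.T < lo:
            self.on = False

def aggregate_duty(units, steps, jitter=0.0):
    """Fraction of units drawing power at each time step."""
    trace = []
    for _ in range(steps):
        for u in units:
            u.step(jitter)
        trace.append(sum(u.on for u in units) / len(units))
    return trace
```

With `jitter=0` a population pushed in phase by a broadcast setpoint step tends to keep oscillating between mostly-on and mostly-off; a small jitter lets the trace settle near the steady-state duty cycle, which is the qualitative behaviour the paper's algorithm achieves in a principled way.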
In this article we analyze the optimal control strategy for rotating a monitored qubit from an initial pure state to an orthogonal state in minimum time. This strategy is described for two different cost functions of interest which do not have the usual regularity properties. Hence, as classically smooth cost functions may not exist, we interpret these functions as viscosity solutions to the optimal control problem. Specifically we prove their existence and uniqueness in this weak-solution setting. In addition, we also give bounds on the time optimal control to prepare any pure state from a mixed state.
A robust (deterministic) filtering approach to the problem of optimal sensor selection is considered herein. For a given system with several sensors, at each time step the output of one of the sensors must be chosen in order to obtain the best state estimate. We reformulate this problem in an optimal control framework which can then be solved using dynamic programming. In order to tackle the numerical computation of the solution in an efficient manner, we exploit the preservation of the min-plus structure of the optimal cost function when acted upon by the dynamic programming operator. This technique yields a grid-free numerical approach to the problem. Simulations on an example problem serve to highlight the efficacy of this generalizable approach to robust multi-sensor state estimation.
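The min-plus structure mentioned above can be made concrete in a few lines: in the (min, +) semiring a dynamic-programming (Bellman) backup is a "matrix-vector product", so cost-to-go vectors propagate linearly in that algebra. A toy illustration with hypothetical stage costs (not an example from the paper):

```python
import math

def min_plus_matvec(A, v):
    """Min-plus 'matrix-vector product': (A (x) v)_i = min_j (A[i][j] + v[j]).

    In the (min, +) semiring, min plays the role of addition and + the
    role of multiplication, so a Bellman backup is linear in this algebra.
    """
    return [min(a + x for a, x in zip(row, v)) for row in A]

INF = math.inf
# Hypothetical one-step costs between three abstract states.
A = [[0, 4, INF],
     [4, 0, 2],
     [INF, 2, 0]]
terminal = [0, INF, INF]          # cost-to-go at the final stage
one_step = min_plus_matvec(A, terminal)
two_step = min_plus_matvec(A, one_step)
print(one_step)  # [0, 4, inf]
print(two_step)  # [0, 4, 6]
```

Because the backup preserves this structure, the value function can be represented by a finite set of min-plus basis elements rather than values on a grid, which is what enables the grid-free computation.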
In this article we explore a modification of the problem of controlling the rotation of a two-level quantum system from an initial state to a final state in minimum time. Specifically, we consider the case where the qubit is being weakly monitored, with both the measurement strength and the angular velocity treated as control signals. This modification alters the dynamics significantly and enables the exploitation of the measurement backaction to assist in achieving the control objective. The proposed method yields a significant speedup in achieving the desired state transfer compared to previous approaches. These results are demonstrated via numerical solutions for an example problem on a single qubit.
The relationship between efficient quantum gate synthesis and control theory has been a topic of interest in the quantum control literature. Motivated by this work, we describe in the present article how the dynamic programming technique from optimal control may be used for the optimal synthesis of quantum circuits. We demonstrate simulation results on an example system on SU(2), to obtain plots related to the gate complexity and sample paths for different logic gates.

Commonly Cited References

The LHCb detector is a forward spectrometer at the Large Hadron Collider (LHC) at CERN. The experiment is designed for precision measurements of CP violation and rare decays of beauty and charm hadrons. In this paper the performance of the various LHCb sub-detectors and the trigger system are described, using data taken from 2010 to 2012. It is shown that the design criteria of the experiment have been met. The excellent performance of the detector has allowed the LHCb collaboration to publish a wide range of physics results, demonstrating LHCb's unique role, both as a heavy flavour experiment and as a general purpose detector in the forward region.
This paper presents the design of the LHCb trigger and its performance on data taken at the LHC in 2011. A principal goal of LHCb is to perform flavour physics measurements, and the trigger is designed to distinguish charm and beauty decays from the light quark background. Using a combination of lepton identification and measurements of the particles' transverse momenta the trigger selects particles originating from charm and beauty hadrons, which typically fly a finite distance before decaying. The trigger reduces the roughly 11 MHz of bunch-bunch crossings that contain at least one inelastic pp interaction to 3 kHz. This reduction takes place in two stages; the first stage is implemented in hardware and the second stage is a software application that runs on a large computer farm. A data-driven method is used to evaluate the performance of the trigger on several charm and beauty decay modes.
Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers - 8× deeper than VGG nets [40] but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
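The core reformulation, learning a residual F(x) and computing y = F(x) + x, fits in a few lines. A minimal fully-connected sketch (the paper's blocks are convolutional with batch normalization; `residual_block` here is an illustration of the idea, not the reference implementation):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2):
    """Minimal fully-connected residual block: y = relu(F(x) + x).

    F(x) = W2 @ relu(W1 @ x) is the learned residual; the identity
    shortcut adds x back, so with zero weights the block reduces to the
    identity (for non-negative inputs). Pushing a residual toward zero
    is easier for an optimizer than learning an identity mapping from
    scratch, which is why very deep stacks of such blocks train well.
    """
    return relu(W2 @ relu(W1 @ x) + x)

x = np.array([1.0, 2.0, 3.0])
W_zero = np.zeros((3, 3))
# With zero residual weights the block passes non-negative inputs through:
print(residual_block(x, W_zero, W_zero))  # [1. 2. 3.]
```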
The LHCb experiment has been taking data at the Large Hadron Collider (LHC) at CERN since the end of 2009. One of its key detector components is the Ring-Imaging Cherenkov (RICH) system. This provides charged particle identification over a wide momentum range, from 2 to 100 GeV/c. The operation and control, software, and online monitoring of the RICH system are described. The particle identification performance is presented, as measured using data from the LHC. Excellent separation of hadronic particle types (π, K, p) is achieved.
A critical review is given of the current status of cosmological nucleosynthesis. In the framework of the Standard Model with 3 types of relativistic neutrinos, the baryon-to-photon ratio, $\eta$, corresponding to the inferred primordial abundances of deuterium and helium-4 is consistent with the independent determination of $\eta$ from observations of anisotropies in the cosmic microwave background. However the primordial abundance of lithium-7 inferred from observations is significantly below its expected value. Taking systematic uncertainties in the abundance estimates into account, there is overall concordance in the range $\eta = (5.7-6.7)\times 10^{-10}$ at 95% CL (corresponding to a cosmological baryon density $\Omega_B h^2 = 0.021 - 0.025$). The D and He-4 abundances, when combined with the CMB determination of $\eta$, provide the bound $N_\nu=3.28 \pm 0.28$ on the effective number of neutrino species. Other constraints on new physics are discussed briefly.
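The quoted baryon density range follows from $\eta$ via the standard approximate conversion $\Omega_B h^2 \approx \eta_{10}/273.9$ with $\eta_{10} = 10^{10}\eta$; the coefficient is the commonly used approximation, not a number taken from this review, so treat the check below as a rough consistency test:

```python
def omega_b_h2(eta):
    """Convert the baryon-to-photon ratio eta to Omega_B h^2 using the
    standard approximate relation Omega_B h^2 ~= eta10 / 273.9, where
    eta10 = 1e10 * eta. The coefficient is approximate."""
    return eta * 1e10 / 273.9

lo, hi = omega_b_h2(5.7e-10), omega_b_h2(6.7e-10)
# ~0.021 and ~0.024: consistent with the quoted 0.021-0.025 range,
# up to rounding and the approximate coefficient.
print(f"{lo:.3f} - {hi:.3f}")
```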
The Review summarizes much of particle physics and cosmology. Using data from previous editions, plus 3,283 new measurements from 899 papers, we list, evaluate, and average measured properties of gauge bosons and the recently discovered Higgs boson, leptons, quarks, mesons, and baryons. We summarize searches for hypothetical particles such as heavy neutrinos, supersymmetric and technicolor particles, axions, dark photons, etc. All the particle properties and search limits are listed in Summary Tables. We also give numerous tables, figures, formulae, and reviews of topics such as Supersymmetry, Extra Dimensions, Particle Detectors, Probability, and Statistics. Among the 112 reviews are many that are new or heavily revised including those on: Dark Energy, Higgs Boson Physics, Electroweak Model, Neutrino Cross Section Measurements, Monte Carlo Neutrino Generators, Top Quark, Dark Matter, Dynamical Electroweak Symmetry Breaking, Accelerator Physics of Colliders, High-Energy Collider Parameters, Big Bang Nucleosynthesis, Astrophysical Constants and Cosmological Parameters.
The performance of the LHCb Muon system and its stability across the full 2010 data taking, with the LHC running at $\sqrt{s} = 7$ TeV, are studied. The optimization of the detector settings and the time calibration performed with the first collisions delivered by the LHC are described. Particle rates, measured for the wide range of luminosities and beam operation conditions experienced during the run, are compared with the values expected from simulation. The space and time alignment of the detectors, chamber efficiency, time resolution and cluster size are evaluated. The detector performance is found to be as expected from specifications or better. Notably, the overall efficiency is well above the design requirements.
Training machine learning models in parallel is an increasingly important workload. We accelerate distributed parallel training by designing a communication primitive that uses a programmable switch dataplane to execute a key step of the training process. Our approach, SwitchML, reduces the volume of exchanged data by aggregating the model updates from multiple workers in the network. We co-design the switch processing with the end-host protocols and ML frameworks to provide an efficient solution that speeds up training by up to 5.5$\times$ for a number of real-world benchmark models.
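The aggregation step SwitchML offloads to the switch can be sketched as chunked summation under a small fixed slot budget, since a switch dataplane has very little memory. This is an illustrative sketch only; the real protocol also handles fixed-point quantization, packet loss, and flow control, and `switch_aggregate` and its parameters are assumptions, not SwitchML's API:

```python
import numpy as np

def switch_aggregate(updates, slots=4):
    """Toy model of SwitchML-style in-network aggregation.

    Workers stream equal-sized chunks of their gradient to the 'switch',
    which holds only `slots` aggregation slots at a time, sums the
    corresponding chunk from every worker, and streams the result back.
    Each worker thus sends and receives the model once, instead of the
    roughly 2x traffic of host-based allreduce.
    """
    n = len(updates[0])
    out = np.empty(n)
    for start in range(0, n, slots):
        end = min(start + slots, n)
        # The switch sums the same chunk from every worker in-network.
        out[start:end] = np.sum([u[start:end] for u in updates], axis=0)
    return out

# Three hypothetical workers with 10-element gradients:
workers = [np.arange(10.0) * (i + 1) for i in range(3)]
agg = switch_aggregate(workers)
print(agg[:3])  # first entries: 0, 6, 12 (elementwise sum across workers)
```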
With the advent of deep learning, neural network-based recommendation models have emerged as an important tool for tackling personalization and recommendation tasks. These networks differ significantly from other deep learning networks due to their need to handle categorical features and are not well studied or understood. In this paper, we develop a state-of-the-art deep learning recommendation model (DLRM) and provide its implementation in both PyTorch and Caffe2 frameworks. In addition, we design a specialized parallelization scheme utilizing model parallelism on the embedding tables to mitigate memory constraints while exploiting data parallelism to scale-out compute from the fully-connected layers. We compare DLRM against existing recommendation models and characterize its performance on the Big Basin AI platform, demonstrating its usefulness as a benchmark for future algorithmic experimentation and system co-design.
We design and implement a distributed multinode synchronous SGD algorithm, without altering hyperparameters, compressing data, or altering algorithmic behavior. We perform a detailed analysis of scaling, and identify optimal design points for different networks. We demonstrate scaling of CNNs on 100s of nodes, and present what we believe to be record training throughputs. A 512 minibatch VGG-A CNN training run is scaled 90X on 128 nodes. Also 256 minibatch VGG-A and OverFeat-FAST networks are scaled 53X and 42X respectively on a 64 node cluster. We also demonstrate the generality of our approach via best-in-class 6.5X scaling for a 7-layer DNN on 16 nodes. Thereafter we attempt to democratize deep-learning by training on an Ethernet based AWS cluster and show ~14X scaling on 16 nodes.
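The reported speedups translate into scaling efficiency (measured speedup over the linear ideal), a quick check worth making explicit:

```python
def parallel_efficiency(speedup, nodes):
    """Scaling efficiency = measured speedup / ideal (linear) speedup."""
    return speedup / nodes

# Speedup figures quoted in the abstract above:
for name, s, n in [("VGG-A, 512 minibatch", 90, 128),
                   ("VGG-A, 256 minibatch", 53, 64),
                   ("OverFeat-FAST", 42, 64),
                   ("7-layer DNN", 6.5, 16)]:
    print(f"{name}: {parallel_efficiency(s, n):.0%}")
# VGG-A 512: 70%, VGG-A 256: 83%, OverFeat-FAST: 66%, 7-layer DNN: 41%
```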
The LHCb Outer Tracker is a gaseous detector covering an area of 5×6 m² with 12 double layers of straw tubes. The detector and its services are described, together with the commissioning and calibration procedures. Based on data of the first LHC running period from 2010 to 2012, the performance of the readout electronics and the single hit resolution and efficiency are presented. The efficiency to detect a hit in the central half of the straw is estimated to be 99.2%, and the position resolution is determined to be approximately 200 μm. The Outer Tracker received a dose in the hottest region corresponding to 0.12 C/cm, and no signs of gain deterioration or other ageing effects are observed.
Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC---called a Tensor Processing Unit (TPU) --- deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power. We compare the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters. Our workload, written in the high-level TensorFlow framework, uses production NN applications (MLPs, CNNs, and LSTMs) that represent 95% of our datacenters' NN inference demand. Despite low utilization for some applications, the TPU is on average about 15X -- 30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X -- 80X higher. Moreover, using the CPU's GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU.
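The quoted 92 TOPS peak follows directly from the matrix unit's dimensions and the clock rate. A back-of-envelope check, assuming the 700 MHz clock reported for the TPU (a multiply plus an accumulate counts as two operations per MAC per cycle):

```python
macs = 256 * 256          # 65,536 8-bit MACs in the matrix multiply unit
ops_per_mac = 2           # one multiply + one accumulate per cycle
clock_hz = 700e6          # assumed TPU clock rate (700 MHz)
peak_tops = macs * ops_per_mac * clock_hz / 1e12
print(f"{peak_tops:.1f} TOPS")  # 91.8 TOPS, i.e. the quoted ~92 TOPS peak
```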
The widespread application of deep learning has changed the landscape of computation in data centers. In particular, personalized recommendation for content ranking is now largely accomplished using deep neural networks. However, despite their importance and the amount of compute cycles they consume, relatively little research attention has been devoted to recommendation systems. To facilitate research and advance the understanding of these workloads, this paper presents a set of real-world, production-scale DNNs for personalized recommendation coupled with relevant performance metrics for evaluation. In addition to releasing a set of open-source workloads, we conduct in-depth analysis that underpins future system design and optimization for at-scale recommendation: Inference latency varies by 60% across three Intel server generations, batching and co-location of inference jobs can drastically improve latency-bounded throughput, and diversity across recommendation models leads to different optimization strategies.
The determination of track reconstruction efficiencies at LHCb using J/ψ→μ+μ- decays is presented. Efficiencies above 95% are found for the data taking periods in 2010, 2011, and 2012. The ratio of the track reconstruction efficiency of muons in data and simulation is compatible with unity and measured with an uncertainty of 0.8% for data taking in 2010, and at a precision of 0.4% for data taking in 2011 and 2012. For hadrons an additional 1.4% uncertainty due to material interactions is assumed. This result is crucial for accurate cross section and branching fraction measurements in LHCb.
Large deep learning models offer significant accuracy gains, but training billions to trillions of parameters is challenging. Existing solutions such as data and model parallelisms exhibit fundamental limitations to fit these models into limited device memory, while obtaining computation, communication and development efficiency. We develop a novel solution, Zero Redundancy Optimizer (ZeRO), to optimize memory, vastly improving training speed while increasing the model size that can be efficiently trained. ZeRO eliminates memory redundancies in data- and model-parallel training while retaining low communication volume and high computational granularity, allowing us to scale the model size proportional to the number of devices with sustained high efficiency. Our analysis on memory requirements and communication volume demonstrates: ZeRO has the potential to scale beyond 1 Trillion parameters using today's hardware. We implement and evaluate ZeRO: it trains large models of over 100B parameters with super-linear speedup on 400 GPUs, achieving throughput of 15 Petaflops. This represents an 8x increase in model size and 10x increase in achievable performance over state-of-the-art. In terms of usability, ZeRO can train large models of up to 13B parameters (e.g., larger than Megatron GPT 8.3B and T5 11B) without requiring model parallelism, which is harder for scientists to apply. Last but not least, researchers have used the system breakthroughs of ZeRO to create Turing-NLG, the world's largest language model at the time (17B parameters) with record breaking accuracy.
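The memory-redundancy argument can be made concrete with a toy sketch (an illustration of the idea, not ZeRO's implementation): partitioning a single optimizer-state buffer across the data-parallel workers, as in ZeRO's first stage, cuts the cluster-wide footprint of that buffer by a factor of the worker count:

```python
import numpy as np

def partition(n_params, n_workers):
    """Split parameter indices evenly across data-parallel workers, so each
    worker keeps optimizer state only for its own shard."""
    bounds = np.linspace(0, n_params, n_workers + 1).astype(int)
    return [(bounds[i], bounds[i + 1]) for i in range(n_workers)]

# Toy accounting: one fp32 momentum buffer (4 bytes/param).
n_params, n_workers = 1_000_000, 8
shards = partition(n_params, n_workers)

replicated_bytes = n_workers * n_params * 4            # plain data parallelism
zero_bytes = sum((hi - lo) * 4 for lo, hi in shards)   # sharded: one copy total
```

The same accounting applies to each additional redundant buffer (gradients, fp32 master weights), which is where the 8x model-size gain cited above comes from.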
Abstract: The relative production rate of $B_s^0$ and $B^0$ mesons is determined with the hadronic decays $B_s^0\to D_s^-\pi^+$ and $B^0\to D^-K^+$. The measurement uses data corresponding to 1.0 fb$^{-1}$ of $pp$ collisions at a centre-of-mass energy of $\sqrt{s}=7$ TeV recorded in the forward region with the LHCb experiment. The ratio of production rates, $f_s/f_d$, is measured to be $0.238 \pm 0.004 \pm 0.015 \pm 0.021$, where the first uncertainty is statistical, the second systematic, and the third theoretical. This is combined with a previous LHCb measurement to obtain $f_s/f_d = 0.256 \pm 0.020$. The dependence of $f_s/f_d$ on the transverse momentum and pseudorapidity of the $B$ meson is determined using the decays $B_s^0\to D_s^-\pi^+$ and $B^0\to D^-\pi^+$. There is evidence for a decrease with increasing transverse momentum, whereas the ratio remains constant as a function of pseudorapidity. In addition, the ratio of branching fractions of the decays $B^0\to D^-K^+$ and $B^0\to D^-\pi^+$ is measured to be $0.0822 \pm 0.0011\,\mathrm{(stat)} \pm 0.0025\,\mathrm{(syst)}$.
The Vertex Locator (VELO) is a silicon microstrip detector that surrounds the proton-proton interaction region in the LHCb experiment. The performance of the detector during the first years of its physics operation is reviewed. The system is operated in vacuum, uses a bi-phase CO2 cooling system, and the sensors are moved to 7 mm from the LHC beam for physics data taking. The performance and stability of these characteristic features of the detector are described, and details of the material budget are given. The calibration of the timing and the data processing algorithms that are implemented in FPGAs are described. The system performance is fully characterised. The sensors have a signal to noise ratio of approximately 20 and a best hit resolution of 4 microns is achieved at the optimal track angle. The typical detector occupancy for minimum bias events in standard operating conditions in 2011 is around 0.5%, and the detector has less than 1% of faulty strips. The proximity of the detector to the beam means that the inner regions of the n+-on-n sensors have undergone space-charge sign inversion due to radiation damage. The VELO performance parameters that drive the experiment's physics sensitivity are also given. The track finding efficiency of the VELO is typically above 98% and the modules have been aligned to a precision of 1 micron for translations in the plane transverse to the beam. A primary vertex resolution of 13 microns in the transverse plane and 71 microns along the beam axis is achieved for vertices with 25 tracks. An impact parameter resolution of less than 35 microns is achieved for particles with transverse momentum greater than 1 GeV/c.
Neural Machine Translation (NMT) is an end-to-end learning approach for automated translation, with the potential to overcome many of the weaknesses of conventional phrase-based translation systems. Unfortunately, NMT systems are known to be computationally expensive both in training and in translation inference. Also, most NMT systems have difficulty with rare words. These issues have hindered NMT's use in practical deployments and services, where both accuracy and speed are essential. In this work, we present GNMT, Google's Neural Machine Translation system, which attempts to address many of these issues. Our model consists of a deep LSTM network with 8 encoder and 8 decoder layers using attention and residual connections. To improve parallelism and therefore decrease training time, our attention mechanism connects the bottom layer of the decoder to the top layer of the encoder. To accelerate the final translation speed, we employ low-precision arithmetic during inference computations. To improve handling of rare words, we divide words into a limited set of common sub-word units ("wordpieces") for both input and output. This method provides a good balance between the flexibility of "character"-delimited models and the efficiency of "word"-delimited models, naturally handles translation of rare words, and ultimately improves the overall accuracy of the system. Our beam search technique employs a length-normalization procedure and uses a coverage penalty, which encourages generation of an output sentence that is most likely to cover all the words in the source sentence. On the WMT'14 English-to-French and English-to-German benchmarks, GNMT achieves competitive results to state-of-the-art.
Using a human side-by-side evaluation on a set of isolated simple sentences, it reduces translation errors by an average of 60% compared to Google's phrase-based production system.
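The length-normalization idea can be illustrated with the penalty form reported in the GNMT paper, lp(Y) = ((5 + |Y|)/6)^α. The sketch below omits the coverage penalty, and the example scores are made up:

```python
def length_penalty(length, alpha=0.6):
    """Length-normalization term ((5 + |Y|) / 6) ** alpha from the GNMT
    paper; alpha tunes how strongly longer outputs are favoured."""
    return ((5.0 + length) / 6.0) ** alpha

def normalized_score(log_prob, length, alpha=0.6):
    """Beam-search score: total log-probability divided by the penalty,
    so longer hypotheses are not unfairly punished."""
    return log_prob / length_penalty(length, alpha)

# Raw log-probability alone would prefer the short hypothesis (-4.0 > -5.0);
# normalization lets the longer one win.
short = normalized_score(-4.0, length=4)
long_ = normalized_score(-5.0, length=10)
```

Without this correction, beam search systematically favours shorter outputs, since every added token can only lower the total log-probability.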
In this paper, we study the design of pulse sequences for NMR spectroscopy as a problem of time optimal control of the unitary propagator. Radio frequency pulses are used in coherent spectroscopy to implement a unitary transfer of state. Pulse sequences that accomplish a desired transfer should be as short as possible in order to minimize the effects of relaxation and to optimize the sensitivity of the experiments. Here, we give an analytical characterization of such time optimal pulse sequences applicable to coherence transfer experiments in multiple-spin systems. We have adopted a general mathematical formulation, and present many of our results in this setting, mindful of the fact that new structures in optimal pulse design are constantly arising. Moreover, the general proofs are no more difficult than the specific problems of current interest. From a general control theory perspective, the problems we want to study have the following character. Suppose we are given a controllable right invariant system on a compact Lie group, what is the minimum time required to steer the system from some initial point to a specified final point? In NMR spectroscopy and quantum computing, this translates to, what is the minimum time required to produce a unitary propagator? We also give an analytical characterization of maximum achievable transfer in a given time for the two spin system.
Measurements of $b$ hadron production ratios in proton-proton collisions at a centre-of-mass energy of 7 TeV with an integrated luminosity of 3 pb$^{-1}$ are presented. We study the ratios of strange $B$ meson to light $B$ meson production $f_s/(f_u+f_d)$ and $\Lambda_b^0$ baryon to light $B$ meson production $f_{\Lambda_b}/(f_u+f_d)$ as a function of the charmed hadron-muon pair transverse momentum $p_T$ and the $b$ hadron pseudorapidity $\eta$, for $p_T$ between 0 and 14 GeV and $\eta$ between 2 and 5. We find that $f_s/(f_u+f_d)$ is consistent with being independent of $p_T$ and $\eta$, and we determine $f_s/(f_u+f_d) = 0.134 \pm 0.004\,^{+0.011}_{-0.010}$, where the first error is statistical and the second systematic. The corresponding ratio $f_{\Lambda_b}/(f_u+f_d)$ is found to be dependent upon the transverse momentum of the charmed hadron-muon pair, $f_{\Lambda_b}/(f_u+f_d) = (0.404 \pm 0.017\,\mathrm{(stat)} \pm 0.027\,\mathrm{(syst)} \pm 0.105\,\mathrm{(Br)})\times[1 - (0.031 \pm 0.004\,\mathrm{(stat)} \pm 0.003\,\mathrm{(syst)})\times p_T(\mathrm{GeV})]$, where Br reflects an absolute scale uncertainty due to the poorly known branching fraction $Br(\Lambda_c^+ \to pK^-\pi^+)$. We extract the ratio of strange $B$ meson to light neutral $B$ meson production $f_s/f_d$ by averaging the result reported here with two previous measurements derived from the relative abundances of $\bar{B}_s \to D_s^+\pi^-$ to $\bar{B}^0 \to D^+K^-$ and $\bar{B}^0 \to D^+\pi^-$. We obtain $f_s/f_d = 0.267^{+0.021}_{-0.020}$.
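The final averaging step is, in the simplest uncorrelated case, an inverse-variance weighted mean (the actual LHCb combination accounts for correlated systematics; the numbers below are made up for illustration):

```python
import math

def combine(measurements):
    """Inverse-variance weighted average of (value, sigma) pairs -- the
    textbook way to combine independent measurements of one quantity."""
    weights = [1.0 / s**2 for _, s in measurements]
    value = sum(w * v for (v, _), w in zip(measurements, weights)) / sum(weights)
    sigma = math.sqrt(1.0 / sum(weights))
    return value, sigma

# Hypothetical inputs, not the LHCb measurements:
val, sig = combine([(0.25, 0.02), (0.27, 0.03)])
```

The combined uncertainty is always smaller than the best individual one, which is why averaging independent determinations of $f_s/f_d$ is worthwhile.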
Machine-learning (ML) hardware and software system demand is burgeoning. Driven by ML applications, the number of different ML inference systems has exploded. Over 100 organizations are building ML inference chips, and the systems that incorporate existing models span at least three orders of magnitude in power consumption and five orders of magnitude in performance; they range from embedded devices to data-center solutions. Fueling the hardware are a dozen or more software frameworks and libraries. The myriad combinations of ML hardware and ML software make assessing ML-system performance in an architecture-neutral, representative, and reproducible manner challenging. There is a clear need for industry-wide standard ML benchmarking and evaluation criteria. MLPerf Inference answers that call. In this paper, we present our benchmarking method for evaluating ML inference systems. Driven by more than 30 organizations as well as more than 200 ML engineers and practitioners, MLPerf prescribes a set of rules and best practices to ensure comparability across systems with wildly differing architectures. The first call for submissions garnered more than 600 reproducible inference-performance measurements from 14 organizations, representing over 30 systems that showcase a wide range of capabilities. The submissions attest to the benchmark's flexibility and adaptability.
We prove upper and lower bounds relating the quantum gate complexity of a unitary operation, $U$, to the optimal control cost associated to the synthesis of $U$. These bounds apply for any optimal control problem, and can be used to show that the quantum gate complexity is essentially equivalent to the optimal control cost for a wide range of problems, including time-optimal control and finding minimal distances on certain Riemannian, sub-Riemannian, and Finslerian manifolds. These results generalize the results of [Nielsen, Dowling, Gu, and Doherty, Science 311, 1133 (2006)], which showed that the gate complexity can be related to distances on a Riemannian manifold.
Neural network scaling has been critical for improving the model quality in many real-world machine learning applications with vast amounts of training data and compute. Although this trend of scaling is affirmed to be a sure-fire approach for better model quality, there are challenges on the path such as the computation cost, ease of programming, and efficient implementation on parallel devices. GShard is a module composed of a set of lightweight annotation APIs and an extension to the XLA compiler. It provides an elegant way to express a wide range of parallel computation patterns with minimal changes to the existing model code. GShard enabled us to scale up multilingual neural machine translation Transformer model with Sparsely-Gated Mixture-of-Experts beyond 600 billion parameters using automatic sharding. We demonstrate that such a giant model can efficiently be trained on 2048 TPU v3 accelerators in 4 days to achieve far superior quality for translation from 100 languages to English compared to the prior art.
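The routing rule at the heart of a Sparsely-Gated Mixture-of-Experts layer can be sketched as top-2 softmax gating (a minimal sketch of the routing alone; GShard additionally imposes expert capacity limits and an auxiliary load-balancing loss):

```python
import numpy as np

def top2_gate(logits):
    """Route each token to its two highest-scoring experts with
    renormalized gate weights. logits has shape (tokens, experts)."""
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)        # softmax over experts
    top2 = np.argsort(probs, axis=-1)[:, -2:]         # indices of the best two
    weights = np.take_along_axis(probs, top2, axis=-1)
    weights /= weights.sum(axis=-1, keepdims=True)    # renormalize the pair
    return top2, weights

logits = np.array([[2.0, 0.1, 1.0, -1.0]])            # one token, 4 experts
experts, w = top2_gate(logits)
```

Because only two experts run per token, compute per token stays roughly constant while total parameter count grows with the number of experts.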
High performance multi-GPU computing becomes an inevitable trend due to the ever-increasing demand on computation capability in emerging domains such as deep learning, big data and planet-scale simulations. However, the lack of deep understanding of how modern GPUs can be connected and the real impact of state-of-the-art interconnect technology on multi-GPU application performance become a hurdle. In this paper, we fill the gap by conducting a thorough evaluation on five latest types of modern GPU interconnects: PCIe, NVLink-V1, NVLink-V2, NVLink-SLI and NVSwitch, from six high-end servers and HPC platforms: NVIDIA P100-DGX-1, V100-DGX-1, DGX-2, OLCF's SummitDev and Summit supercomputers, as well as an SLI-linked system with two NVIDIA Turing RTX-2080 GPUs. Based on the empirical evaluation, we have observed four new types of GPU communication network NUMA effects: three are triggered by NVLink's topology, connectivity and routing, while one is caused by a PCIe chipset design issue. These observations indicate that, for an application running in a multi-GPU node, choosing the right GPU combination can impose considerable impact on GPU communication efficiency, as well as the application's overall performance. Our evaluation can be leveraged in building practical multi-GPU performance models, which are vital for GPU task allocation, scheduling and migration in a shared environment (e.g., AI cloud and HPC centers), as well as communication-oriented performance tuning.
We develop dynamical programming methods for the purpose of optimal control of quantum states with convex constraints and concave cost and bequest functions of the quantum state. We consider both open loop and feedback control schemes, which correspond, respectively, to deterministic and stochastic master equation dynamics. For the quantum feedback control scheme with continuous nondemolition observations, we exploit the separation theorem of filtering and control aspects for quantum stochastic dynamics to derive a generalized Hamilton-Jacobi-Bellman equation. If the control is restricted to only Hamiltonian terms this is equivalent to a Hamilton-Jacobi equation with an extra linear dissipative term. In this work, we consider, in particular, the case when control is restricted only to observation. A controlled qubit is considered as an example throughout the development of the formalism. Finally, we discuss optimum observation strategies to obtain a pure state from a mixed state of a quantum two-level system.
Training modern deep learning models requires large amounts of computation, often provided by GPUs. Scaling computation from one GPU to many can enable much faster training and research progress but entails two complications. First, the training library must support inter-GPU communication. Depending on the particular methods employed, this communication may entail anywhere from negligible to significant overhead. Second, the user must modify his or her training code to take advantage of inter-GPU communication. Depending on the training library's API, the modification required may be either significant or minimal. Existing methods for enabling multi-GPU training under the TensorFlow library entail non-negligible communication overhead and require users to heavily modify their model-building code, leading many researchers to avoid the whole mess and stick with slower single-GPU training. In this paper we introduce Horovod, an open source library that improves on both obstructions to scaling: it employs efficient inter-GPU communication via ring reduction and requires only a few lines of modification to user code, enabling faster, easier distributed training in TensorFlow. Horovod is available under the Apache 2.0 license at this https URL
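The ring reduction Horovod relies on can be simulated to show its message pattern: a reduce-scatter pass followed by an all-gather pass, each of n-1 steps. This is an in-process sketch of the algorithm, not Horovod's NCCL/MPI-backed implementation:

```python
import numpy as np

def ring_allreduce(vectors):
    """Simulate ring all-reduce over n workers. Each worker sends 2*(n-1)
    chunk-sized messages, so per-worker traffic is nearly independent of n."""
    n = len(vectors)
    chunks = [list(np.array_split(np.asarray(v, dtype=float), n)) for v in vectors]
    # Reduce-scatter: at each step, worker i sends chunk (i - step) % n to its
    # right neighbour, which accumulates it into the same chunk slot.
    for step in range(n - 1):
        sent = [chunks[i][(i - step) % n].copy() for i in range(n)]
        for i in range(n):
            left = (i - 1) % n
            chunks[i][(left - step) % n] += sent[left]
    # All-gather: worker i now owns the fully reduced chunk (i + 1) % n, and
    # the ring circulates each reduced chunk to every worker.
    for step in range(n - 1):
        sent = [chunks[i][(i + 1 - step) % n].copy() for i in range(n)]
        for i in range(n):
            left = (i - 1) % n
            chunks[i][(left + 1 - step) % n] = sent[left]
    return [np.concatenate(c) for c in chunks]

grads = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0], [100.0, 200.0, 300.0, 400.0]]
reduced = ring_allreduce(grads)
```

After both passes, every worker holds the element-wise sum of all gradients, which is exactly what synchronous data-parallel SGD needs.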
Scaling the distributed deep learning to a massive GPU cluster level is challenging due to the instability of the large mini-batch training and the overhead of the gradient synchronization. We address the instability of the large mini-batch training with batch-size control and label smoothing. We address the overhead of the gradient synchronization with 2D-Torus all-reduce. Specifically, 2D-Torus all-reduce arranges GPUs in a logical 2D grid and performs a series of collective operations in different orientations. These two techniques are implemented with Neural Network Libraries (NNL). We have successfully trained ImageNet/ResNet-50 in 122 seconds without significant accuracy loss on ABCI cluster.
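The two-orientation structure of the 2D-Torus all-reduce can be sketched by reducing first along rows of the logical grid and then along columns (the actual scheme pipelines reduce-scatter and all-gather phases within each orientation; this sketch keeps only the grid decomposition):

```python
import numpy as np

def torus_allreduce(grid):
    """Sketch of a 2D-grid all-reduce: all-reduce each row, then each
    column. grid[r][c] is worker (r, c)'s gradient vector."""
    rows, cols = len(grid), len(grid[0])
    g = [[np.asarray(v, dtype=float) for v in row] for row in grid]
    # Horizontal phase: every worker in a row ends with the row sum.
    for r in range(rows):
        row_sum = sum(g[r])
        for c in range(cols):
            g[r][c] = row_sum.copy()
    # Vertical phase: every worker in a column ends with the global sum.
    for c in range(cols):
        col_sum = sum(g[r][c] for r in range(rows))
        for r in range(rows):
            g[r][c] = col_sum.copy()
    return g

synced = torus_allreduce([[[1.0], [2.0]], [[3.0], [4.0]]])
```

Splitting one big ring into two short ones per orientation keeps each collective's step count small, which is the source of the reduced synchronization overhead on large clusters.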
High-level triggering is a vital component of many modern particle physics experiments. This paper describes a modification to the standard boosted decision tree (BDT) classifier, the so-called bonsai BDT, that has the following important properties: it is more efficient than traditional cut-based approaches, it is robust against detector instabilities, and it is very fast. Thus, it is fit-for-purpose for the online running conditions faced by any large-scale data acquisition system.
The application of deep learning techniques has resulted in remarkable improvements in machine learning models. This paper provides detailed characterizations of deep learning models used in many Facebook social network services. We present computational characteristics of our models, describe high performance optimizations targeting existing systems, point out their limitations and make suggestions for the future general-purpose/accelerated inference hardware. Also, we highlight the need for better co-design of algorithms, numerics and computing platforms to address the challenges of workloads often run in data centers.
Large-scale distributed training requires significant communication bandwidth for gradient exchange that limits the scalability of multi-node training, and requires expensive high-bandwidth network infrastructure. The situation gets even worse with distributed training on mobile devices (federated learning), which suffers from higher latency, lower throughput, and intermittent poor connections. In this paper, we find 99.9% of the gradient exchange in distributed SGD is redundant, and propose Deep Gradient Compression (DGC) to greatly reduce the communication bandwidth. To preserve accuracy during compression, DGC employs four methods: momentum correction, local gradient clipping, momentum factor masking, and warm-up training. We have applied Deep Gradient Compression to image classification, speech recognition, and language modeling with multiple datasets including Cifar10, ImageNet, Penn Treebank, and Librispeech Corpus. In these scenarios, Deep Gradient Compression achieves a gradient compression ratio from 270x to 600x without losing accuracy, cutting the gradient size of ResNet-50 from 97MB to 0.35MB, and for DeepSpeech from 488MB to 0.74MB. Deep Gradient Compression enables large-scale distributed training on inexpensive commodity 1Gbps Ethernet and facilitates distributed training on mobile.
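The core of the compression scheme, sending only the largest-magnitude gradient entries while accumulating the remainder locally, can be sketched as follows (a minimal illustration of top-k sparsification with a residual; DGC's momentum correction, clipping, and warm-up are omitted):

```python
import numpy as np

def sparsify(grad, residual, ratio=0.01):
    """Send only the top `ratio` fraction of (gradient + carried residual)
    by magnitude; keep the rest in `residual` for later iterations."""
    acc = grad + residual
    k = max(1, int(acc.size * ratio))
    idx = np.argpartition(np.abs(acc), -k)[-k:]  # indices of top-k entries
    sent = np.zeros_like(acc)
    sent[idx] = acc[idx]                          # sparse update to transmit
    new_residual = acc - sent                     # deferred, not discarded
    return sent, new_residual

g = np.array([0.01, -0.5, 0.02, 0.3, -0.01])
sent, res = sparsify(g, np.zeros_like(g), ratio=0.4)
```

Because the unsent entries are carried forward rather than dropped, every gradient component is eventually applied, which is key to preserving accuracy at 99%+ sparsity.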
The stochastic gradient descent (SGD) method and its variants are algorithms of choice for many Deep Learning tasks. These methods operate in a small-batch regime wherein a fraction of the training data, say $32$-$512$ data points, is sampled to compute an approximation to the gradient. It has been observed in practice that when using a larger batch there is a degradation in the quality of the model, as measured by its ability to generalize. We investigate the cause for this generalization drop in the large-batch regime and present numerical evidence that supports the view that large-batch methods tend to converge to sharp minimizers of the training and testing functions - and as is well known, sharp minima lead to poorer generalization. In contrast, small-batch methods consistently converge to flat minimizers, and our experiments support a commonly held view that this is due to the inherent noise in the gradient estimation. We discuss several strategies to attempt to help large-batch methods eliminate this generalization gap.
The creation of practical deep learning data-products often requires parallelization across processors and computers to make deep learning feasible on large data sets, but bottlenecks in communication bandwidth make it difficult to attain good speedups through parallelism. Here we develop and test 8-bit approximation algorithms which make better use of the available bandwidth by compressing 32-bit gradients and nonlinear activations to 8-bit approximations. We show that these approximations do not decrease predictive performance on MNIST, CIFAR10, and ImageNet for both model and data parallelism and provide a data transfer speedup of 2x relative to 32-bit parallelism. We build a predictive model for speedups based on our experimental data, verify its validity on known speedup data, and show that we can obtain a speedup of 50x and more on a system of 96 GPUs compared to a speedup of 23x for 32-bit. We compare our data types with other methods and show that 8-bit approximations achieve state-of-the-art speedups for model parallelism. Thus 8-bit approximation is an efficient method to parallelize convolutional networks on very large systems of GPUs.
The resonant structure of the reaction $\overline{B}^0\to J/\psi\,\pi^+\pi^-$ is studied using data from 3 fb$^{-1}$ of integrated luminosity collected by the LHCb experiment, one third at 7 TeV center-of-mass energy and the remainder at 8 TeV. The invariant mass of the $\pi^+\pi^-$ pair and three decay angular distributions are used to determine the fractions of the resonant and nonresonant components. Six interfering $\pi^+\pi^-$ states, $\rho(770)$, $f_0(500)$, $f_2(1270)$, $\rho(1450)$, $\omega(782)$ and $\rho(1700)$, are required to give a good description of invariant mass spectra and decay angular distributions. The positive and negative charge parity fractions of each of the resonant final states are determined. The $f_0(980)$ meson is not seen and the upper limit on its presence, compared with the observed $f_0(500)$ rate, is inconsistent with a model where these scalar mesons are formed from two quarks and two antiquarks (tetraquarks) at the eight standard deviation level. In the $q\bar{q}$ model, the absolute value of the mixing angle between the $f_0(980)$ and the $f_0(500)$ scalar mesons is limited to be less than 17° at 90% confidence level.
The resonant substructure of $B_s^0\to\overline{D}^0 K^-\pi^+$ decays is studied with the Dalitz plot analysis technique. The study is based on a data sample corresponding to an integrated luminosity of 3.0 fb$^{-1}$ of $pp$ collision data recorded by LHCb. A structure at $m(\overline{D}^0 K^-)\approx 2.86$ GeV/$c^2$ is found to be an admixture of spin-1 and spin-3 resonances. The masses and widths of these states and of the $D_{s2}^*(2573)^-$ meson are measured, as are the complex amplitudes and fit fractions for all the $\overline{D}^0 K^-$ and $K^-\pi^+$ components included in the amplitude model. In addition, the $D_{s2}^*(2573)^-$ resonance is confirmed to be spin 2. Received 30 July 2014. DOI: https://doi.org/10.1103/PhysRevD.90.072003. © 2014 CERN, for the LHCb Collaboration (Creative Commons Attribution 3.0 License).
In this paper, we demonstrate how optimal control methods can be used to speed up the implementation of modules of quantum algorithms or quantum simulations in networks of coupled qubits. The gain is most prominent in realistic cases, where the qubits are not all mutually coupled. Thus the shortest times obtained depend on the coupling topology as well as on the characteristic ratio of the time scales for local controls vs nonlocal (i.e., coupling) evolutions in the specific experimental setting. Relating these minimal times to the number of qubits gives the tightest known upper bounds to the actual time complexity of the quantum modules. As will be shown, time complexity is a more realistic measure of the experimental cost than the usual gate complexity. In the limit of fast local controls (as, e.g., in NMR), time-optimized realizations are shown for the quantum Fourier transform (QFT) and the multiply controlled NOT gate ($\mathrm{C}^{n-1}\mathrm{NOT}$) in various coupling topologies of $n$ qubits. The speed-ups are substantial: in a chain of six qubits the quantum Fourier transform so far obtained by optimal control is more than eight times faster than the standard decomposition into controlled phase, Hadamard and SWAP gates, while the $\mathrm{C}^{n-1}\mathrm{NOT}$ gate for a completely coupled network of six qubits is nearly seven times faster.
Bottom baryons decaying to a J/ψ meson and a hyperon are reconstructed using 1.0 fb(-1) of data collected in 2011 with the LHCb detector. Significant Λ(b)(0) → J/ψΛ, Ξ(b)(-) → J/ψΞ(-) and Ω(b)(-) → J/ψΩ(-) signals are observed and the corresponding masses are measured to be M(Λ(b)(0)) = 5619.53 ± 0.13(stat.) ± 0.45(syst.) MeV/c(2), M(Ξ(b)(-)) = 5795.8 ± 0.9(stat.) ± 0.4(syst.) MeV/c(2), M(Ω(b)(-)) = 6046.0 ± 2.2(stat.) ± 0.5(syst.) MeV/c(2), while the differences with respect to the Λ(b)(0) mass are M(Ξ(b)(-))-M(Λ(b)(0)) = 176.2 ± 0.9(stat.) ± 0.1(syst.) MeV/c(2), M(Ω(b)(-))-M(Λ(b)(0)) = 426.4 ± 2.2(stat.) ± 0.4(syst.) MeV/c(2). These are the most precise mass measurements of the Λ(b)(0), Ξ(b)(-) and Ω(b)(-) baryons to date. Averaging the above Λ(b)(0) mass measurement with that published by LHCb using 35 pb(-1) of data collected in 2010 yields M(Λ(b)(0)) = 5619.44 ± 0.13(stat.) ± 0.38(syst.) MeV/c(2).
Measurements of $b$-hadron lifetimes are reported using $pp$ collision data, corresponding to an integrated luminosity of 1.0 fb$^{-1}$, collected by the LHCb detector at a centre-of-mass energy of 7 TeV. Using the exclusive decays $B^+\to J/\psi K^+$, $B^0\to J/\psi K^*(892)^0$, $B^0\to J/\psi K^0_{\rm S}$, $\Lambda_b^0\to J/\psi \Lambda$ and $B^0_s\to J/\psi \phi$ the average decay times in these modes are measured to be $\tau_{B^+\to J/\psi K^+}$ = $1.637 \pm$ 0.004 $\pm$ 0.003 ps, $\tau_{B^0\to J/\psi K^*(892)^0}$ = $1.524 \pm$ 0.006 $\pm$ 0.004 ps, $\tau_{B^0\to J/\psi K^0_{\rm S}}$ = $1.499 \pm$ 0.013 $\pm$ 0.005 ps, $\tau_{\Lambda_b^0\to J/\psi \Lambda}$ = $1.415 \pm$ 0.027 $\pm$ 0.006 ps and $\tau_{B^0_s\to J/\psi \phi}$ = $1.480 \pm$ 0.011 $\pm$ 0.005 ps, where the first uncertainty is statistical and the second is systematic. These represent the most precise lifetime measurements in these decay modes. In addition, ratios of these lifetimes, and the ratio of the decay-width difference, $\Delta\Gamma_d$, to the average width, $\Gamma_d$, in the $B^0$ system, $\Delta \Gamma_d/\Gamma_d = -0.044 \pm 0.025 \pm 0.011$, are reported. All quantities are found to be consistent with Standard Model expectations.
Resonant structures in $B^0 \to \psi' \pi^- K^+$ decays are analyzed by performing a four-dimensional fit of the decay amplitude, using $pp$ collision data corresponding to 3 fb$^{-1}$ collected with the LHCb detector. The data cannot be described with $K^+\pi^-$ resonances alone, which is confirmed with a model-independent approach. A highly significant $Z(4430)^- \to \psi'\pi^-$ component is required, thus confirming the existence of this state. The observed evolution of the $Z(4430)^-$ amplitude with the $\psi'\pi^-$ mass establishes the resonant nature of this particle. The mass and width measurements are substantially improved. The spin parity is determined unambiguously to be $1^+$.
We present an interpretation of Inception modules in convolutional neural networks as being an intermediate step in-between regular convolution and the depthwise separable convolution operation (a depthwise convolution followed by a pointwise convolution). In this light, a depthwise separable convolution can be understood as an Inception module with a maximally large number of towers. This observation leads us to propose a novel deep convolutional neural network architecture inspired by Inception, where Inception modules have been replaced with depthwise separable convolutions. We show that this architecture, dubbed Xception, slightly outperforms Inception V3 on the ImageNet dataset (which Inception V3 was designed for), and significantly outperforms Inception V3 on a larger image classification dataset comprising 350 million images and 17,000 classes. Since the Xception architecture has the same number of parameters as Inception V3, the performance gains are not due to increased capacity but rather to a more efficient use of model parameters.
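The efficiency argument behind depthwise separable convolutions can be made concrete with a parameter count: a regular convolution couples spatial filtering and channel mixing in one kernel, while the separable version factors them into a per-channel depthwise step plus a $1{\times}1$ pointwise step. A small arithmetic sketch (function names are ours, and biases are ignored for simplicity):

```python
def regular_conv_params(c_in, c_out, k):
    # One k x k kernel spanning all input channels, per output channel.
    return c_in * c_out * k * k

def separable_conv_params(c_in, c_out, k):
    # Depthwise: one k x k kernel per input channel (spatial filtering only),
    # followed by pointwise: a 1 x 1 conv mixing channels.
    depthwise = c_in * k * k
    pointwise = c_in * c_out
    return depthwise + pointwise

# Typical mid-network layer: 256 -> 256 channels, 3 x 3 kernels.
regular = regular_conv_params(256, 256, 3)    # 589824
separable = separable_conv_params(256, 256, 3)  # 67840
ratio = regular / separable                     # roughly 8.7x fewer parameters
```

At equal parameter budgets this factorization lets the network spend capacity on more layers or channels, which is the "more efficient use of model parameters" the abstract refers to.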
The decay ${\overline{B}}_{s}^{0}\to J/\psi \pi^+\pi^-$ can be exploited to study $CP$ violation. A detailed understanding of its structure is imperative in order to optimize its usefulness. An analysis of this three-body final state is performed using a 1.0 fb$^{-1}$ sample of data produced in 7 TeV $pp$ collisions at the LHC and collected by the LHCb experiment. A modified Dalitz plot analysis of the final state is performed using both the invariant mass spectra and the decay angular distributions. The $\pi^+\pi^-$ system is shown to be dominantly in an $S$-wave state, and the $CP$-odd fraction in this ${\overline{B}}_{s}^{0}$ decay is shown to be greater than 0.977 at 95% confidence level. In addition, we report the first measurement of the $J/\psi\pi^+\pi^-$ branching fraction relative to $J/\psi\phi$ of $(19.79 \pm 0.47 \pm 0.52)\%$.
Using three- and four-body decays of $D$ mesons produced in semileptonic $b$-hadron decays, precision measurements of $D$ meson mass differences are made together with a measurement of the $D^0$ mass. The measurements are based on a dataset corresponding to an integrated luminosity of 1.0 fb$^{-1}$ collected in $pp$ collisions at 7 TeV. Using the decay $D^0 \to K^+K^-K^-\pi^+$, the $D^0$ mass is measured to be $M(D^0) = 1864.75 \pm 0.15(\mathrm{stat}) \pm 0.11(\mathrm{syst})$ MeV/$c^2$. The mass differences $M(D^+) - M(D^0) = 4.76 \pm 0.12(\mathrm{stat}) \pm 0.07(\mathrm{syst})$ MeV/$c^2$ and $M(D_s^+) - M(D^+) = 98.68 \pm 0.03(\mathrm{stat}) \pm 0.04(\mathrm{syst})$ MeV/$c^2$ are measured using the $D^0 \to K^+K^-\pi^+\pi^-$ and $D_{(s)}^+ \to K^+K^-\pi^+$ modes.
All the available experimental information on open charm and beauty mesons is used to classify the observed states in heavy quark doublets. The masses of some of the still unobserved states are predicted, in particular in the beauty sector. Adopting an effective Lagrangian approach based on heavy quark and chiral symmetry, individual decay rates and ratios of branching fractions are computed, with results useful to assign the quantum numbers to recently observed charmed states which still need to be properly classified. Implications and predictions for the corresponding beauty mesons are provided. The experimental results are already copious, and are expected to grow thanks to the experiments at the LHC and to the future high-luminosity flavor and $p\bar{p}$ facilities.
The resonant substructure of $B_s^0 \to \bar{D}^0 K^- \pi^+$ decays is studied using a data sample corresponding to an integrated luminosity of 3.0 fb$^{-1}$ of $pp$ collision data recorded by the LHCb detector. An excess at $m(\bar{D}^0 K^-) \approx 2.86$ GeV/$c^2$ is found to be an admixture of spin-1 and spin-3 resonances. Therefore, the $D_{sJ}^*(2860)^-$ state previously observed in inclusive $e^+e^- \to \bar{D}^0 K^- X$ and $pp \to \bar{D}^0 K^- X$ processes consists of at least two particles. This is the first observation of a heavy flavored spin-3 resonance, and the first time that any spin-3 particle has been seen to be produced in $B$ decays. The masses and widths of the new states and of the $D_{s2}^*(2573)^-$ meson are measured, giving the most precise determinations to date. Received 30 July 2014. DOI: 10.1103/PhysRevLett.113.162001. © 2014 CERN, for the LHCb collaboration.