Optimized GPU kernels are sufficiently complicated to write that they often are specialized to specific input data, target architectures, or applications. This paper presents a multi-search abstraction for computing multiple breadth-first searches in parallel and demonstrates a high-performance, general implementation. Our abstraction removes the burden of orchestrating graph traversal from the user while providing high performance and low energy usage, an often overlooked component of algorithm design. Energy consumption has become a first-class hardware design constraint for both massive and embedded computing platforms. Our abstraction can be applied to such problems as the all-pairs shortest-path problem, community detection, reachability querying, and others. To map graph traversal efficiently to NVIDIA GPUs, our hybrid implementation chooses between processing active vertices with a single thread or an entire warp based on vertex outdegree. For a set of twelve varied graphs, the implementation of our abstraction saves 42% time and 62% energy on average compared to representative implementations of specific applications from existing literature.
High-performance graph analysis is unlocking knowledge in problems like anomaly detection in computer security, community structure in social networks, and many other data integration areas. While graphs provide a convenient abstraction, real-world problems' sparsity and lack of locality challenge current systems. This talk will cover current trends ranging from massive scales to low-power, low-latency systems and summarize opportunities and directions for graphs and computing systems.
Applications of high-performance graph analysis range from computational biology to network security and even transportation. These applications often consider graphs under rapid change and are moving beyond HPC platforms into energy-constrained embedded systems. This paper optimizes one successful and demanding analysis kernel, betweenness centrality, for NVIDIA GPU accelerators in both environments. Our algorithm for static analysis is capable of exceeding 2 million traversed edges per second per watt (MTEPS/W). Optimizing the parallel algorithm and treating the dynamic problem directly achieves a 6.39× average speed-up and 84% average reduction in energy consumption.
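For readers unfamiliar with the kernel, the following is a minimal sequential sketch of betweenness centrality in the style of Brandes' algorithm for unweighted, undirected graphs. It is not the GPU or dynamic algorithm described above. The function name and the adjacency-dictionary input format are assumptions made for illustration.

```python
from collections import deque


def betweenness_centrality(adj):
    """Brandes-style betweenness centrality (sequential illustration).

    adj maps each vertex to a list of neighbors and is assumed to be
    symmetric (undirected graph). Each unordered pair of endpoints is
    counted twice, once per direction.
    """
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # Forward phase: BFS from s, counting shortest paths (sigma).
        dist = {s: 0}
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
        # Backward phase: accumulate dependencies in reverse BFS order.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in adj[w]:
                if dist[v] == dist[w] - 1:  # v precedes w on a shortest path
                    delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc
```

On the three-vertex path 0-1-2, only the middle vertex lies on a shortest path between other vertices, so it is the only vertex with a nonzero score.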
DNA sequence analysis is fundamental to life science research. The rapid development of next generation sequencing (NGS) technologies, and the richness and diversity of applications it makes feasible, have created an enormous gulf between the potential of this technology and the development of computational methods to realize this potential. Bridging this gap holds possibilities for broad impacts toward multiple grand challenges and offers unprecedented opportunities for software innovation and research. We argue that NGS-enabled applications need a critical mass of sustainable software to benefit from emerging computing platforms' transformative potential. Accumulating the necessary critical mass will require leaders in computational biology, bioinformatics, computer science, and computer engineering to work together to identify core opportunity areas, critical software infrastructure, and software sustainability challenges. Furthermore, due to the quickly changing nature of both bioinformatics software and accelerator technology, we conclude that creating sustainable accelerated bioinformatics software means constructing a sustainable bridge between the two fields. In particular, sustained collaboration between domain developers and technology experts is needed to develop the accelerated kernels, libraries, frameworks and middleware that could provide the needed flexible link from NGS bioinformatics applications to emerging platforms.
The digital world has given rise to massive quantities of data that include rich semantic and complex networks. A social graph, for example, containing hundreds of millions of actors and tens of billions of relationships is not uncommon. Analyzing these large data sets, even to answer simple analytic queries, often pushes the limits of algorithms and machine architectures. We present GraphCT, a scalable framework for graph analysis using parallel and multithreaded algorithms on shared memory platforms. Utilizing the unique characteristics of the Cray XMT, GraphCT enables fast network analysis at unprecedented scales on a variety of input data sets. On a synthetic power law graph with 2 billion vertices and 17 billion edges, we can find the connected components in 2 minutes. We can estimate the betweenness centrality of a similar graph with 537 million vertices and over 8 billion edges in under 1 hour. GraphCT is built for portability and performance.
Handling the constant stream of data from health care, security, business, and social network applications requires new algorithms and data structures. We present a new approach for parallel massive analysis of streaming, temporal, graph-structured data. For this purpose we examine data structure and algorithm trade-offs that extract the parallelism necessary for high-performance updating analysis of massive graphs. As a result of this study, we propose the extensible and flexible data structure for massive graphs called STINGER (Spatio-Temporal Interaction Networks and Graphs Extensible Representation). Two case studies demonstrate our new approach's effectiveness. The first one computes a dynamic graph's vertices' clustering coefficients. We show that incremental updates are far more efficient than global recomputation. Within this kernel, we compare three methods for dynamically updating local clustering coefficients: a brute-force local recalculation, a sorting algorithm, and our new approximation method using a Bloom filter. On 32 processors of a Cray XMT with a synthetic scale-free graph of 2^{24} ≈ 16 million vertices and 2^{29} ≈ 537 million edges, the brute-force method processes a mean of over 50000 updates per second, while our Bloom filter approaches 200000 updates per second. The second case study monitors a global feature, a dynamic graph's connected components. We use similar algorithmic ideas as before to exploit the parallelism in the problem and provided by the hardware architecture. On a 16 million vertex graph, we obtain rates of up to 240000 updates per second on 32 processors of a Cray XMT. For the large scale-free graphs typical in our applications, our implementation uses novel batching techniques that exploit the scale-free nature of the data and run over three times faster than prior methods. Our new framework is the first to handle real-world data rates, opening the door to higher-level analytics such as community and anomaly detection.
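To make the clustering-coefficient update concrete: inserting an edge (u, v) closes one new triangle for every common neighbor of u and v, so the update reduces to counting a neighborhood intersection. The sketch below contrasts an exact count with a Bloom-filter approximation. It is an illustration under assumptions, not the STINGER implementation: the class and function names, the filter size, and the hash multipliers are all hypothetical choices.

```python
class BloomFilter:
    """Small Bloom filter over non-negative integer vertex IDs."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit set

    def _positions(self, key):
        # Double hashing; the multipliers are arbitrary illustrative values.
        h1 = (key * 2654435761) % (1 << 32)
        h2 = ((key * 40503) % (1 << 32)) | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def __contains__(self, key):
        return all((self.bits >> p) & 1 for p in self._positions(key))


def build_adjacency(edges):
    """Build a symmetric adjacency map (vertex -> set of neighbors)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def new_triangles_exact(adj, u, v):
    """Triangles closed by inserting edge (u, v): the common neighbors."""
    return len(adj[u] & adj[v])


def new_triangles_bloom(adj, u, v, num_bits=1024):
    """Approximate the same count; false positives can only overcount."""
    summary = BloomFilter(num_bits)
    for w in adj[u]:
        summary.add(w)
    return sum(1 for w in adj[v] if w in summary)
```

A Bloom filter has no false negatives, so the approximate count is never below the exact one. A streaming setting would presumably keep one filter per vertex and update it as edges arrive rather than rebuilding it for each query as this sketch does.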
Analyzing static snapshots of massive, graph-structured data cannot keep pace with the growth of social networks, financial transactions, and other valuable data sources. Current state-of-the-art industrial methods analyze these streaming sources using only simple, aggregate metrics. There are few existing scalable algorithms for monitoring complex global quantities like decomposition into community structure. Using our framework STING, we present the first known parallel algorithm specifically for monitoring communities in this massive, streaming, graph-structured data. Our algorithm performs incremental re-agglomeration rather than starting from scratch after each batch of changes, reducing the problem's size to that of the change rather than the entire graph. We analyze our initial implementation's performance on multithreaded platforms for execution time and latency. On an Intel-based multithreaded platform, our algorithm handles up to 100 million updates per second on social networks with one to 30 million edges, providing a speed-up from 4× to 3700× over statically recomputing the decomposition after each batch of changes. Possibly because of our artificial graph generator, resulting communities' modularity varies little from the initial graph.
Emerging real-world graph problems include detecting community structure in large social networks, improving the resilience of the electric power grid, and detecting and preventing disease in human populations. We discuss the opportunities and challenges in massive data-intensive computing for applications in social network analysis, genomics, and security. The explosion of real-world graph data poses substantial challenges for software, hardware, algorithms, and application experts.
Analyzing static snapshots of massive, graph-structured data cannot keep pace with the growth of social networks, financial transactions, and other valuable data sources. Our software framework, STING (Spatio-Temporal Interaction Networks and Graphs), uses a scalable, high-performance graph data structure to enable these applications. STING supports fast insertions, deletions, and updates on graphs with semantic information and skewed degree distributions. STING achieves large speed-ups over parallel, static recomputation on both common multicore and specialized multithreaded platforms.
The DARPA High Productivity Computing Systems (HPCS) program has been focused on providing a new generation of economically viable high productivity computing systems for national security, scientific, industrial and commercial applications. This program was unique because it focused on system productivity that was defined to include enhancing performance, programmability, portability, usability, manageability and robustness of systems as opposed to just being focused on one execution time performance metric. The BOF is for anyone interested in learning about the two HPCS systems and how productivity in High Performance Computing has been enhanced.
The current research focus on “big data” problems highlights the scale and complexity of analytics required and the high rate at which data may be changing. In this paper, we present our high performance, scalable and portable software, Spatio-Temporal Interaction Networks and Graphs Extensible Representation (STINGER), that includes a graph data structure that enables these applications. Key attributes of STINGER are fast insertions, deletions, and updates on semantic graphs with skewed degree distributions. We demonstrate a process of algorithmic and architectural optimizations that enable high performance on the Cray XMT family and Intel multicore servers. Our implementation of STINGER on the Cray XMT processes over 3 million updates per second on a scale-free graph with 537 million edges.
Emerging real-world graph problems include detecting community structure in large social networks, improving the resilience of the electric power grid, and detecting and preventing disease in human populations. The volume and richness of data, combined with its rate of change, render monitoring properties at scale by static recomputation infeasible. We approach these problems with massive, fine-grained parallelism across different shared memory architectures both to compute solutions and to explore the sensitivity of these solutions to natural bias and omissions within the data.
The volume of existing graph-structured data requires improved parallel tools and algorithms. Finding communities, smaller subgraphs more densely connected within the subgraph than to the rest of the graph, plays a role both in developing new parallel algorithms and in opening smaller portions of the data to current analysis tools. We improve performance of our parallel community detection algorithm by 20% on the massively multithreaded Cray XMT, evaluate its performance on the next-generation Cray XMT2, and extend its reach to Intel-based platforms with OpenMP. To our knowledge, not only is this the first massively parallel community detection algorithm but also the only such algorithm that achieves excellent performance and good parallel scalability across all these platforms. Our implementation analyzes a moderate-sized graph with 105 million vertices and 3.3 billion edges in around 500 seconds on a four-processor, 80-logical-core Intel-based system and 1100 seconds on a 64-processor Cray XMT2.
Analyzing static snapshots of massive, graph-structured data cannot keep pace with the growth of social networks, financial transactions, and other valuable data sources. We introduce a framework, STING (Spatio-Temporal Interaction Networks and Graphs), and evaluate its performance on multicore, multisocket Intel(R)-based platforms. STING achieves rates of around 100000 edge updates per second on large, dynamic graphs with a single, general data structure. We achieve speed-ups of up to 1000× over parallel static computation, improve monitoring a dynamic graph's connected components, and show an exact algorithm for maintaining local clustering coefficients performs better on Intel-based platforms than our earlier approximate algorithm.
Tackling the current volume of graph-structured data requires parallel tools. We extend our work on analyzing such massive graph data with a massively parallel algorithm for community detection that scales to current data sizes, clustering a real-world graph of over 100 million vertices and over 3 billion edges in under 500 seconds on a four-processor Intel E7-8870-based server. Our algorithm achieves moderate parallel scalability without sacrificing sequential operational complexity. Community detection partitions a graph into subgraphs more densely connected within the subgraph than to the rest of the graph. We take an agglomerative approach similar to Clauset, Newman, and Moore’s sequential algorithm, merging pairs of connected intermediate subgraphs to optimize different graph properties. Working in parallel opens new approaches to high performance. We improve performance of our parallel community detection algorithm on both the Cray XMT2 and OpenMP platforms and adapt our algorithm to the DIMACS Implementation Challenge data set.
Current tools for analyzing graph-structured data and semantic networks focus on static graphs. Our STING package tackles analysis of streaming graphs like today's social networks and communication tools. STING maintains a massive graph under changes while coordinating analysis kernels to achieve analysis at real-world data rates. We show examples of local metrics like clustering coefficients and global metrics like connected components and agglomerative clustering. STING supports parallel Intel architectures as well as the Cray XMT.
Graph-structured data in social networks, finance, network security, and other areas not only are massive but also are under continual change. These changes often are scattered across the graph. Repeating complex global analyses on massive snapshots to capture only what has changed is inefficient. We discuss analysis algorithms for streaming graph data that maintain both local and global metrics. We extract parallelism from both analysis kernel and graph data to scale performance to real-world sizes.
An increasingly fast-paced, digital world has produced an ever-growing volume of petabyte-sized datasets. At the same time, terabytes of new, unstructured data arrive daily. As the desire to ask more detailed questions about these massive streams has grown, parallel software and hardware have only recently begun to enable complex analytics in this non-scientific space. In this tutorial, we will discuss the open problems facing us with analyzing this "data deluge". We will present algorithms and data structures capable of analyzing spatio-temporal data at massive scale on parallel systems. We will try to understand the difficulties and bottlenecks in parallel graph algorithm design on current systems and will show how multithreaded and hybrid systems can overcome these challenges. We will demonstrate how parallel graph algorithms can be implemented on a variety of architectures using different programming models. The goal of this tutorial is to provide a comprehensive introduction to the field of parallel graph analysis to an audience with a computing background who are interested in participating in research and/or commercial applications of this field. Moreover, we will cover leading-edge technical and algorithmic developments in the field and discuss open problems and potential solutions.
Tackling the current volume of graph-structured data requires parallel tools. We extend our work on analyzing such massive graph data with the first massively parallel algorithm for community detection that scales to current data sizes, scaling to graphs of over 122 million vertices and nearly 2 billion edges in under 7300 seconds on a massively multithreaded Cray XMT. Our algorithm achieves moderate parallel scalability without sacrificing sequential operational complexity. Community detection partitions a graph into subgraphs more densely connected within the subgraph than to the rest of the graph. We take an agglomerative approach similar to Clauset, Newman, and Moore's sequential algorithm, merging pairs of connected intermediate subgraphs to optimize different graph properties. Working in parallel opens new approaches to high performance. On smaller data sets, we find the output's modularity compares well with the standard sequential algorithms.
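Modularity, the quality measure mentioned above, can be computed directly from an edge list and a community assignment. The following sketch uses the standard Newman-Girvan definition for an undirected, unweighted graph without self-loops. It only scores a given partition and is not the agglomerative algorithm itself. The function name and input format are assumptions for illustration.

```python
def modularity(edges, community):
    """Newman-Girvan modularity of a partition.

    edges is a list of (u, v) pairs, each undirected edge listed once.
    community maps each vertex to a community label.
    Q = sum over communities c of  e_c / m - (d_c / (2 m))^2,
    where m is the edge count, e_c the number of edges inside c, and
    d_c the total degree of the vertices in c.
    """
    m = len(edges)
    internal = {}    # community -> number of edges fully inside it
    degree_sum = {}  # community -> sum of degrees of its vertices
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree_sum.items())
```

For two triangles joined by a single edge, placing each triangle in its own community gives Q = 5/14, while placing every vertex in one community gives Q = 0.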
Current online social networks are massive and still growing. For example, Facebook has over 500 million active users sharing over 30 billion items per month. The scale within these data streams has outstripped traditional graph analysis methods. Monitoring requires dynamic analysis rather than repeated static analysis. The massive state behind multiple persistent queries requires shared data structures and not problem-specific representations. We present a framework based on the STINGER data structure that can monitor a global property, connected components, on a graph of 16 million vertices at rates of up to 240000 updates per second on a 32 processor Cray XMT. For very large scale-free graphs, our implementation uses novel batching techniques that exploit the scale-free nature of the data and run over three times faster than prior methods. Our framework handles, for the first time, real-world data rates, opening the door to higher-level analytics such as community and anomaly detection.
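As a simplified picture of monitoring connected components under batched updates, the sketch below handles batches of edge insertions with a union-find structure. This is a swapped-in textbook technique, not the STINGER-based method described above, and it deliberately ignores edge deletions, which are the harder case. The class and method names are hypothetical.

```python
class ComponentTracker:
    """Tracks the number of connected components under edge insertions."""

    def __init__(self, num_vertices):
        self.parent = list(range(num_vertices))
        self.count = num_vertices  # every vertex starts as its own component

    def find(self, v):
        # Follow parent links to the root, halving the path along the way.
        while self.parent[v] != v:
            self.parent[v] = self.parent[self.parent[v]]
            v = self.parent[v]
        return v

    def insert_batch(self, edges):
        """Apply a batch of inserted edges; return the component count."""
        for u, v in edges:
            ru, rv = self.find(u), self.find(v)
            if ru != rv:
                self.parent[ru] = rv
                self.count -= 1
        return self.count
```

Edges whose endpoints already share a component leave the count unchanged, so each batch costs work roughly proportional to its own size rather than to the size of the graph.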
Analyzing massive social networks challenges both high-performance computers and human understanding. These massive networks cannot be visualized easily, and their scale makes applying complex analysis methods computationally expensive. We present a region-growing method for finding a smaller, more tractable subgraph, a community, given a few example seed vertices. Unlike existing work, we focus on a small number of seed vertices, from two to a few dozen. We also present the first comparison between five algorithms for expanding a small seed set into a community. Our comparison applies these algorithms to an R-MAT generated graph component with 240 thousand vertices and 32 million edges and evaluates the community size, modularity, Kullback-Leibler divergence, conductance, and clustering coefficient. We find that our new algorithm with a local modularity maximizing heuristic based on Clauset, Newman, and Moore performs very well when the output is limited to 100 or 1000 vertices. When run without a vertex size limit, a heuristic from McCloskey and Bader generates communities containing around 60% of the graph's vertices and having a small conductance and modularity appropriate to the result size. A personalized PageRank algorithm based on Andersen, Lang, and Chung also performs well with respect to our metrics.
Social networks produce an enormous quantity of data. Facebook consists of over 400 million active users sharing over 5 billion pieces of information each month. Analyzing this vast quantity of unstructured data presents challenges for software and hardware. We present GraphCT, a Graph Characterization Toolkit for massive graphs representing social network data. On a 128-processor Cray XMT, GraphCT estimates the betweenness centrality of an artificially generated (R-MAT) 537 million vertex, 8.6 billion edge graph in 55 minutes. We use GraphCT to analyze public data from Twitter, a microblogging network. Twitter's message connections appear primarily tree-structured as a news dissemination system. Within the public data, however, are clusters of conversations. Using GraphCT, we can rank actors within these conversations and help analysts focus attention on a much smaller data subset.
We present a new approach for parallel massive graph analysis of streaming, temporal data with a dynamic and extensible representation. Handling the constant stream of new data from health care, security, business, and social network applications requires new algorithms and data structures. We examine data structure and algorithm trade-offs that extract the parallelism necessary for high-performance updating analysis of massive graphs. Static analysis kernels often rely on storing input data in a specific structure. Maintaining these structures for each possible kernel with high data rates incurs a significant performance cost. A case study computing clustering coefficients on a general-purpose data structure demonstrates incremental updates can be more efficient than global recomputation. Within this kernel, we compare three methods for dynamically updating local clustering coefficients: a brute-force local recalculation, a sorting algorithm, and our new approximation method using a Bloom filter. On 32 processors of a Cray XMT with a synthetic scale-free graph of 2^{24} ≈ 16 million vertices and 2^{29} ≈ 537 million edges, the brute-force method processes a mean of over 50000 updates per second and our Bloom filter approaches 200000 updates per second.
Solving a square linear system Ax=b often is considered a black box. It's supposed to "just work," and failures often are blamed on the original data or subtleties of floating-point. Now that we have an abundance of cheap computations, however, we can do much better. A little extra precision in just the right places produces accurate solutions cheaply or demonstrates when problems are too hard to solve without significant cost. This talk will outline the method, iterative refinement with a new twist; the benefits, small backward and forward errors; and the trade-offs and unexpected benefits.
The Householder reflections used in LAPACK's QR factorization leave positive and negative real entries along R's diagonal. This is sufficient for most applications of QR factorizations, but a few require that R have a nonnegative diagonal. This note describes a new Householder generation routine to produce a nonnegative diagonal. Additionally, we find that scanning for trailing zeros in the generated reflections leads to large performance improvements when applying reflections with many trailing zeros. Factoring low-profile matrices, those with nonzero entries mostly near the diagonal (e.g., band matrices), now requires far fewer operations. For example, QR factorization of matrices with profile width b that are stored densely in an n×n matrix improves from O(n^{3}) to O(n^{2}+nb^{2}). These routines are in LAPACK 3.2.
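One way to picture a reflector with a nonnegative result is the sketch below. It builds H = I - tau v v^T with v[0] = 1 so that H x = beta e_1 with beta = ||x|| >= 0, and it avoids cancellation when x[0] > 0 by using the identity x[0] - beta = -||x[1:]||^2 / (x[0] + beta). This is an illustration only and is not the LAPACK routine: it omits the scaling a production routine needs to avoid overflow and underflow, and the function names are hypothetical.

```python
import math


def householder_nonneg(x):
    """Return (beta, v, tau) with (I - tau v v^T) x = beta e_1, beta >= 0."""
    alpha = x[0]
    tail_sq = sum(t * t for t in x[1:])
    e1 = [1.0] + [0.0] * (len(x) - 1)
    if tail_sq == 0.0:
        if alpha >= 0.0:
            return alpha, e1, 0.0   # x is already beta * e_1; H = I
        return -alpha, e1, 2.0      # flip the sign of the first entry
    beta = math.sqrt(alpha * alpha + tail_sq)
    if alpha > 0.0:
        # alpha - beta would cancel; use the equivalent stable form.
        v0 = -tail_sq / (alpha + beta)
    else:
        v0 = alpha - beta
    v = [1.0] + [t / v0 for t in x[1:]]
    tau = -v0 / beta
    return beta, v, tau


def apply_reflector(v, tau, x):
    """Compute (I - tau v v^T) x without forming the matrix."""
    dot = sum(vi * xi for vi, xi in zip(v, x))
    return [xi - tau * dot * vi for vi, xi in zip(v, x)]
```

For x = (3, 4) and x = (-3, 4) alike, applying the generated reflector yields (5, 0), so the leading entry is nonnegative regardless of the sign of x[0].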
Keywords: LAPACK; QR factorization; Householder reflection; floating-point
We present the algorithm, error bounds, and numerical results for extra-precise iterative refinement applied to overdetermined linear least squares (LLS) problems. We apply our linear system refinement algorithm to Björck’s augmented linear system formulation of an LLS problem. Our algorithm reduces the forward normwise and componentwise errors to O(ɛ) unless the system is too ill conditioned. In contrast to linear systems, we provide two separate error bounds for the solution x and the residual r. The refinement algorithm requires only limited use of extra precision and adds only O(mn) work to the O(mn^{2}) cost of QR factorization for problems of size m-by-n. The extra precision calculation is facilitated by the new extended-precision BLAS standard in a portable way, and the refinement algorithm will be included in a future release of LAPACK and can be extended to the other types of least squares problems.
This standard specifies interchange and arithmetic formats and methods for binary and decimal floating-point arithmetic in computer programming environments. This standard specifies exception conditions and their default handling. An implementation of a floating-point system conforming to this standard may be realized entirely in software, entirely in hardware, or in any combination of software and hardware. For operations specified in the normative part of this standard, numerical results and exceptions are uniquely determined by the values of the input data, sequence of operations, and destination formats, all under user control.
Keywords: IEEE standards; floating point arithmetic; programming; IEEE standard; arithmetic formats; computer programming; decimal floating-point arithmetic; 754-2008; NaN; arithmetic; binary; computer; decimal; exponent; floating-point; format; interchange; number; rounding; significand; subnormal
Linear least squares (LLS) fitting is the most widely used data modeling technique and is included in almost every data analysis system (e.g. spreadsheets). These software systems often give no feedback on the conditioning of the LLS problem or the floating-point calculation errors present in the solution. With limited use of extra precision, we can eliminate these concerns for all but the most ill-conditioned LLS problems. Our algorithm provides either a solution and residual with relatively tiny error or a notice that the LLS problem is too ill-conditioned.
Bisection is one of the most common methods used to compute the eigenvalues of symmetric tridiagonal matrices. Bisection relies on the Sturm count: For a given shift σ, the number of negative pivots in the factorization T - σI = LDL^{T} equals the number of eigenvalues of T that are smaller than σ. In IEEE-754 arithmetic, the value ∞ permits the computation to continue past a zero pivot, producing a correct Sturm count when T is unreduced. Demmel and Li showed [IEEE Trans. Comput., 43 (1994), pp. 983–992] that using ∞ rather than testing for zero pivots within the loop could significantly improve performance on certain architectures. When eigenvalues are to be computed to high relative accuracy, it is often preferable to work with LDL^{T} factorizations instead of the original tridiagonal T. One important example is the MRRR algorithm. When bisection is applied to the factored matrix, the Sturm count is computed from LDL^{T} which makes differential stationary and progressive qds algorithms the methods of choice. While it seems trivial to replace T by LDL^{T}, in reality these algorithms are more complicated: In IEEE-754 arithmetic, a zero pivot produces an overflow followed by an invalid exception (NaN, or “Not a Number”) that renders the Sturm count incorrect. We present alternative, safe formulations that are guaranteed to produce the correct result. Benchmarking these algorithms on a variety of platforms shows that the original formulation without tests is always faster provided that no exception occurs. The transforms see speed-ups of up to 2.6× over the careful formulations. Tests on industrial matrices show that encountering exceptions in practice is rare. This leads to the following design: First, compute the Sturm count by the fast but unsafe algorithm. Then, if an exception occurs, recompute the count by a safe, slower alternative. The new Sturm count algorithms improve the speed of bisection by up to 2× on our test matrices. Furthermore, unlike the traditional tiny-pivot substitution, proper use of IEEE-754 features provides a careful formulation that imposes no input range restrictions.
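The basic Sturm count on T - σI can be sketched with the pivot recurrence d_1 = a_1 - σ and d_i = (a_i - σ) - b_{i-1}^2 / d_{i-1}, counting the negative pivots. The sketch below covers only this unfactored case, not the LDL^{T}-based qds variants or the fast-then-safe fallback design discussed above. Python raises an error on floating-point division by zero instead of returning ∞, so the IEEE-754 behavior is emulated explicitly. That emulation, the treatment of any zero pivot as +0, and the function name are assumptions of this illustration, and T is assumed unreduced (no zero off-diagonal entries).

```python
import math


def sturm_count(diag, offdiag, sigma):
    """Count eigenvalues of a symmetric tridiagonal T smaller than sigma.

    diag holds the n diagonal entries and offdiag the n - 1 off-diagonal
    entries, all assumed nonzero (T unreduced).
    """
    count = 0
    d = diag[0] - sigma
    if d < 0.0:
        count += 1
    for i in range(1, len(diag)):
        b2 = offdiag[i - 1] * offdiag[i - 1]
        # IEEE-754 division gives b2 / (+0) = +inf; Python would raise
        # ZeroDivisionError instead, so that result is emulated here.
        quotient = math.inf if d == 0.0 else b2 / d
        d = (diag[i] - sigma) - quotient
        if d < 0.0:
            count += 1
    return count
```

For the 3×3 matrix with diagonal (2, 2, 2) and off-diagonal (1, 1), whose eigenvalues are 2 - √2, 2, and 2 + √2, the shift σ = 2 produces a zero first pivot. The following pivot becomes -∞ and is counted, giving the correct count of one eigenvalue below 2.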
The purpose of this document is to facilitate contributions to LAPACK and ScaLAPACK by documenting their design and implementation guidelines. The long-term goal is to provide guidelines for both LAPACK and ScaLAPACK. However, the parallel ScaLAPACK code has more open issues, so this document primarily concerns LAPACK.
We present the design and testing of an algorithm for iterative refinement of the solution of linear equations where the residual is computed with extra precision. This algorithm was originally proposed in 1948 and analyzed in the 1960s as a means to compute very accurate solutions to all but the most ill-conditioned linear systems. However, two obstacles have until now prevented its adoption in standard subroutine libraries like LAPACK: (1) There was no standard way to access the higher precision arithmetic needed to compute residuals, and (2) it was unclear how to compute a reliable error bound for the computed solution. The completion of the new BLAS Technical Forum Standard has essentially removed the first obstacle. To overcome the second obstacle, we show how the application of iterative refinement can be used to compute an error bound in any norm at small cost and use this to compute both an error bound in the usual infinity norm, and a componentwise relative error bound.
LAPACK and ScaLAPACK are widely used software libraries for numerical linear algebra. There have been over 68M web hits at www.netlib.org for the associated libraries LAPACK, ScaLAPACK, CLAPACK and LAPACK95. LAPACK and ScaLAPACK are used to solve leading edge science problems and they have been adopted by many vendors and software providers as the basis for their own libraries, including AMD, Apple (under Mac OS X), Cray, Fujitsu, HP, IBM, Intel, NEC, SGI, several Linux distributions (such as Debian), NAG, IMSL, the MathWorks (producers of MATLAB), Interactive Supercomputing, and PGI. Future improvements in these libraries will therefore have a large impact on users.
For sparse LU factorization, dynamic pivoting tightly couples symbolic and numerical computation. Dynamic structural changes limit parallel scalability. Demmel and Li use static pivoting in distributed SuperLU for performance, but intentionally perturbing the input may lead silently to erroneous results. Are there experimentally stable static pivoting heuristics that lead to a dependable direct solver? The answer is currently a qualified yes. Current heuristics fail on a few systems, but all failures are detectable.
We are planning new releases of the widely used LAPACK and ScaLAPACK numerical linear algebra libraries. Based on an ongoing user survey (http://www.netlib.org/lapack-dev) and research by many people, we are proposing the following improvements: faster algorithms (including better numerical methods, memory hierarchy optimizations, parallelism, and automatic performance tuning to accommodate new architectures), more accurate algorithms (including better numerical methods and use of extra precision), expanded functionality (including updating and downdating, new eigenproblems, etc., and putting more of LAPACK into ScaLAPACK), and improved ease of use (friendlier interfaces in multiple languages). To accomplish these goals we are also relying on better software engineering techniques and contributions from collaborators at many institutions. This is joint work with Jack Dongarra.
The entire process of creating and executing applications that solve interesting problems with acceptable cost and accuracy involves a complex interaction among hardware, system software, programming environments, mathematical software libraries, and applications software, all mediated by standards for arithmetic, operating systems, and programming environments. This panel will discuss issues arising among these contending points of view, including some raised during the current IEEE 754R standards revision effort.
We present the design and testing of an algorithm for iterative refinement of the solution of linear equations, where the residual is computed with extra precision. This algorithm was originally proposed in the 1960s [6, 22] as a means to compute very accurate solutions to all but the most ill-conditioned linear systems of equations. However, two obstacles have until now prevented its adoption in standard subroutine libraries like LAPACK: (1) There was no standard way to access the higher precision arithmetic needed to compute residuals, and (2) it was unclear how to compute a reliable error bound for the computed solution. The completion of the new BLAS Technical Forum Standard [5] has recently removed the first obstacle. To overcome the second obstacle, we show how a single application of iterative refinement can be used to compute an error bound in any norm at small cost, and use this to compute both an error bound in the usual infinity norm and a componentwise relative error bound. We report extensive test results on over 6.2 million matrices of dimension 5, 10, 100, and 1000. As long as a normwise (resp. componentwise) condition number computed by the algorithm is less than 1 / (max{10, √n} ⋅ ɛ_{w}), the computed normwise (resp. componentwise) error bound is at most 2 max{10, √n} ⋅ ɛ_{w}, and indeed bounds the true error. Here, n is the matrix dimension and ɛ_{w} is single precision roundoff error. For worse conditioned problems, we get similarly small correct error bounds in over 89.4% of cases.
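The basic refinement loop (factor once, compute the residual in higher precision, solve for a correction, update) can be sketched in plain Python. This is only an illustration under stated assumptions: the working precision is Python's double rather than single, the "extra precision" is exact rational arithmetic from `fractions.Fraction` rather than the extended-precision BLAS, and the error-bound computation described in the abstract is not attempted. All function names are hypothetical.

```python
from fractions import Fraction


def lu_factor(A):
    """LU factorization with partial pivoting in working (double) precision."""
    n = len(A)
    LU = [row[:] for row in A]
    perm = list(range(n))
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(LU[i][k]))
        LU[k], LU[p] = LU[p], LU[k]
        perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            LU[i][k] /= LU[k][k]
            for j in range(k + 1, n):
                LU[i][j] -= LU[i][k] * LU[k][j]
    return LU, perm


def lu_solve(LU, perm, b):
    """Solve with the stored factors: permute, forward-substitute, back-substitute."""
    n = len(LU)
    y = [b[perm[i]] for i in range(n)]
    for i in range(n):
        for j in range(i):
            y[i] -= LU[i][j] * y[j]
    for i in reversed(range(n)):
        for j in range(i + 1, n):
            y[i] -= LU[i][j] * y[j]
        y[i] /= LU[i][i]
    return y


def residual_extra(A, x, b):
    """Compute r = b - A x exactly in rationals, then round once to double."""
    return [
        float(Fraction(bi) - sum(Fraction(aij) * Fraction(xj) for aij, xj in zip(row, x)))
        for row, bi in zip(A, b)
    ]


def refine(A, b, max_iter=10):
    """Iterative refinement: reuse one factorization for every correction step."""
    LU, perm = lu_factor(A)
    x = lu_solve(LU, perm, b)
    for _ in range(max_iter):
        r = residual_extra(A, x, b)
        dx = lu_solve(LU, perm, r)
        x = [xi + di for xi, di in zip(x, dx)]
        # Stop once the correction is negligible relative to the solution.
        if max(abs(d) for d in dx) <= 2.0 ** -53 * max(abs(v) for v in x):
            break
    return x


A = [[4.0, -2.0, 1.0], [3.0, 6.0, -4.0], [2.0, 1.0, 8.0]]
b = [3.0, 3.0, 28.0]  # chosen so that the exact solution is [1, 2, 3]
print(refine(A, b))
```

Each correction step costs only a residual and a pair of triangular solves, because the factorization is computed once and reused.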
Increasingly, sparse matrix applications produce matrices too large for a single computer's memory. Distributed, parallel computers provide an avenue around memory limitations, but distributing combinatorial algorithms is historically difficult. We use insights from combinatorial optimization to design loosely coupled algorithms for sparse matrix matching, ordering, and symbolic factorization. These algorithms' performance depends on both problem instance and computer architecture. We investigate these aspects of performance and demonstrate issues that affect distributed combinatorial computing.
Bipartite matching is one of graph theory's workhorses, occurring in the solution or approximation of many problems. Increasingly, applications' data spans multiple memory spaces, but there is little recent experience with distributed matching algorithms. We present a distributed, parallel implementation for weighted bipartite matching based on Bertsekas's auction algorithm. The bidding process finds local matchings while summarizing updates for occasional communication, leading to superlinear speed-ups on some sparse problems and modest performance on others.
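For readers unfamiliar with the auction algorithm, the bidding step can be sketched sequentially in Python. This is a textbook-style, single-process version on a small dense benefit matrix, not the distributed sparse implementation the abstract describes. The name `auction_assignment` and the default `eps` are assumptions of this sketch. For integer benefits, a bid increment below 1/n is the usual condition for the final assignment to be optimal.

```python
def auction_assignment(benefit, eps=None):
    """Maximum-benefit assignment of n persons to n objects by auction.

    benefit[i][j] is the benefit of giving object j to person i.
    Returns (assigned, prices), where assigned[i] is person i's object.
    """
    n = len(benefit)
    if eps is None:
        eps = 1.0 / (n + 1)  # below 1/n, so integer benefits give an optimum
    prices = [0.0] * n
    owner = [None] * n      # owner[j]: person currently holding object j
    assigned = [None] * n   # assigned[i]: object currently held by person i
    unassigned = list(range(n))
    while unassigned:
        i = unassigned.pop()
        # Net value of each object to person i at the current prices.
        values = [benefit[i][j] - prices[j] for j in range(n)]
        best = max(range(n), key=values.__getitem__)
        second = max((values[j] for j in range(n) if j != best), default=values[best])
        # Bid: raise the price until the best object is only eps better
        # than the runner-up for this person.
        prices[best] += values[best] - second + eps
        previous = owner[best]
        if previous is not None:
            assigned[previous] = None
            unassigned.append(previous)
        owner[best] = i
        assigned[i] = best
    return assigned, prices


benefit = [[8, 2], [5, 1]]
assigned, prices = auction_assignment(benefit)
print(assigned, sum(benefit[i][j] for i, j in enumerate(assigned)))
```

In this small example both persons initially prefer object 0, so the first holder is outbid and moves to object 1. Prices only ever rise, which is what makes it possible to summarize and exchange updates infrequently.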
Traditional pivoting during parallel, unsymmetric LU factorization introduces heavy communication and restructuring costs. Possible alternatives include pre-pivoting to place heavy elements along the diagonal and limited pivoting that maintains the factors' structures. Each alternative comes with trade-offs that affect accuracy and performance.
Practical and efficient methods exist for parallelizing the numerical work in sparse matrix calculations. The initial symbolic analysis is now becoming a sequential bottleneck, limiting problems' sizes. One such analysis is the weighted bipartite matching used to achieve scalable, unsymmetric LU factorization in SuperLU. Applying a mathematical optimization algorithm produces a distributed-memory implementation with explicit trade-offs between speed and matching quality. We present accuracy and performance results for this phase alone and in the context of SuperLU.
Floating-point arithmetic is often seen as untrustworthy. We show how manipulating precisions according to the following rules of thumb enhances the reliability of and removes surprises from calculations: Store data narrowly, compute intermediates widely, and derive properties widely. Further, we describe a typing system for floating point that both supports and is supported by these rules. A single type is established for all intermediate computations. The type describes a precision at least as wide as all inputs to and results from the computation. Picking a single type provides benefits to users, compilers, and interpreters. The type system also extends cleanly to encompass intervals and higher precisions.
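The first two rules of thumb (store narrowly, compute intermediates widely) can be illustrated with a small Python sketch. It is not the typing system the abstract proposes. Single precision is emulated by a round trip through the `struct` module, and the names `to_single`, `sum_narrow`, and `sum_wide` are hypothetical.

```python
import struct
from fractions import Fraction


def to_single(x):
    """Round a Python float (double) to the nearest IEEE-754 single."""
    return struct.unpack("f", struct.pack("f", x))[0]


def sum_narrow(values):
    """Accumulate in single precision: every intermediate sum is rounded."""
    total = 0.0
    for v in values:
        total = to_single(total + v)
    return total


def sum_wide(values):
    """Accumulate in double precision and round to single once, at the end."""
    total = 0.0
    for v in values:
        total += v
    return to_single(total)


# Data stored narrowly: one hundred thousand copies of single-precision 0.1.
data = [to_single(0.1)] * 100_000
exact = Fraction(data[0]) * len(data)  # exact sum of the stored values

narrow_error = abs(Fraction(sum_narrow(data)) - exact)
wide_error = abs(Fraction(sum_wide(data)) - exact)
print(float(narrow_error), float(wide_error))
```

Both functions read and return single-precision values. Only the width of the intermediate accumulator differs, and the wide accumulator leaves just the final rounding as a source of error.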
The fundamental constraint on a networked sensor is its energy consumption, since it may be either impossible or infeasible to replace its energy source. We analyze the power dissipation implications of implementing the network sensor with either a central processor switching between I/O devices or a family of processors, each dedicated to a single device. We present the energy measurements of the current generations of networked sensors, and develop an abstract description of tradeoffs between the two designs.
The Tera Multithreaded Architecture, or MTA, addresses scalable shared memory system design with a different approach: it tolerates latency by providing fast access to multiple threads of execution. The MTA employs a number of radical design ideas: creation of hardware threads (streams) with frequent context switching; full-empty bits for each memory word; a flat memory hierarchy; and deep pipelines. Recent evaluations of the MTA have taken a top-down approach: port applications and application benchmarks, and compare the absolute performance with conventional systems. While useful, these studies do not reveal the effect of the Tera MTA's unique hardware features on an application. We present a bottom-up approach to the evaluation of the MTA via a suite of microbenchmarks to examine in detail the underlying hardware mechanisms and the cost of runtime system support for multithreading. In particular, we measure memory, network, and instruction latencies; memory bandwidth; the cost of low-level synchronization via full-empty bits; overhead for stream management; and the effects of software pipelining. These data should provide a foundation for performance modeling on the MTA. We also present results for list ranking on the MTA, an application which has traditionally been difficult to scale on conventional parallel systems.
SIMD parallel computers have been employed for image related applications since their inception. They have been leading the way in improving processing speed for those applications. However, current parallel programming technologies have not kept pace with the performance growth and cost decline of parallel hardware. A highly usable parallel software development environment is needed. This chapter presents a computing environment that integrates a SIMD mesh architecture with image algebra for high-performance image processing applications. The environment describes parallel programs through a machine-independent, retargetable image algebra object library that supports SIMD execution on the Lockheed Martin PAL-I parallel computer. Program performance on this machine is improved through on-the-fly execution analysis and scheduling. We describe the relevant elements of the system structure, outline the scheme for execution analysis, and provide examples of the current cost model and scheduling system.
SIMD parallel systems have been employed for image processing and computer vision applications since their inception. This paper describes a system in which parallel programs are implemented using a machine-independent, retargetable object library that provides SIMD execution on the Lockheed Martin PAL-I SIMD parallel processor. Programs' performance on this machine is improved through on-the-fly execution analysis and scheduling. We describe the relevant elements of the system structure, the general scheme for execution analysis, and the current cost model for scheduling.