Advanced Performance and Massive Scaling Driven by AI and DL
Artificial Intelligence (AI) is rapidly becoming an essential business and research tool, giving organizations valuable insights into their data and doing so with unprecedented velocity and accuracy. The attraction of AI is its ability to facilitate breakthrough innovations across a variety of fields while delivering significant acceleration in time to insight.
Given the vast possibilities and potential advantages it can unleash, tremendous resources are being invested by enterprises, universities and government organizations to further develop and benefit from AI as well as Deep Learning (DL) applications. With AI technology, real-time fraud detection protects our shopping and internet transactions, natural language translators remove language barriers, augmented reality delivers a far richer entertainment experience, drug discovery gets accelerated, personalized medicine and remote health diagnostics are fully enabled, and autonomous vehicles can circulate unassisted in our cities.
Many AI and DL applications are built upon artificial neural networks (ANNs) that are trained to extract valuable information from the massive data sets presented to them. A specialized AI software framework will typically tune millions of parameters against billions or trillions of samples, adjusting the connections between separate layers of nodes and thereby establishing a data flow that yields valuable conclusions and powerful results.
At the core of AI is the training process, where scale and complexity can expand greatly depending on the software framework employed, the strategies selected, the types and quantity of available data, and the scope of capability desired in the neural network. Achieving reliable and quick inference requires a highly iterative training process in which neural network candidates, built from multiple variations of the hyperparameters, are made to run through many complete passes of the data sets. Each such pass, referred to as an epoch, helps grow and refine the neural network in training.
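The idea of an epoch can be illustrated with a minimal sketch. It assumes a toy one-variable linear model and synthetic data in place of a real neural network and data set; the names make_data and train, and the chosen learning rate, are hypothetical and used only for illustration.

```python
import random

def make_data(n=200, seed=0):
    """Synthetic samples drawn from y = 3x + 0.5 (a stand-in for a real data set)."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, 3.0 * x + 0.5) for x in xs]

def train(data, learning_rate, epochs):
    """Fit y = w*x + b with stochastic gradient descent.

    Each iteration of the outer loop is one epoch: a complete pass
    over every sample in the data set.
    """
    w, b = 0.0, 0.0
    history = []  # mean squared error measured after each epoch
    for _ in range(epochs):
        for x, y in data:
            err = (w * x + b) - y
            w -= learning_rate * err * x
            b -= learning_rate * err
        loss = sum(((w * x + b) - y) ** 2 for x, y in data) / len(data)
        history.append(loss)
    return w, b, history

data = make_data()
w, b, history = train(data, learning_rate=0.05, epochs=20)
print(f"w={w:.3f} b={b:.3f} first-epoch loss={history[0]:.2e} last-epoch loss={history[-1]:.2e}")
```

The learning rate and the number of epochs here play the role of hyperparameters: changing either one produces a different candidate model from the same data. Note that every epoch re-reads the full data set, which is why the I/O demand of training grows with both data volume and epoch count.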
This training process is essential in achieving the desired AI and DL result, but it requires immense I/O, data storage, and computational resources. This is especially apparent when it comes to unstructured and highly diverse data sets. There is a significant difference between the amount of I/O and storage required to analyze structured data and the amount required for diverse, unstructured data sets. Take two very different AI workloads in retail as an example. Generally, consumer behavior data can be neatly categorized into relatively small databases, which makes consumer behavior analytics a good candidate for AI applications running in the cloud. On the other hand, frictionless retail and AI-enabled checkout-free stores rely on an array of sensors and on data from video, RFID and other methods of data acquisition to power their analytics. In this case, the need for real-time ingest and analysis of large volumes of data speaks to the capabilities of a local AI-capable infrastructure.
Parallelizing the training process accelerates the execution of multiple candidate instances, enabling the simultaneous creation of an ensemble of trained networks, which can then be quickly compared and narrowed down to an optimal candidate.
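Training several candidates side by side and keeping the best one might look like the following sketch. It is illustrative only: the toy linear model, the hyperparameter grid, and the names train_candidate and DATA are assumptions, and a thread pool is used purely to keep the example self-contained. In CPython, threads will not speed up this CPU-bound loop; a real deployment would typically dispatch each candidate to its own GPU or process.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product
import random

def make_data(n=200, seed=0):
    """Synthetic samples drawn from y = 3x + 0.5 (hypothetical stand-in data)."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, 3.0 * x + 0.5) for x in xs]

DATA = make_data()

def train_candidate(params):
    """Train one candidate and report its final mean squared error."""
    learning_rate, epochs = params
    w, b = 0.0, 0.0
    for _ in range(epochs):          # one epoch = one full pass over DATA
        for x, y in DATA:
            err = (w * x + b) - y
            w -= learning_rate * err * x
            b -= learning_rate * err
    loss = sum(((w * x + b) - y) ** 2 for x, y in DATA) / len(DATA)
    return {"learning_rate": learning_rate, "epochs": epochs, "loss": loss}

# Hypothetical hyperparameter grid: every combination is one candidate.
grid = list(product([0.001, 0.01, 0.1], [5, 20]))

# Run the candidates concurrently, then compare them and keep the best.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(train_candidate, grid))

best = min(results, key=lambda r: r["loss"])
print("best candidate:", best)
```

Because every candidate reads the same data set over many epochs, running them concurrently multiplies the read load on the storage layer, which is one reason parallel training stresses I/O as much as compute.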
The AI-Enabled Data Center
As noted, certain AI workloads are putting significant strain on the underlying I/O, storage, compute and network. An AI-enabled data center must be able to concurrently and efficiently service the entire spectrum of activities involved in the AI and DL process, including data ingest, training and inference.
The IT infrastructure supporting an AI-enabled data center must adapt and scale rapidly as data volumes grow, and as application workloads become more intense, complex and diverse. In order to provide more accurate answers, faster, the infrastructure must be efficient and reliable, with the capability to seamlessly and continuously handle transitions between different phases of experimental training and production inference. In short, the IT infrastructure is key to realizing the full potential of AI and DL operations in business and research.
Current enterprise and research data center IT infrastructures are woefully inadequate for the demanding needs of AI and DL. Designed for modest workloads, minimal scalability, limited performance needs and small data volumes, these platforms are highly bottlenecked and lack the fundamental capabilities needed for AI-enabled deployments.
At the same time, breakthrough technologies in processors and storage are acting as catalysts for effective AI data center enablement. Graphics Processing Units (GPUs) deliver acceleration over slower CPUs, while flash-enabled parallel I/O storage provides a significant performance boost over traditional hard-disk-based storage.
GPUs are significantly faster and more scalable than CPUs for the highly parallel computations that dominate AI and DL workloads. Their large number of cores permits massively parallel execution of concurrent threads, which results in faster AI training and quicker inference. As a result, GPUs enable DL applications to deliver more accurate answers significantly faster.
However, in order for GPUs to fulfill their promise of acceleration, data must be processed and delivered to the underlying AI applications with great speed, scalability, and consistently low latencies. This requires a parallel I/O storage platform for performance scalability and real-time data delivery, and flash media for speed.
Data Storage Capabilities: Key to Maximizing AI Benefits
Without the right data storage platform, a GPU-based computing platform is just as bottlenecked and ineffectual as an antiquated non-AI-enabled data center. The proper selection of the data storage platform and its efficient integration into the data center infrastructure are key to eliminating AI bottlenecks and truly accelerating time to insight.
The right data storage system must deliver high throughput, IOPS, and concurrency in order to prevent idling of precious GPU cycles. It must also be flexible and scalable in implementation, and enable efficient handling of a wide breadth of data sizes and types (including highly concurrent random streaming, which is typical of DL data sets).
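A rough back-of-the-envelope estimate shows how storage throughput relates to GPU idling. The sketch below is a simplification under stated assumptions: every sample is read from storage each time it is used (no caching), and a GPU can only stay as busy as the rate at which data arrives. The function names and all of the numbers are hypothetical illustrations, not measurements or vendor figures.

```python
def required_read_throughput(num_gpus, samples_per_sec_per_gpu, bytes_per_sample):
    """Aggregate read rate (bytes/s) needed to keep every GPU supplied with data."""
    return num_gpus * samples_per_sec_per_gpu * bytes_per_sample

def gpu_busy_fraction(delivered_bytes_per_sec, required_bytes_per_sec):
    """Upper bound on the fraction of time the GPUs can do useful work
    when storage delivers less data than they can consume."""
    return min(1.0, delivered_bytes_per_sec / required_bytes_per_sec)

# Hypothetical cluster: 8 GPUs, each consuming 2,000 samples/s of 150 KB each.
required = required_read_throughput(
    num_gpus=8, samples_per_sec_per_gpu=2_000, bytes_per_sample=150_000
)
# Hypothetical storage system that sustains 1.2 GB/s of reads.
busy = gpu_busy_fraction(1_200_000_000, required)

print(f"required: {required / 1e9:.1f} GB/s, GPU busy fraction: {busy:.0%}")
```

Under these assumed figures the GPUs need 2.4 GB/s in aggregate, so a storage system sustaining 1.2 GB/s would leave them idle about half the time. The same arithmetic can be repeated with measured sample sizes and consumption rates to size a storage platform for a given GPU count.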
Properly selected and implemented, such a data storage system will deliver the full potential of GPU computing platforms, accelerating time to insight at any scale and handling every stage of the AI and DL process. It will not only support AI and DL efforts reliably and efficiently but, most importantly, provide a cost-effective approach for facilitating breakthrough innovations.