Rapids for the analysis of huge amounts of data

Boosting the performance of forecast models and data analytics

At the GTC Europe 2018 conference in Munich, Nvidia presented Rapids, a GPU acceleration platform for data science and machine learning that has been embraced not only by industry leaders but also by the open source community and start-ups.

The open source software Rapids makes it possible to analyze huge amounts of data and to generate precise forecasts at high speed. Nvidia promises data scientists an enormous performance boost that lets them tackle highly complex business challenges, such as predicting credit card fraud, managing retail inventory, or forecasting customer purchasing behavior.

Analysts estimate the data science and machine learning server market at $20 billion annually; combined with scientific analysis and deep learning, that brings the value of the high-performance computing market to about $36 billion. Given the growing importance of GPUs in data analysis, a number of companies are supporting Rapids, from open source pioneers such as Databricks and Anaconda to IT corporations such as Hewlett Packard Enterprise, IBM, Oracle, SAP and SAS. Rapids, which is available for download on GitHub, will improve data usage immensely, believes Jeremy King, executive vice president and chief technology officer at Walmart. "This means that the most complex models can be scaled and provide even more precise forecasts."

"Data analytics and machine learning are the largest segments of the high-performance computing market that have not yet been accelerated," said Jensen Huang, founder and CEO of Nvidia, who introduced Rapids in his keynote at the GPU Technology Conference. "The largest corporations in the world use algorithms that use machine learning on a myriad of servers to identify complex patterns in their market and business environment and then optimize their businesses very quickly with precise predictions."

Based on the Cuda programming technology developed by Nvidia, which allows parts of a program to be processed by the GPU, and in close cooperation with partners and the open source community, Nvidia engineers have, according to Huang, spent the past two years developing the new GPU acceleration platform, which also integrates with the most popular data science libraries and workflows.

Turbo for machine learning

"We provide machine learning with a turbo-charge, as we have already done with deep learning," said the Nvidia boss. Rapids offers a number of open source libraries for fast analysis based on GPU, for machine learning and soon also for data visualization.
This gives scientists all the tools they need to run the entire data science pipeline on GPUs, Huang promises. The first benchmarks, in which the XGBoost algorithm for machine learning is used for training on one of the new DGX-2 systems, show accelerations by a factor of 50 compared to pure CPU systems. This allows data scientists to determine the typical training times depending on the size of the data set from days to hours or from hours to minutes.

Close collaboration with the open source community

Rapids builds on popular open source projects (including Apache Arrow, Pandas, and Scikit-Learn) by adding GPU acceleration to the Python data science toolchain. To integrate additional libraries and machine learning functions into Rapids, Nvidia is working with open source partners such as Anaconda, BlazingDB, Databricks, Quansight and Scikit-Learn, as well as Wes McKinney, head of Ursa Labs and creator of Apache Arrow and Pandas, the fastest growing Python data science library. "Nvidia's collaboration with Ursa Labs will accelerate the pace of innovation in Arrow's core libraries," said McKinney. That will speed up analytics and feature engineering considerably.
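Because Rapids deliberately mirrors the pandas API, typical data-preparation code ports to the GPU with minimal changes. The sketch below uses plain pandas with made-up example data; on a Rapids system, the cuDF library is designed to accept the same DataFrame calls while executing them on the GPU.

```python
# Pandas-style aggregation; cuDF is designed to offer the same API on the GPU.
import pandas as pd

# Illustrative retail-style data (column names and values are invented).
sales = pd.DataFrame({
    "store": ["A", "A", "B", "B", "B"],
    "revenue": [100.0, 150.0, 80.0, 120.0, 60.0],
})

# Group by store and sum revenue: the same groupby/agg calls exist in cuDF.
per_store = sales.groupby("store")["revenue"].sum().reset_index()
total = float(per_store["revenue"].sum())
```

This API parity is the point of building on Pandas and Apache Arrow: the columnar Arrow format gives the GPU libraries a common in-memory representation, so data does not have to be converted between pipeline stages.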

Broad acceptance is important

To ensure widespread acceptance, Rapids also integrates with Apache Spark, the open source framework for analytics and data science. Matei Zaharia, Databricks co-founder and chief technologist and the original developer of Apache Spark, is excited about the potential for accelerating Spark workloads. "We have several ongoing projects to better integrate Spark with native accelerators, including Apache Arrow support and GPU scheduling through Project Hydrogen," says Zaharia. Rapids is a new way to scale data science and AI workloads.

Scott Collison, CEO of Anaconda, is proud to have helped develop these new features, which will be available to the community of seven million users of the Anaconda distribution through the public package repository. Rapids will also be integrated into Anaconda Enterprise in order to “give IT organizations of all sizes the opportunity to accelerate data science and AI workflows” on the basis of the DGX supercomputers.

Nima Negahban, co-founder and CTO of Kinetica, sees the new suite of open source libraries as a major improvement "because data scientists can leverage the power of the GPU across their model development toolchain." This simplifies training considerably and improves model accuracy without data scientists having to invest much effort in redesigning their models.

"SAP has worked closely with Nvidia over the past few years to use the acceleration of GPUs for many solutions," says Jürgen Müller, Chief Innovation Officer at SAP. B. on Leonardo Machine Learning. One will explore the possibilities that Rapids offers in charging the data science pipelines onto GPUs. This is an important step in bringing data science and machine learning into companies with Leonardo and Hana.

Cooperation with IBM

IBM, which pairs Nvidia GPUs with its Power9 processors, wants to incorporate the new open source software into its enterprise data science platform for local, hybrid and multicloud environments, so that data scientists can use it regardless of their preferred deployment model. With the addition of Rapids to the IBM portfolio, Bob Picciano, Senior Vice President of IBM Cognitive Systems, wants to “aggressively expand the performance limits of AI for our customers” and also draw on open source machine learning software such as Apache Arrow, Pandas and Scikit-Learn. Immediate, comprehensive ecosystem support for Rapids comes from key open source contributors such as Anaconda, BlazingDB, Graphistry, NERSC, PyData, Inria and Ursa Labs. According to Picciano, the following is planned:

  • PowerAI for Power9 to give data scientists more options with new open source machine learning and analysis libraries. The workloads are demonstrably accelerated by the joint engineering of Nvidia and IBM on Power9, in particular by the built-in NVLink and Tesla Tensor Core GPUs. PowerAI is IBM's software layer for optimizing today's analytics and AI workloads on heterogeneous computing systems.
  • Watson Studio and Watson Machine Learning to take advantage of GPUs so that data scientists and AI developers can build, deploy and run their AI applications in a multicloud environment faster than with CPU-only implementations
  • IBM Cloud: Users who opt for GPU-equipped machines can use the accelerated machine learning libraries and analysis functions in Rapids for their cloud applications

At the beginning of this year, IBM set a record for machine learning, exceeding the previous record holder by a factor of 46. Using Snap ML, a machine learning algorithm developed by IBM Research that runs on AC922 servers with Power9 processors and Tesla V100 Tensor Core GPUs, the researchers trained an ML system for logistic regression classification in 91.5 seconds, based on a data set published by Criteo Labs with over 4 billion training examples.
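For readers unfamiliar with the workload behind that record: logistic regression is a linear classifier trained by iteratively minimizing the log-loss. The following is a tiny self-contained NumPy sketch of that training loop, not Snap ML's actual implementation; the record run applied this kind of optimization, heavily parallelized across GPUs, to over 4 billion Criteo examples rather than the few hundred synthetic points used here.

```python
# Plain-NumPy logistic regression via gradient descent (illustrative only;
# Snap ML's GPU-accelerated solver is far more sophisticated).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logreg(X, y, lr=0.1, epochs=200):
    """Minimize the log-loss by full-batch gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    n = len(y)
    for _ in range(epochs):
        p = sigmoid(X @ w + b)        # predicted probabilities
        grad_w = X.T @ (p - y) / n    # gradient of the log-loss w.r.t. weights
        grad_b = float(np.mean(p - y))
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Synthetic, linearly separable data (invented for this sketch).
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = (X @ np.array([2.0, -1.0, 0.5]) > 0).astype(float)

w, b = train_logreg(X, y)
acc = float(((sigmoid(X @ w + b) > 0.5) == (y > 0.5)).mean())
```

At billions of examples, each pass over the data in this loop becomes the bottleneck, which is exactly where the Power9/V100 NVLink bandwidth that Snap ML exploits pays off.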

Open source pioneer Red Hat on board

The open source pioneer Red Hat, which IBM has agreed to acquire for $34 billion, is also cooperating with Nvidia to bring its Linux and Kubernetes platforms, such as Openshift, to Nvidia's GPUs. The Red Hat products are already available on the DGX-1 supercomputer. And in high-performance computing, IBM, Red Hat and Nvidia deliver the technology and expertise behind two of the world's fastest supercomputers, Summit and Sierra.

According to Chris Wright, CTO at Red Hat, the goal of the cooperation is to trigger a new wave of open innovation around AI, deep learning and data science in the data center. The certification of Red Hat Enterprise Linux, which also runs on the Power9 servers, for the DGX-1 is an important step in this direction. It forms the basis for extending support to the rest of Red Hat's portfolio, including the Kubernetes-based container platform Openshift, which runs on and is jointly supported for Nvidia's AI supercomputers. Software vendors and IT managers can move existing Linux applications to the new Nvidia systems without having to make changes.

In addition to certifying Red Hat Enterprise Linux for DGX-1 systems, the companies plan to collaborate on other open source initiatives, including:

  • Containers in the Nvidia GPU Cloud with Openshift: Red Hat and Nvidia plan to provide NGC containers that deliver GPU-optimized software tools for AI and HPC based on Red Hat technologies; joint customers can thus take full advantage of the GPUs
  • Heterogeneous Memory Management (HMM): Both manufacturers want to jointly advance the upstream development of the HMM feature. This kernel function allows devices such as GPUs to access the contents of the system's main memory and mirror them in their own memory, which significantly increases the performance of GPU applications.

According to Wright, the growing interest in AI and analytics workloads requires a different approach to enterprise computing than before, which Nvidia is already addressing at the architecture level. With the Linux operating system and the Openshift container platform, the GPU hardware can be complemented by enterprise-ready software innovations. "These help users drive new workloads," said Wright. "At the same time, they retain the stability, reliability and familiarity that their production systems are used to."
