Amazon recently launched S3 Plugin, a dataset library for Facebook’s PyTorch machine learning framework, designed to help data scientists access datasets stored in Amazon S3 (Amazon Simple Storage Service). Designed for low latency, the S3 Plugin can stream datasets of any size, Amazon says, eliminating the need to provision local storage capacity.
PyTorch is an open source machine learning library based on the Torch library for applications such as computer vision and natural language processing, developed primarily by Facebook’s AI Research Lab. It is free and open source software released under a modified BSD license, with an underlying C++ implementation.
Many deep learning software products are built on PyTorch, including Tesla Autopilot, Uber’s Pyro, and HuggingFace’s Transformers. PyTorch provides two main high-level features:
- Tensor computation (like NumPy) with strong acceleration on graphics processing units (GPUs)
- Deep neural networks built on a tape-based automatic differentiation system
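To illustrate the idea behind the second feature, here is a minimal pure-Python sketch of reverse-mode automatic differentiation, in which each operation is recorded as it runs and the record is replayed backwards to compute gradients. The names Var, tape, and backward are hypothetical and chosen for illustration only; this is not PyTorch's implementation, which works on tensors and is far more general.

```python
class Var:
    """A scalar value that records the operations applied to it."""

    def __init__(self, value, tape):
        self.value = value
        self.grad = 0.0
        self.tape = tape  # shared list acting as the operation record

    def __add__(self, other):
        out = Var(self.value + other.value, self.tape)
        # Local derivatives: d(out)/d(self) = 1, d(out)/d(other) = 1
        self.tape.append((out, [(self, 1.0), (other, 1.0)]))
        return out

    def __mul__(self, other):
        out = Var(self.value * other.value, self.tape)
        # Local derivatives: d(out)/d(self) = other, d(out)/d(other) = self
        self.tape.append((out, [(self, other.value), (other, self.value)]))
        return out


def backward(output):
    """Replay the record in reverse, accumulating gradients (chain rule)."""
    output.grad = 1.0
    for node, parents in reversed(output.tape):
        for parent, local_grad in parents:
            parent.grad += node.grad * local_grad


tape = []
x = Var(3.0, tape)
y = Var(4.0, tape)
z = x * y + x  # z = 3 * 4 + 3
backward(z)
print(z.value, x.grad, y.grad)  # dz/dx = y + 1, dz/dy = x
```

Because the record is built while the program runs, ordinary Python control flow (loops, conditionals) is differentiated without any special syntax, which is one reason this style suits research code.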
Since its release in October 2016, PyTorch has grown rapidly in the data science and developer communities. In 2019, the number of contributors to the platform grew by more than 50% year-over-year to nearly 1,200. A majority of papers at every major AI conference in 2019 were implemented with PyTorch, and citations of PyTorch in papers grew by more than 194% in the first half of 2019, according to an analysis by Research Institute.
"With this feature in the PyTorch Deep Learning container, users can leverage the PyTorch dataset and data loader APIs to work directly with data in S3 without first downloading it in local storage," Amazon wrote in a blog post. The S3 Plugin for PyTorch provides a native experience for using data from Amazon S3 in PyTorch without adding complexity to the code.
The benefits of the S3 Plugin include:
- PyTorch supports two different types of datasets, map-style and iterable-style, and the S3 Plugin for PyTorch gives you the flexibility to use either as you see fit.
- The S3 Plugin can be used to train machine learning models on training data in a variety of formats. It is file-format agnostic: it presents objects in Amazon S3 as binary blobs, and further transformations can be applied to the data received from Amazon S3.
- The S3 Plugin provides a way to shuffle data in memory using ShuffleDataset, or by providing the input parameter shuffle_urls when extending S3IterableDataset.
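The dataset styles and shuffling options in the list above can be sketched in plain Python. The example below is only an illustration of the concepts and does not use the plugin: MapStyleDataset, IterableStyleDataset, shuffle_buffer, fake_fetch, and the bucket name are all hypothetical, and the shuffle_urls parameter here merely borrows the name mentioned above. A map-style dataset supports random access by index, an iterable-style dataset yields samples as a stream, and a bounded buffer can shuffle a stream without holding it all in memory.

```python
import random


def fake_fetch(url):
    """Hypothetical stand-in for downloading an object; returns bytes."""
    return f"contents of {url}".encode()


class MapStyleDataset:
    """Map-style: random access through __getitem__ and __len__."""

    def __init__(self, urls):
        self.urls = list(urls)

    def __len__(self):
        return len(self.urls)

    def __getitem__(self, index):
        url = self.urls[index]
        return url, fake_fetch(url)


class IterableStyleDataset:
    """Iterable-style: samples are produced sequentially by __iter__."""

    def __init__(self, urls, shuffle_urls=False, seed=None):
        self.urls = list(urls)
        self.shuffle_urls = shuffle_urls
        self.seed = seed

    def __iter__(self):
        urls = list(self.urls)
        if self.shuffle_urls:
            # Shuffle only the order in which objects are visited.
            random.Random(self.seed).shuffle(urls)
        for url in urls:
            yield url, fake_fetch(url)


def shuffle_buffer(stream, buffer_size, seed=None):
    """Shuffle a stream using a bounded in-memory buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in stream:
        buffer.append(item)
        if len(buffer) >= buffer_size:
            # Emit a randomly chosen buffered item to make room.
            idx = rng.randrange(len(buffer))
            buffer[idx], buffer[-1] = buffer[-1], buffer[idx]
            yield buffer.pop()
    rng.shuffle(buffer)  # flush whatever is left, in random order
    yield from buffer


urls = [f"s3://example-bucket/train/{i}.jpg" for i in range(8)]
map_ds = MapStyleDataset(urls)
iter_ds = IterableStyleDataset(urls, shuffle_urls=True, seed=0)
shuffled = list(shuffle_buffer(iter_ds, buffer_size=4, seed=0))
```

A larger buffer mixes samples more thoroughly at the cost of memory, which is the usual trade-off when shuffling streamed data.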
The S3 Plugin for PyTorch can transfer data from S3 in parallel and supports streaming data from archive files. Amazon says that because the plugin is an implementation of PyTorch’s internal interfaces, existing code does not need to be modified to work with S3.
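As a rough illustration of why parallel transfer helps, the sketch below fetches several objects concurrently with a thread pool from the Python standard library. It is a generic pattern under stated assumptions, not the plugin's mechanism: fetch_object, fetch_all, and the bucket and shard names are hypothetical, and the download is simulated rather than performed.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_object(url):
    """Hypothetical stand-in for an S3 download; returns fake bytes."""
    return f"contents of {url}".encode()


def fetch_all(urls, max_workers=4):
    """Fetch every URL concurrently, returning results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of the input URLs.
        return list(pool.map(fetch_object, urls))


urls = [f"s3://example-bucket/shard-{i}.tar" for i in range(6)]
blobs = fetch_all(urls)
```

Network downloads spend most of their time waiting, so overlapping several requests generally keeps the training loop better supplied with data than fetching objects one at a time.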
The S3 Plugin for PyTorch improves the ease of use and flexibility of PyTorch, and is available to interested developers through a pre-configured PyTorch Docker image or directly from its GitHub repository.