{"id":9712,"date":"2024-05-16T14:12:21","date_gmt":"2024-05-16T14:12:21","guid":{"rendered":"https:\/\/www.ntspl.co.in\/blog\/?p=9712"},"modified":"2024-09-12T06:41:35","modified_gmt":"2024-09-12T06:41:35","slug":"introducing-dataflux-dataset-for-cloud-storage-to-accelerate-pytorch-ai-training","status":"publish","type":"post","link":"https:\/\/www.ntspl.co.in\/blog\/introducing-dataflux-dataset-for-cloud-storage-to-accelerate-pytorch-ai-training\/","title":{"rendered":"Introducing Dataflux Dataset for Cloud Storage to accelerate PyTorch AI training"},"content":{"rendered":"<div class=\"block-paragraph_advanced\">\n<h3><strong>Introduction<\/strong><\/h3>\n<p>Machine learning (ML) models thrive on massive datasets, and fast data loading is key for cost-effective ML training. We recently launched a PyTorch Dataset abstraction, the <a href=\"https:\/\/github.com\/GoogleCloudPlatform\/dataflux-pytorch\" target=\"_blank\" rel=\"noopener noreferrer\">Dataflux Dataset<\/a>, for accelerating data loading from Google\u2019s Cloud Storage. With small files, Dataflux provides up to 3.5x faster training times than fsspec.<\/p>\n<p>Today\u2019s launch builds on Google\u2019s commitment to open standards, which spans more than two decades of OSS contributions such as TensorFlow, JAX, TFX, MLIR, KubeFlow, and Kubernetes, as well as sponsorship of critical OSS data science initiatives like Project Jupyter and NumFOCUS.<\/p>\n<p>We also validated the Dataflux Dataset on <a href=\"https:\/\/github.com\/argonne-lcf\/dlio_benchmark\" target=\"_blank\" rel=\"noopener noreferrer\">Deep Learning IO (DLIO) benchmarks<\/a> and saw similar performance gains, even with larger files. 
Given this broad performance boost, we recommend using Dataflux Dataset over other libraries or direct Cloud Storage API calls for training workflows.<\/p>\n<p>Key Dataflux Dataset features include:<\/p>\n<ul>\n<li>\n<p role=\"presentation\"><strong>Direct Cloud Storage integration:<\/strong> Eliminate the need to download data locally first.<\/p>\n<\/li>\n<li>\n<p role=\"presentation\"><strong>Performance optimization:<\/strong> Achieve up to 3.5x faster training times, especially with small files.<\/p>\n<\/li>\n<li>\n<p role=\"presentation\"><strong>PyTorch Dataset primitive:<\/strong> Work seamlessly with familiar PyTorch concepts.<\/p>\n<\/li>\n<li>\n<p role=\"presentation\"><strong>Checkpointing support:<\/strong> Save and load model checkpoints directly to\/from Cloud Storage.<\/p>\n<\/li>\n<\/ul>\n<h3><strong>Using Dataflux Datasets<\/strong><\/h3>\n<ol>\n<li>\n<p role=\"presentation\"><strong>Prerequisites:<\/strong> Python 3.8+<\/p>\n<\/li>\n<li>\n<p role=\"presentation\"><strong>Installation: <\/strong><strong>$ <\/strong><code>pip install gcs-torch-dataflux<\/code><\/p>\n<\/li>\n<li>\n<p role=\"presentation\"><strong>Authentication:<\/strong> Use Google Cloud <a href=\"https:\/\/cloud.google.com\/docs\/authentication\/provide-credentials-adc\">application-default authentication<\/a><\/p>\n<\/li>\n<\/ol>\n<p><strong>Example: Loading images for training<\/strong><\/p>\n<p>Only a few changes are needed to enable the Dataflux Dataset. If you\u2019re using PyTorch and have data in Cloud Storage, you have most likely written your own Dataset implementation. The snippet below shows how easy it is to create a Dataflux Dataset. 
For further details, check out our <a href=\"https:\/\/github.com\/GoogleCloudPlatform\/dataflux-pytorch\/tree\/main\" target=\"_blank\" rel=\"noopener noreferrer\">GitHub<\/a> page.<\/p>\n<\/div>\n<div class=\"block-code\">\n<pre><code>import numpy\nimport io\nfrom PIL import Image\nfrom dataflux_pytorch import dataflux_mapstyle_dataset\n\n\ndef transform(img_in_bytes):\n    # Decode the raw object bytes into a NumPy array.\n    return numpy.asarray(Image.open(io.BytesIO(img_in_bytes)))\n\n\n# PROJECT_NAME, BUCKET_NAME, and PREFIX are placeholders for your own values.\ndataset = dataflux_mapstyle_dataset.DatafluxMapStyleDataset(\n    project_name=PROJECT_NAME,\n    bucket_name=BUCKET_NAME,\n    config=dataflux_mapstyle_dataset.Config(prefix=PREFIX),\n    data_format_fn=transform,\n)\n\n# Use &quot;dataset&quot; as usual in your ML training loop in combination with PyTorch DataLoader.<\/code><\/pre>\n<\/div>\n<div class=\"block-paragraph_advanced\">\n<h3><strong>Under the hood<\/strong><\/h3>\n<p>To achieve such significant performance gains for Dataflux, we addressed the data-loading performance bottlenecks in ML training workflows. In a training run, data is loaded in batches from storage and, after some processing, is sent from CPU to GPU for ML training computations. If reading and constructing a batch takes longer than GPU computation, then the GPU is effectively stalled and underutilized, leading to longer training times.<\/p>\n<p>When data is in a cloud-based object storage system (like Google\u2019s Cloud Storage), it takes longer to fetch the data than from a local disk, especially if the data is in small objects. This is due to time-to-first-byte latency. Once an object is \u2018opened\u2019 though, the cloud storage platform provides high throughput. 
In Dataflux, we employ a Cloud Storage feature called <a href=\"https:\/\/cloud.google.com\/storage\/docs\/composing-objects\">Compose Objects<\/a> that can dynamically combine many smaller objects into a larger object. Then, instead of fetching (say) 1024 small objects (the batch size), we fetch only 30 larger objects and download those to memory. The larger objects are then decomposed back into their individual smaller objects and served as the dataset samples. Any temporary composed objects created in the process are also cleaned up.<\/p>\n<p>Another optimization that Dataflux Datasets employs is high-throughput parallel-listing, which speeds up retrieval of the initial metadata needed for the dataset. Dataflux uses a sophisticated algorithm called work-stealing to significantly speed up listings; with it, even the first AI training run, or \u201cepoch,\u201d is faster than with Dataflux Datasets without parallel-listing, even on datasets that have tens of millions of objects.<\/p>\n<p>Together, fast-listing and dynamic-composition help ensure that ML training with Dataflux incurs minimal GPU stalls, resulting in greatly reduced training time and increased accelerator utilization.<\/p>\n<p>Fast-listing and dynamic-composition are part of the <a href=\"https:\/\/github.com\/GoogleCloudPlatform\/dataflux-client-python\" target=\"_blank\" rel=\"noopener noreferrer\">Dataflux Client Libraries<\/a> and are available on <a href=\"https:\/\/github.com\/GoogleCloudPlatform\/dataflux-pytorch\/tree\/main\" target=\"_blank\" rel=\"noopener noreferrer\">GitHub<\/a>. 
Dataflux Dataset uses these client libraries under the hood.<\/p>\n<h3><strong>Dataflux is available now<\/strong><\/h3>\n<p>Give the <a href=\"https:\/\/github.com\/GoogleCloudPlatform\/dataflux-pytorch\/tree\/main\" target=\"_blank\" rel=\"noopener noreferrer\">Dataflux Dataset for PyTorch<\/a> (or the <a href=\"https:\/\/github.com\/GoogleCloudPlatform\/dataflux-client-python\" target=\"_blank\" rel=\"noopener noreferrer\">Dataflux Python client library<\/a> if writing your own ML training dataset code) a try and <a href=\"mailto:dataflux-customer-support@google.com\">let us know<\/a> how it boosts your workflows!<\/p>\n<p>You can learn more about this and our other AI-related storage capabilities from our Google Cloud Next \u201924 recorded session, \u201cHow to define a storage infrastructure for AI and analytical workloads.\u201d<\/p>\n<\/div>\n<div class=\"block-video\">\n<div class=\"article-module article-video \">\n<figure><img decoding=\"async\" src=\"https:\/\/img.youtube.com\/vi\/A4daQj9tnWk\/maxresdefault.jpg\" alt=\"How to define a storage infrastructure for AI and analytical workloads\" \/>&nbsp;<\/figure>\n<\/div>\n<div class=\"h-c-modal--video\" data-glue-modal=\"uni-modal-A4daQj9tnWk-\" data-glue-modal-close-label=\"Close Dialog\"><a class=\"glue-yt-video\" href=\"https:\/\/youtube.com\/watch?v=A4daQj9tnWk\" data-glue-yt-video-autoplay=\"true\" data-glue-yt-video-height=\"99%\" data-glue-yt-video-vid=\"A4daQj9tnWk\" data-glue-yt-video-width=\"100%\">\u00a0<\/a><\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Introduction Machine learning (ML) models thrive on massive datasets, and fast data loading is key for cost-effective ML training. We recently launched a PyTorch Dataset abstraction, the Dataflux Dataset, for accelerating data loading from Google\u2019s Cloud Storage. Dataflux provides up to 3.5x faster training times compared to fsspec, with small files. 
Today\u2019s launch builds upon [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":9811,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[524,438],"tags":[],"class_list":["post-9712","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-cloud","category-technology"],"acf":{"custom_meta_title":"Introducing Dataflux Dataset for Cloud Storage","meta_description":"Enhance PyTorch AI training with Dataflux Dataset for Cloud Storage. Streamline data access and accelerate training for faster, more efficient results.","meta_keyword":"","other_meta_tag":"<meta property=\"og:title\" content=\"Introducing Dataflux Dataset for Cloud Storage\">\r\n<meta property=\"og:site_name\" content=NTSPL>\r\n<meta property=\"og:url\" content=https:\/\/www.ntspl.co.in\/blog\/introducing-dataflux-dataset-for-cloud-storage-to-accelerate-pytorch-ai-training\/>\r\n<meta property=\"og:description\" content=Enhance PyTorch AI training with Dataflux Dataset for Cloud Storage. Streamline data access and accelerate training for faster, more efficient results.>\r\n<meta property=\"og:type\" content=\"Article\">\r\n<meta property=\"og:image\" content=https:\/\/www.ntspl.co.in\/blog\/wp-content\/uploads\/2024\/05\/Blog-Image-data-flux.jpg>\r\n\r\n<meta name=\"twitter:site\" content=\"@NTSPL\">\r\n<meta name=twitter:card content=\"summary\" \/>\r\n<meta name=twitter:description content=\"Enhance PyTorch AI training with Dataflux Dataset for Cloud Storage. 
Streamline data access and accelerate training for faster, more efficient results\"\/>\r\n<meta name=twitter:title content=\"Introducing Dataflux Dataset for Cloud Storage\"\/>"},"_links":{"self":[{"href":"https:\/\/www.ntspl.co.in\/blog\/wp-json\/wp\/v2\/posts\/9712"}],"collection":[{"href":"https:\/\/www.ntspl.co.in\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.ntspl.co.in\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.ntspl.co.in\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.ntspl.co.in\/blog\/wp-json\/wp\/v2\/comments?post=9712"}],"version-history":[{"count":2,"href":"https:\/\/www.ntspl.co.in\/blog\/wp-json\/wp\/v2\/posts\/9712\/revisions"}],"predecessor-version":[{"id":10524,"href":"https:\/\/www.ntspl.co.in\/blog\/wp-json\/wp\/v2\/posts\/9712\/revisions\/10524"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.ntspl.co.in\/blog\/wp-json\/wp\/v2\/media\/9811"}],"wp:attachment":[{"href":"https:\/\/www.ntspl.co.in\/blog\/wp-json\/wp\/v2\/media?parent=9712"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.ntspl.co.in\/blog\/wp-json\/wp\/v2\/categories?post=9712"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.ntspl.co.in\/blog\/wp-json\/wp\/v2\/tags?post=9712"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}