
[Train][Doc] Update PyTorch Data Ingestion User Guide #45421

Open · wants to merge 1 commit into base: master

Conversation

@woshiyyya (Member) commented May 17, 2024

Why are these changes needed?

Restructured the "Starting from PyTorch Data" section for better readability.

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
@matthewdeng (Contributor) left a comment:

Nice! Love the added benefits & cleaner steps!

Comment on lines -261 to -262
These utilities can still be used directly with Ray Train. In particular, you may want to do this if you already have your data ingestion pipeline set up.
However, for more performant large-scale data ingestion we do recommend migrating to Ray Data.
Contributor:

IMO the previous documentation here was more clear.

Member Author:

Ah, I actually changed the order and put the last 2 sentences into a later section. Previously, we mixed Ray Data and framework utilities and discussed them together. This PR tries to disaggregate them and tell the story step by step:

  • Existing methods to ingest PyTorch data
  • The relationship between the existing methods and Ray Data
  • Why migrate to Ray Data (The benefits of Ray Data)
  • How to migrate to Ray Data

But of course this is just my preference; we can switch it back if it doesn't sound better.

Contributor:

I see! We can definitely still do that, but in that case I think we should change the wording a bit in this line, and move the comparison table to the sub-section below. Right now it's just not super clear what it means that "You can still use these framework data utilities" and why there is a comparison table with Ray Data concepts.


.. tab-set::

.. tab-item:: PyTorch Dataset and DataLoader
.. tab-item:: PyTorch
Contributor:

The original names were more explicit to make it clear that this is referring to the dataset framework, rather than the training framework.

@@ -276,34 +275,66 @@ At a high level, you can compare these concepts as follows:
- n/a
- :meth:`ray.data.Dataset.iter_torch_batches`

Why using Ray Data?
Contributor:

Suggested change
Why using Ray Data?
Comparison with Ray Data


- Ray Data can utilize all resources in the Ray cluster for preprocessing, not just those on your training nodes.

For more details, see the following sections for each framework:
Contributor:

Let's add another sub-header here, since the table is no longer explaining why to use Ray Data, but rather giving instructions on how to integrate an existing dataset with Ray Train. Maybe we can name it something like "Using PyTorch data with Ray Train".

Member Author:

Yeah, that's a good idea. I also feel that we are missing something here. Will add a sub-header.

**Option 1 (with Ray Data):** Convert your PyTorch Dataset to a Ray Dataset and pass it into the Trainer via ``datasets`` argument.
Inside your ``train_loop_per_worker``, you can access the dataset via :meth:`ray.train.get_dataset_shard`.
You can convert this to replace the PyTorch DataLoader via :meth:`ray.data.DataIterator.iter_torch_batches`.
1. Convert your PyTorch Dataset to a Ray Dataset and
Contributor:

nit:

Suggested change
1. Convert your PyTorch Dataset to a Ray Dataset and
1. Convert your PyTorch Dataset to a Ray Dataset.

There are some other small typos/formatting errors that I'll review more thoroughly in a follow-up review.
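
For reference, a minimal sketch of the Option 1 flow quoted above (the toy dataset and worker count are placeholders; ray.data.from_torch, ray.train.get_dataset_shard, and iter_torch_batches are the APIs the doc refers to):

import torch
import ray
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

# Placeholder map-style PyTorch Dataset.
class ToyDataset(torch.utils.data.Dataset):
    def __len__(self):
        return 100

    def __getitem__(self, idx):
        return float(idx)

def train_loop_per_worker():
    # Each training worker gets its own shard of the dataset.
    shard = ray.train.get_dataset_shard("train")
    # This loop replaces the PyTorch DataLoader loop.
    for batch in shard.iter_torch_batches(batch_size=32):
        ...  # training step goes here

trainer = TorchTrainer(
    train_loop_per_worker,
    # Convert the PyTorch Dataset to a Ray Dataset and pass it to the Trainer.
    datasets={"train": ray.data.from_torch(ToyDataset())},
    scaling_config=ScalingConfig(num_workers=2),
)
trainer.fit()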


For instructions, see :ref:`Ray Data for Hugging Face <loading_datasets_from_ml_libraries>`.
**Option 2 (with HuggingFace Dataset):**
Contributor:

nit: I understand why you chose to do this but I'm also a little worried this might be confusing since Option 1 technically does also use Hugging Face Datasets.

Member Author (@woshiyyya) commented May 18, 2024:

Oh, I realized the difference now. Previously, this section aimed at teaching users how to convert their HF dataset to a Ray Dataset, then do training. But this PR tries to categorize directly by what we eventually use in the training function.

# prev
HF Dataset -> Ray Data -> HF Transformers
HF Dataset -> HF Transformers

# now
Ray Data -> HF Transformers
HF Dataset -> HF Transformers

My consideration here is that we'd better not force everyone to take the "HF Dataset -> Ray Data" conversion step.

For example, their original data format could be Parquet, and before onboarding Ray, they had already built an HF Dataset from the Parquet files and fed it to the HF Trainer.

In this case, they can build a Ray Dataset either from the Parquet files or from the HF Dataset.

# Before onboarding Ray
raw data -> HF dataset -> HF transformer

# After onboarding Ray
option 1: raw data -> HF dataset -> Ray Data -> HF transformer

vs.

option 2:  raw data -> Ray Data -> HF transformer

We can discuss more in person next week.
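
To make the two onboarding paths concrete, a rough sketch (the Parquet file name is a placeholder; ray.data.from_huggingface and ray.data.read_parquet are existing Ray Data APIs):

import ray
from datasets import load_dataset  # Hugging Face Datasets

# Option 1: raw data -> HF Dataset -> Ray Data -> HF Transformers
hf_ds = load_dataset("parquet", data_files="train.parquet", split="train")
ray_ds = ray.data.from_huggingface(hf_ds)

# Option 2: raw data -> Ray Data -> HF Transformers
ray_ds = ray.data.read_parquet("train.parquet")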

Comment on lines +287 to +298
**Streaming execution**:

- The preprocessing pipeline will be executed lazily and stream the data batches into training workers.
- Training can start immediately without significant up-front preprocessing time.

**Automatic data sharding**:

- The dataset will be automatically sharded across all training workers.

For more details, see the following sections for each framework.
**Leverage additional resources for preprocessing**

- Ray Data can utilize all resources in the Ray cluster for preprocessing, not just those on your training nodes.
Contributor:

This is good content that I think everyone should read, regardless of whether or not they are starting with PyTorch data. Do you think we could bring this higher up in the guide (e.g. even in the introduction), and then reference it from here?

Member Author:

OK. Sounds good to me.
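
As a small illustration of the streaming and preprocessing benefits quoted above (the S3 path, column name, and preprocessing function are hypothetical):

import ray

def normalize(batch):
    # Hypothetical preprocessing step; Ray Data can schedule it on any
    # available cluster resources, not just the training nodes.
    batch["value"] = batch["value"] / 255.0
    return batch

# Execution is lazy: batches are preprocessed and streamed to the training
# workers on demand, so training can start without a long up-front
# preprocessing phase.
ds = ray.data.read_parquet("s3://bucket/data")  # hypothetical path
ds = ds.map_batches(normalize)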

@justinvyu (Contributor) left a comment:

High-level comments:

  1. Can we copy some content from this blog post? This user guide should be the place to compare Ray Data against other data ingest solutions. Particularly, I'm thinking of copying over the diagrams as well as the table comparing against torch dataloader, HF dataset, tf data, etc.
  2. Proposed restructure of this guide:
(Ray Data + Ray Train) Quickstart
    -> Code examples, with the "Option 1: Ray Data" moved over here under each framework.
Why use Ray Data?
    -> Comparison with Other Data Ingest Solutions
        -> Comparison table
Alternative to Ray Data Ingest (Framework-native Dataloaders)
    -> "Ray Data is the recommended data loading solution for scalable blah blah blah, but Ray Train still integrates well with existing dataloading solutions you may be using, such as X, Y, Z.
    -> Link to the framework user guides, since we already go over how to set up the framework native dataloaders.
Ray Data Configurations
    -> all the remaining sections become subsections.

I think this structure fixes the problem where I was getting lost in the middle of the user guide because it suddenly starts talking about pytorch dataloader -- it wasn't clear that there are 2 separate paths: Ray Data vs. Alternatives. Now, we first put Ray Data front and center and make the case for it. Then, we talk about alternatives that are still integrated nicely.

  3. For a follow-up PR, it would be nice to have some more realistic examples. For example, show read_parquet("s3://...") instead of the from_items dummy dataset that we have right now in the torch Ray Data quickstart, roughly as sketched below. We can borrow this from the blog post again.
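
For instance, the contrast being suggested might look like this (the bucket path is hypothetical):

import ray

# Current quickstart style: dummy in-memory dataset.
ds = ray.data.from_items([{"x": i, "y": 2 * i} for i in range(200)])

# More realistic: read training data from cloud storage.
ds = ray.data.read_parquet("s3://my-bucket/train-data/")  # hypothetical path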
