FEAT-#4605: Adding small query compiler #7259

arunjose696 · 2024-05-13T18:53:19Z

What do these changes do?

first commit message and PR title follow format outlined here

NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.
passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
signed commit with git commit -s
Resolves Handle Empty/Small Data DataFrames as a separate case #4605
tests added and passing
module layout described at docs/development/architecture.rst is up-to-date

modin/experimental/core/storage_formats/pandas/small_query_compiler.py

modin/pandas/dataframe.py

modin/pandas/io.py

modin/pandas/series.py

modin/experimental/core/storage_formats/pandas/small_query_compiler.py

modin/pandas/base.py

YarShev · 2024-05-16T14:17:03Z

modin/experimental/core/storage_formats/pandas/small_query_compiler.py

+        if hasattr(pandas_frame, "_to_pandas"):
+            pandas_frame = pandas_frame._to_pandas()
+        if is_scalar(pandas_frame):
+            pandas_frame = pandas.DataFrame([pandas_frame])
+        elif not isinstance(pandas_frame, pandas.DataFrame):
+            pandas_frame = pandas.DataFrame(pandas_frame)


Why so many conditions? Don't we always pass a pandas DataFrame into this constructor?

The input can be a scalar when some pandas functions are called here, for example df.equals(), so the constructor needs to handle this case.

The hasattr(pandas_frame, "_to_pandas") case was included to allow casting from PandasQC to PlainPandasQueryCompiler, which may be needed later.
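
To make the three branches concrete, here is a minimal sketch of the same normalization as a standalone function. The helper name normalize_to_dataframe is hypothetical and not part of the PR; only the branch logic mirrors the diff above.

```python
import pandas
from pandas.api.types import is_scalar


def normalize_to_dataframe(obj):
    """Coerce the accepted input kinds into a pandas.DataFrame.

    Hypothetical helper for illustration; it mirrors the branches in the
    constructor under review but is not the PR's code.
    """
    # Objects exposing `_to_pandas` (for example another query compiler's
    # frame) are materialized first.
    if hasattr(obj, "_to_pandas"):
        obj = obj._to_pandas()
    # A scalar result, such as the bool returned by `df.equals`, is wrapped
    # into a 1x1 frame.
    if is_scalar(obj):
        return pandas.DataFrame([obj])
    # Anything else that is not already a DataFrame (Series, dict, list)
    # goes through the DataFrame constructor.
    if not isinstance(obj, pandas.DataFrame):
        return pandas.DataFrame(obj)
    return obj


df = pandas.DataFrame({"a": [1, 2]})
same = normalize_to_dataframe(df)                  # already a DataFrame
wrapped = normalize_to_dataframe(df.equals(df))    # scalar bool input
from_series = normalize_to_dataframe(df["a"])      # Series input
```

With this shape, a DataFrame passes through untouched, while scalar and Series inputs both come back as DataFrames.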

modin/pandas/base.py

modin/pandas/dataframe.py

modin/pandas/io.py

modin/pandas/series.py

modin/experimental/core/storage_formats/pandas/small_query_compiler.py

docs/conf.py

modin/config/envvars.py

YarShev · 2024-05-16T17:10:15Z

modin/core/dataframe/algebra/default2pandas/binary.py

@@ -47,7 +47,6 @@ def bin_ops_wrapper(df, other, *args, **kwargs):
                "squeeze_other", False
            )
            squeeze_self = kwargs.pop("squeeze_self", False)
-


Should be reverted.

YarShev · 2024-05-16T17:11:30Z

modin/pandas/base.py

+        if isinstance(self._query_compiler, PlainPandasQueryCompiler):
+            return self._query_compiler.to_pandas().iloc[indexer]


Why is this change needed? Isn't the line below sufficient?

modin/pandas/dataframe.py

YarShev · 2024-05-16T17:20:49Z

modin/pandas/io.py

+    if UsePlainPandasQueryCompiler.get():
+        return ModinObjects.DataFrame(query_compiler=PlainPandasQueryCompiler(df))


Maybe we should move this to BaseFactory._to_pandas?

Maybe you mean from_pandas?

Moved to BaseFactory.from_pandas.

modin/pandas/dataframe.py

modin/pandas/series.py

setup.cfg

modin/pandas/utils.py

devin-petersohn

Great start on solving this problem! Is it possible to avoid so many of the test changes?

devin-petersohn · 2024-05-22T15:34:27Z

modin/config/envvars.py

@@ -851,4 +851,11 @@ def _check_vars() -> None:
        )


+class UsePlainPandasQueryCompiler(EnvironmentVariable, type=bool):


This name is probably a little confusing for users. I suggest something like SmallDataframeMode. This can be set to None by default, and users can set it to "pandas" or some other option in the future (we may have some other single node options coming).

@devin-petersohn, do you think VanillaPandasMode is a good option? Also, why do you think we should make this config of string type to have choices None/pandas/etc.? Wouldn't it be sufficient to have this config boolean - enable/disable?

In the future we may add polars mode. If this happens, we might also want to have an option for that. Making it a string keeps it open to other options. If we have pandas in the name, we can only use that mode for pandas execution. I'm open to other names, but I think we don't want to keep adding more and more configs if we have more options later.

Doesn't this sound like we may have multiple storage formats for a single execution? Do we really want to support this in future?

Potentially, yes I think this is something we could support in the future.

@devin-petersohn, do you think we could support automatic initialization with small qc depending on a data size threshold in future?

I propose to rename UsePlainPandasQueryCompiler to NativeDataframeMode and SmallQueryCompiler to NativeQueryCompiler by sort of analogy with HdkOnNative we had previously.

At a minimum, a more complete definition of this class in the docstring is required.

I will rename UsePlainPandasQueryCompiler to NativeDataframeMode and SmallQueryCompiler to NativeQueryCompiler.
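
For readers following the bool-versus-string discussion above, here is a minimal sketch of a string-valued setting with a fixed set of choices. It uses only the standard library; the variable name EXAMPLE_DATAFRAME_MODE, the CHOICES tuple, and get_mode are assumptions for illustration and are not Modin's configuration API.

```python
import os
from typing import Mapping, Optional

# Hypothetical names, not Modin's actual config: an environment variable
# and the modes it may select. Supporting another backend later would mean
# adding one entry here instead of introducing a new boolean flag.
ENV_VAR = "EXAMPLE_DATAFRAME_MODE"
CHOICES = ("pandas",)


def get_mode(environ: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the selected mode, or None when the variable is unset."""
    raw = environ.get(ENV_VAR)
    if raw is None or raw.strip() == "":
        return None
    value = raw.strip().lower()
    if value not in CHOICES:
        raise ValueError(
            f"{ENV_VAR} must be one of {CHOICES} or unset, got {raw!r}"
        )
    return value
```

A boolean flag can only express on or off, whereas this shape keeps a single setting open to further single-node options, which is the trade-off raised in the thread.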

arunjose696 · 2024-05-22T16:31:33Z

Great start on solving this problem! Is it possible to avoid so many of the test changes?

Most of the test changes disable a few checks that cannot be supported without partitions, and others are needed because the current changes do not yet support IO such as pd.read_csv(). Is there something specific that should be avoided?

devin-petersohn · 2024-05-22T16:45:06Z

is there something specific that should be avoided?

Nothing specific, I was just trying to understand context. Thanks!

modin/pandas/dataframe.py

anmyachev · 2024-06-05T10:15:53Z

@arunjose696 please rebase on main


modin/core/execution/dispatching/factories/factories.py

+from modin.experimental.core.storage_formats.pandas.native_query_compiler import (
+    NativeQueryCompiler,
+)


modin/experimental/core/storage_formats/pandas/native_query_compiler.py

+from pandas.core.dtypes.common import is_list_like, is_scalar
+
+from modin.config.envvars import NativeDataframeMode
+from modin.core.storage_formats.base.query_compiler import BaseQueryCompiler


modin/experimental/core/storage_formats/pandas/native_query_compiler.py

+    # if len(df.columns) == 1 and df.columns[0] == "__reduced__":
+    #     df = df["__reduced__"]


anmyachev · 2024-06-07T13:56:11Z

modin/core/storage_formats/base/query_compiler.py

@@ -4574,6 +4574,17 @@ def frame_has_dtypes_cache(self) -> bool:
        """
        return self._modin_frame.has_dtypes_cache

+    def has_dtypes_cache(self) -> bool:


Should be removed, right?

anmyachev · 2024-06-07T13:57:02Z

modin/core/storage_formats/pandas/query_compiler.py

+    def has_materialized_dtypes(self):
+        """
+        Check if the underlying modin frame has materialized dtypes.
+
+        Returns
+        -------
+        bool
+            True if the underlying modin frame has materialized dtypes and False otherwise.
+        """
+        return self._modin_frame.has_materialized_dtypes
+
+    def set_frame_dtypes_cache(self, dtypes):
+        """
+        Set dtypes cache for the underlying modin frame.
+
+        Parameters
+        ----------
+        dtypes : pandas.Series, ModinDtypes, callable or None
+        """
+        self._modin_frame.set_dtypes_cache(dtypes)
+
+    def has_dtypes_cache(self) -> bool:
+        """
+        Check if the dtypes cache exists for the underlying modin frame.
+
+        Returns
+        -------
+        bool
+        """
+        return self._modin_frame.has_dtypes_cache


Should be removed?

anmyachev · 2024-06-07T14:03:13Z

.github/workflows/ci.yml

+      - run: python -m pytest modin/tests/pandas/dataframe/test_reduce.py
+      - run: python -m pytest modin/tests/pandas/dataframe/test_udf.py
+      - run: python -m pytest modin/tests/pandas/dataframe/test_window.py
+      - uses: codecov/codecov-action@v2


?

Suggested change

-      - uses: codecov/codecov-action@v2
+      - uses: ./.github/actions/upload-coverage

anmyachev · 2024-06-07T14:05:16Z

modin/pandas/series.py

@@ -144,7 +144,6 @@ def __init__(
                name = MODIN_UNNAMED_SERIES_LABEL
                if isinstance(data, pandas.Series) and data.name is not None:
                    name = data.name
-


please revert

anmyachev · 2024-06-07T14:16:54Z

modin/tests/pandas/dataframe/test_default.py

@@ -87,7 +88,11 @@
 )
 def test_ops_defaulting_to_pandas(op, make_args):
    modin_df = pd.DataFrame(test_data_diff_dtype).drop(["str_col", "bool_col"], axis=1)
-    with warns_that_defaulting_to_pandas():
+    with (
+        warns_that_defaulting_to_pandas()


It's better to update warns_that_defaulting_to_pandas itself to work with NativeDataframeMode.

Updated warns_that_defaulting_to_pandas to return nullcontext when NativeDataframeMode is set.
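
The idea can be sketched with the standard library alone. The helper below is a hypothetical stand-in, not the project's test utility: warns_unless_native and its native_mode argument are assumed names, and the warning check uses the warnings module instead of whatever the real helper relies on.

```python
import warnings
from contextlib import contextmanager, nullcontext


@contextmanager
def _expect_default_to_pandas_warning():
    # Record every warning raised inside the block, then require that at
    # least one of them mentions defaulting to pandas.
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        yield
    messages = [str(w.message).lower() for w in caught]
    assert any("defaulting to pandas" in m for m in messages), (
        "expected a 'defaulting to pandas' warning"
    )


def warns_unless_native(native_mode: bool):
    """Return a no-op context in native mode, a warning check otherwise.

    Hypothetical helper: in native mode nothing defaults to pandas, so
    there is no warning to expect.
    """
    if native_mode:
        return nullcontext()
    return _expect_default_to_pandas_warning()
```

Keeping the switch inside the helper means call sites such as test_ops_defaulting_to_pandas stay unchanged, which is the point of the review comment.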

arunjose696 · 2024-06-10T16:54:21Z

With the introduction of the small query compiler, we need to test the interoperability between DataFrames using different query compilers. For example, performing a binary operation between a DataFrame with the small query compiler and another with the Pandas query compiler. (Note: This feature is not yet included in this PR.)

This will require modifying or adding new tests. In the current tests in the modin/modin/tests/pandas/dataframe folder, we have the following scenarios where two DataFrames interact:

1) Derived DataFrames: In tests where the second DataFrame is created or derived from the first, e.g. test_join_empty, we need to refactor the tests so that the second DataFrame is created separately from the first and with MODIN_NATIVE_DATAFRAME_MODE set.

2) Lambda Functions: In tests where the other DataFrame is created within a lambda function, e.g. test___divmod__, we need to refactor the tests to either create the second DataFrame in the test definition itself or wrap the lambda functions so that the DataFrame is created with a different query compiler.

3) Separate DataFrames: In tests where two separate DataFrames are used, e.g. test_where, we need to refactor the tests to flip MODIN_NATIVE_DATAFRAME_MODE between None and Native_pandas when creating the first and second DataFrames. This ensures that both the left and right operands are tested with different query compilers for interoperability. The same flipping would be required in cases 1 and 2 once the DataFrames are separated.

Upon reviewing the modin/modin/tests/pandas/dataframe folder, I found approximately 100 tests involving scenarios where two DataFrames interact. These tests may need refactoring or copying to a different directory and updating to specifically test interoperability.
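
One way to picture the flipping described in case 3 is to enumerate the mode pairs once and build each operand through a factory. This is only a sketch under assumptions: MODES reuses the two values named above, while mixed_mode_pairs, build_operands, and the create callback are hypothetical helpers, not existing test utilities.

```python
import itertools
from typing import Callable, List, Optional, Tuple

# The two mode values named in the comment above.
MODES: Tuple[Optional[str], ...] = (None, "Native_pandas")


def mixed_mode_pairs() -> List[Tuple[Optional[str], Optional[str]]]:
    """Return (left_mode, right_mode) pairs whose modes differ."""
    return [
        (left, right)
        for left, right in itertools.product(MODES, repeat=2)
        if left != right
    ]


def build_operands(
    create: Callable[[Optional[str]], object],
    left_mode: Optional[str],
    right_mode: Optional[str],
) -> Tuple[object, object]:
    """Create both operands independently, each under its own mode.

    `create` is a caller-supplied factory (hypothetical) that builds a
    DataFrame with the given mode in effect, so the right operand is never
    derived from the left one.
    """
    return create(left_mode), create(right_mode)
```

The list returned by mixed_mode_pairs could then be fed to a test parametrization so that every binary-operation test runs once per ordering of the two query compilers.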

@YarShev @anmyachev @devin-petersohn, could you please provide suggestions on how to approach testing the interoperability?

arunjose696 requested review from devin-petersohn, mvashishtha, RehanSD, YarShev, vnlitvinov, anmyachev, dchigarev and a team as code owners May 13, 2024 18:53

github-advanced-security bot found potential problems May 13, 2024


arunjose696 changed the title ~~Adding small query compiler~~ FEAT-#4605: Adding small query compiler May 13, 2024

arunjose696 force-pushed the arun-sqc branch 3 times, most recently from f80e353 to 41bab97 Compare May 16, 2024 11:45

YarShev force-pushed the arun-sqc branch from 41bab97 to b6dc27c Compare May 16, 2024 12:33

github-advanced-security bot found potential problems May 16, 2024


YarShev reviewed May 16, 2024


YarShev force-pushed the arun-sqc branch from 8c6544e to 165360f Compare May 16, 2024 15:17

github-advanced-security bot found potential problems May 16, 2024


YarShev reviewed May 16, 2024


arunjose696 force-pushed the arun-sqc branch from b9f1dc3 to df6b6dc Compare May 22, 2024 13:11

github-advanced-security bot found potential problems May 22, 2024


arunjose696 force-pushed the arun-sqc branch from df6b6dc to 1cd75e2 Compare May 22, 2024 13:15

devin-petersohn reviewed May 22, 2024


arunjose696 marked this pull request as draft May 22, 2024 19:49

arunjose696 force-pushed the arun-sqc branch 2 times, most recently from e6b035f to d406414 Compare May 23, 2024 11:08

github-advanced-security bot found potential problems May 23, 2024


arunjose696 force-pushed the arun-sqc branch from e9dbc16 to 631dbf2 Compare May 23, 2024 20:50

arunjose696 force-pushed the arun-sqc branch 2 times, most recently from f1eff14 to 8bc38b8 Compare May 29, 2024 14:11

arunjose696 mentioned this pull request May 29, 2024

REFACTOR: Minimize the access of methods _modin_frame methods from ._query_compiler layer #7294

Closed

arunjose696 and others added 12 commits June 5, 2024 05:24

FEAT-modin-project#4605: Add small query compiler

140258b

fixing tests

4e96742

removing additional parameter from try_cast_to_pandas

352ca3b

Signed-off-by: arunjose696 <arunjose696@gmail.com>

test_iter passing

027032f

fixing isin unique and clip

8c93cd1

Signed-off-by: arunjose696 <arunjose696@gmail.com>

Enable test_default.py and test_join_sort.py

d67f9a0

Signed-off-by: Igoshev, Iaroslav <iaroslav.igoshev@intel.com>

fixed test_map_metadata by adding set_frame_dtypes_cache and has_mate…

f66c8a0

…rialized_dtypes to query compiler layer as in the code in multiple places the methods of private _modin_frame were used

Fix test_dot

f87e6c3

Signed-off-by: Igoshev, Iaroslav <iaroslav.igoshev@intel.com>

test_udf passing

70cdc1c

All tests except one passing in modin/tests/pandas/dataframe

ba547f3

All tests in modin/tests/pandas/dataframe/ passing

cfb0847

PR comments

c486925

arunjose696 force-pushed the arun-sqc branch 2 times, most recently from f917af6 to eae7f93 Compare June 5, 2024 12:17

github-advanced-security bot found potential problems Jun 5, 2024


arunjose696 force-pushed the arun-sqc branch 3 times, most recently from 0a82d11 to ee5d15a Compare June 5, 2024 12:55

renaming to PlainPandasQueryCompiler to NativeDataframeMode

1984aa1

arunjose696 force-pushed the arun-sqc branch from ee5d15a to 1984aa1 Compare June 5, 2024 15:25

anmyachev reviewed Jun 7, 2024


arunjose696 added 2 commits June 10, 2024 03:48

renaming to PlainPandasQueryCompiler to NativeDataframeMode

625da37

PR comments + changes

da02e5f

arunjose696 force-pushed the arun-sqc branch from b1b8ff3 to da02e5f Compare June 10, 2024 08:56

arunjose696 marked this pull request as ready for review June 10, 2024 10:33
