Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FEAT-#4605: Adding small query compiler #7259

Open
wants to merge 15 commits into
base: main
Choose a base branch
from

Conversation

arunjose696
Copy link
Collaborator

What do these changes do?

  • first commit message and PR title follow format outlined here

    NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.

  • passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
  • passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
  • signed commit with git commit -s
  • Resolves Handle Empty/Small Data DataFrames as a separate case #4605
  • tests added and passing
  • module layout described at docs/development/architecture.rst is up-to-date

@arunjose696 arunjose696 changed the title Adding small query compiler FEAT-#4605: Adding small query compiler May 13, 2024
@arunjose696 arunjose696 force-pushed the arun-sqc branch 3 times, most recently from f80e353 to 41bab97 Compare May 16, 2024 11:45
modin/pandas/base.py Fixed Show fixed Hide fixed
Comment on lines 623 to 626
if hasattr(pandas_frame, "_to_pandas"):
pandas_frame = pandas_frame._to_pandas()
if is_scalar(pandas_frame):
pandas_frame = pandas.DataFrame([pandas_frame])
elif not isinstance(pandas_frame, pandas.DataFrame):
pandas_frame = pandas.DataFrame(pandas_frame)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why so many conditions? Don't we always pass a pandas DataFrame in to this constructor?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The dataframe can be scalar when calling some pandas functions here , for eg in df.equals(). Thus we need to address this case in constructor.

The case if hasattr(pandas_frame, "_to_pandas") was included to allow casting from PandasQC to PlainPandasQueryCompiler which may be needed later.

modin/pandas/base.py Fixed Show fixed Hide fixed
modin/pandas/dataframe.py Fixed Show fixed Hide fixed
modin/pandas/io.py Fixed Show fixed Hide fixed
modin/pandas/series.py Fixed Show fixed Hide fixed
docs/conf.py Outdated Show resolved Hide resolved
modin/config/envvars.py Outdated Show resolved Hide resolved
@@ -47,7 +47,6 @@ def bin_ops_wrapper(df, other, *args, **kwargs):
"squeeze_other", False
)
squeeze_self = kwargs.pop("squeeze_self", False)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be reverted.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

Comment on lines 289 to 290
if isinstance(self._query_compiler, PlainPandasQueryCompiler):
return self._query_compiler.to_pandas().iloc[indexer]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this change needed? Isn't it sufficient the line below?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed

modin/pandas/dataframe.py Outdated Show resolved Hide resolved
if UsePlainPandasQueryCompiler.get():
return ModinObjects.DataFrame(query_compiler=PlainPandasQueryCompiler(df))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should move this to BaseFactory._to_pandas?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe you mean from_pandas?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved to BaseFactory.from_pandas

modin/pandas/dataframe.py Outdated Show resolved Hide resolved
modin/pandas/series.py Outdated Show resolved Hide resolved
setup.cfg Outdated Show resolved Hide resolved
modin/pandas/utils.py Fixed Show fixed Hide fixed
Copy link
Collaborator

@devin-petersohn devin-petersohn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great start on solving this problem! Is it possible to avoid so many of the test changes?

@@ -851,4 +851,11 @@ def _check_vars() -> None:
)


class UsePlainPandasQueryCompiler(EnvironmentVariable, type=bool):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This name is probably a little confusing for users. I suggest something like SmallDataframeMode. This can be set to None by default, and users can set it to "pandas" or some other option in the future (we may have some other single node options coming).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@devin-petersohn, do you think VanillaPandasMode is a good option? Also, why do you think we should make this config of string type to have choices None/pandas/etc.? Wouldn't it be sufficient to have this config boolean - enable/disable?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the future we may add polars mode. If this happens, we might also want to have an option for that. Making it a string keeps it open to other options. If we have pandas in the name, we can only use that mode for pandas execution. I'm open to other names, but I think we don't want to keep adding more and more configs if we have more options later.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this sound like we may have multiple storage formats for a single execution? Do we really want to support this in future?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potentially, yes I think this is something we could support in the future.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@devin-petersohn, do you think we could support automatic initialization with small qc depending on a data size threshold in future?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I propose to rename UsePlainPandasQueryCompiler to NativeDataframeMode and SmallQueryCompiler to NativeQueryCompiler by sort of analogy with HdkOnNative we had previously.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At a minimum, a more complete definition of this class in the docstring is required.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will update the name to UsePlainPandasQueryCompiler to NativeDataframeMode and SmallQueryCompiler to NativeQueryCompiler.

@arunjose696
Copy link
Collaborator Author

Great start on solving this problem! Is it possible to avoid so many of the test changes?

The most changes in tests are disabling few checks as it wont be supported without partitions, and as the current changes dont yet support IO like pd.read_csv(), Is there something specific that should be avoided?

@devin-petersohn
Copy link
Collaborator

is there something specific that should be avoided?

Nothing specific, I was just trying to understand context. Thanks!

@arunjose696 arunjose696 marked this pull request as draft May 22, 2024 19:49
@arunjose696 arunjose696 force-pushed the arun-sqc branch 2 times, most recently from e6b035f to d406414 Compare May 23, 2024 11:08
modin/pandas/dataframe.py Fixed Show fixed Hide fixed
modin/pandas/dataframe.py Fixed Show fixed Hide fixed
@anmyachev
Copy link
Collaborator

@arunjose696 please rebase on main

arunjose696 and others added 12 commits June 5, 2024 05:24
Signed-off-by: arunjose696 <arunjose696@gmail.com>
Signed-off-by: arunjose696 <arunjose696@gmail.com>
Signed-off-by: Igoshev, Iaroslav <iaroslav.igoshev@intel.com>
…rialized_dtypes to query compiler layer as in the code in multiple places the methods of private _modin_frame were used
Signed-off-by: Igoshev, Iaroslav <iaroslav.igoshev@intel.com>
@arunjose696 arunjose696 force-pushed the arun-sqc branch 2 times, most recently from f917af6 to eae7f93 Compare June 5, 2024 12:17
Comment on lines +31 to +33
from modin.experimental.core.storage_formats.pandas.native_query_compiler import (
NativeQueryCompiler,
)

Check notice

Code scanning / CodeQL

Cyclic import Note

Import of module
modin.experimental.core.storage_formats.pandas.native_query_compiler
begins an import cycle.
from pandas.core.dtypes.common import is_list_like, is_scalar

from modin.config.envvars import NativeDataframeMode
from modin.core.storage_formats.base.query_compiler import BaseQueryCompiler

Check notice

Code scanning / CodeQL

Cyclic import Note

Import of module
modin.core.storage_formats.base.query_compiler
begins an import cycle.
Comment on lines +446 to +447
# if len(df.columns) == 1 and df.columns[0] == "__reduced__":
# df = df["__reduced__"]

Check notice

Code scanning / CodeQL

Commented-out code Note

This comment appears to contain commented-out code.
@arunjose696 arunjose696 force-pushed the arun-sqc branch 3 times, most recently from 0a82d11 to ee5d15a Compare June 5, 2024 12:55
@@ -4574,6 +4574,17 @@ def frame_has_dtypes_cache(self) -> bool:
"""
return self._modin_frame.has_dtypes_cache

def has_dtypes_cache(self) -> bool:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be removed, right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed

Comment on lines 368 to 397
def has_materialized_dtypes(self):
"""
Check if the undelying modin frame has materialized dtypes

Returns
-------
bool
True if if the undelying modin frame and False otherwise.
"""
return self._modin_frame.has_materialized_dtypes

def set_frame_dtypes_cache(self, dtypes):
"""
Set dtypes cache for the underlying modin frame.

Parameters
----------
dtypes : pandas.Series, ModinDtypes, callable or None
"""
self._modin_frame.set_dtypes_cache(dtypes)

def has_dtypes_cache(self) -> bool:
"""
Check if the dtypes cache exists for the underlying modin frame.

Returns
-------
bool
"""
return self._modin_frame.has_dtypes_cache
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be removed?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

- run: python -m pytest modin/tests/pandas/dataframe/test_reduce.py
- run: python -m pytest modin/tests/pandas/dataframe/test_udf.py
- run: python -m pytest modin/tests/pandas/dataframe/test_window.py
- uses: codecov/codecov-action@v2
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

Suggested change
- uses: codecov/codecov-action@v2
- uses: ./.github/actions/upload-coverage

@@ -144,7 +144,6 @@ def __init__(
name = MODIN_UNNAMED_SERIES_LABEL
if isinstance(data, pandas.Series) and data.name is not None:
name = data.name

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please revert

@@ -87,7 +88,11 @@
)
def test_ops_defaulting_to_pandas(op, make_args):
modin_df = pd.DataFrame(test_data_diff_dtype).drop(["str_col", "bool_col"], axis=1)
with warns_that_defaulting_to_pandas():
with (
warns_that_defaulting_to_pandas()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's better to update warns_that_defaulting_to_pandas itself to work with NativeDataframeMode.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated warns_that_defaulting_to_pandas to return nullcontext in case of NativeDataframeMode

@arunjose696 arunjose696 marked this pull request as ready for review June 10, 2024 10:33
@arunjose696
Copy link
Collaborator Author

arunjose696 commented Jun 10, 2024

With the introduction of the small query compiler, we need to test the interoperability between DataFrames using different query compilers. For example, performing a binary operation between a DataFrame with the small query compiler and another with the Pandas query compiler. (Note: This feature is not yet included in this PR.)

This will require modifying or adding new tests. In the current tests in the modin/modin/tests/pandas/dataframe folder, we have the following scenarios where two DataFrames interact:

1)Derived DataFrames: In tests where the second DataFrame is created or derived from the first, egtest_join_empty, we need to refactor these tests so that the second DataFrame is created separately from the first and with MODIN_NATIVE_DATAFRAME_MODE set.

2)Lambda Functions: In tests where the other DataFrame is created within a lambda function, eg test___divmod__, we need to refactor these tests to either create the second DataFrame in the test definition itself or provide an additional wrapper for the lambda functions to ensure the DataFrame is created with a different query compilers.

3)Separate DataFrames: In tests where two separate DataFrames are used, eg test_where, we need to refactor these tests to include flipping the MODIN_NATIVE_DATAFRAME_MODE to None and Native_pandas when creating both the first and second DataFrame. This ensures that both the left and right operands are tested with different query compilers for interoperability. This flipping would also be required in cases mentioned in 1 and 2 after dataframes are separated.

Upon reviewing the modin/modin/tests/pandas/dataframe folder, I found approximately 100 tests involving scenarios where two DataFrames interact. These tests may need refactoring or copying to a different directory and updating to specifically test interoperability.

@YarShev @anmyachev @devin-petersohn, could you please provide suggestions on how to approach testing the interoperability?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Handle Empty/Small Data DataFrames as a separate case
4 participants