-
Notifications
You must be signed in to change notification settings - Fork 647
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FEAT-#4605: Adding small query compiler #7259
base: main
Are you sure you want to change the base?
Conversation
f80e353
to
41bab97
Compare
if hasattr(pandas_frame, "_to_pandas"): | ||
pandas_frame = pandas_frame._to_pandas() | ||
if is_scalar(pandas_frame): | ||
pandas_frame = pandas.DataFrame([pandas_frame]) | ||
elif not isinstance(pandas_frame, pandas.DataFrame): | ||
pandas_frame = pandas.DataFrame(pandas_frame) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why so many conditions? Don't we always pass a pandas DataFrame in to this constructor?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The dataframe can be scalar when calling some pandas functions here , for eg in df.equals(). Thus we need to address this case in constructor.
The case if hasattr(pandas_frame, "_to_pandas")
was included to allow casting from PandasQC to PlainPandasQueryCompiler which may be needed later.
@@ -47,7 +47,6 @@ def bin_ops_wrapper(df, other, *args, **kwargs): | |||
"squeeze_other", False | |||
) | |||
squeeze_self = kwargs.pop("squeeze_self", False) | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be reverted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
modin/pandas/base.py
Outdated
if isinstance(self._query_compiler, PlainPandasQueryCompiler): | ||
return self._query_compiler.to_pandas().iloc[indexer] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this change needed? Isn't it sufficient the line below?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed
modin/pandas/io.py
Outdated
if UsePlainPandasQueryCompiler.get(): | ||
return ModinObjects.DataFrame(query_compiler=PlainPandasQueryCompiler(df)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should move this to BaseFactory._to_pandas?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe you mean from_pandas
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moved to BaseFactory.from_pandas
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great start on solving this problem! Is it possible to avoid so many of the test changes?
modin/config/envvars.py
Outdated
@@ -851,4 +851,11 @@ def _check_vars() -> None: | |||
) | |||
|
|||
|
|||
class UsePlainPandasQueryCompiler(EnvironmentVariable, type=bool): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This name is probably a little confusing for users. I suggest something like SmallDataframeMode
. This can be set to None
by default, and users can set it to "pandas"
or some other option in the future (we may have some other single node options coming).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@devin-petersohn, do you think VanillaPandasMode
is a good option? Also, why do you think we should make this config of string
type to have choices None
/pandas
/etc.
? Wouldn't it be sufficient to have this config boolean - enable/disable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the future we may add polars
mode. If this happens, we might also want to have an option for that. Making it a string keeps it open to other options. If we have pandas in the name, we can only use that mode for pandas execution. I'm open to other names, but I think we don't want to keep adding more and more configs if we have more options later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't this sound like we may have multiple storage formats for a single execution? Do we really want to support this in future?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potentially, yes I think this is something we could support in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@devin-petersohn, do you think we could support automatic initialization with small qc depending on a data size threshold in future?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I propose to rename UsePlainPandasQueryCompiler
to NativeDataframeMode
and SmallQueryCompiler
to NativeQueryCompiler
by sort of analogy with HdkOnNative
we had previously.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At a minimum, a more complete definition of this class in the docstring is required.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will update the name to UsePlainPandasQueryCompiler to NativeDataframeMode and SmallQueryCompiler to NativeQueryCompiler.
The most changes in tests are disabling few checks as it wont be supported without partitions, and as the current changes dont yet support IO like pd.read_csv(), Is there something specific that should be avoided? |
Nothing specific, I was just trying to understand context. Thanks! |
e6b035f
to
d406414
Compare
f1eff14
to
8bc38b8
Compare
@arunjose696 please rebase on main |
Signed-off-by: arunjose696 <arunjose696@gmail.com>
Signed-off-by: arunjose696 <arunjose696@gmail.com>
Signed-off-by: Igoshev, Iaroslav <iaroslav.igoshev@intel.com>
…rialized_dtypes to query compiler layer as in the code in multiple places the methods of private _modin_frame were used
Signed-off-by: Igoshev, Iaroslav <iaroslav.igoshev@intel.com>
f917af6
to
eae7f93
Compare
from modin.experimental.core.storage_formats.pandas.native_query_compiler import ( | ||
NativeQueryCompiler, | ||
) |
Check notice
Code scanning / CodeQL
Cyclic import Note
modin.experimental.core.storage_formats.pandas.native_query_compiler
from pandas.core.dtypes.common import is_list_like, is_scalar | ||
|
||
from modin.config.envvars import NativeDataframeMode | ||
from modin.core.storage_formats.base.query_compiler import BaseQueryCompiler |
Check notice
Code scanning / CodeQL
Cyclic import Note
modin.core.storage_formats.base.query_compiler
# if len(df.columns) == 1 and df.columns[0] == "__reduced__": | ||
# df = df["__reduced__"] |
Check notice
Code scanning / CodeQL
Commented-out code Note
0a82d11
to
ee5d15a
Compare
@@ -4574,6 +4574,17 @@ def frame_has_dtypes_cache(self) -> bool: | |||
""" | |||
return self._modin_frame.has_dtypes_cache | |||
|
|||
def has_dtypes_cache(self) -> bool: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be removed, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed
def has_materialized_dtypes(self): | ||
""" | ||
Check if the undelying modin frame has materialized dtypes | ||
|
||
Returns | ||
------- | ||
bool | ||
True if if the undelying modin frame and False otherwise. | ||
""" | ||
return self._modin_frame.has_materialized_dtypes | ||
|
||
def set_frame_dtypes_cache(self, dtypes): | ||
""" | ||
Set dtypes cache for the underlying modin frame. | ||
|
||
Parameters | ||
---------- | ||
dtypes : pandas.Series, ModinDtypes, callable or None | ||
""" | ||
self._modin_frame.set_dtypes_cache(dtypes) | ||
|
||
def has_dtypes_cache(self) -> bool: | ||
""" | ||
Check if the dtypes cache exists for the underlying modin frame. | ||
|
||
Returns | ||
------- | ||
bool | ||
""" | ||
return self._modin_frame.has_dtypes_cache |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
.github/workflows/ci.yml
Outdated
- run: python -m pytest modin/tests/pandas/dataframe/test_reduce.py | ||
- run: python -m pytest modin/tests/pandas/dataframe/test_udf.py | ||
- run: python -m pytest modin/tests/pandas/dataframe/test_window.py | ||
- uses: codecov/codecov-action@v2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
?
- uses: codecov/codecov-action@v2 | |
- uses: ./.github/actions/upload-coverage |
modin/pandas/series.py
Outdated
@@ -144,7 +144,6 @@ def __init__( | |||
name = MODIN_UNNAMED_SERIES_LABEL | |||
if isinstance(data, pandas.Series) and data.name is not None: | |||
name = data.name | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please revert
@@ -87,7 +88,11 @@ | |||
) | |||
def test_ops_defaulting_to_pandas(op, make_args): | |||
modin_df = pd.DataFrame(test_data_diff_dtype).drop(["str_col", "bool_col"], axis=1) | |||
with warns_that_defaulting_to_pandas(): | |||
with ( | |||
warns_that_defaulting_to_pandas() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's better to update warns_that_defaulting_to_pandas
itself to work with NativeDataframeMode
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated warns_that_defaulting_to_pandas to return nullcontext in case of NativeDataframeMode
With the introduction of the small query compiler, we need to test the interoperability between DataFrames using different query compilers. For example, performing a binary operation between a DataFrame with the small query compiler and another with the Pandas query compiler. (Note: This feature is not yet included in this PR.) This will require modifying or adding new tests. In the current tests in the 1)Derived DataFrames: In tests where the second DataFrame is created or derived from the first, egtest_join_empty, we need to refactor these tests so that the second DataFrame is created separately from the first and with MODIN_NATIVE_DATAFRAME_MODE set. 2)Lambda Functions: In tests where the other DataFrame is created within a lambda function, eg test___divmod__, we need to refactor these tests to either create the second DataFrame in the test definition itself or provide an additional wrapper for the lambda functions to ensure the DataFrame is created with a different query compilers. 3)Separate DataFrames: In tests where two separate DataFrames are used, eg test_where, we need to refactor these tests to include flipping the MODIN_NATIVE_DATAFRAME_MODE to None and Native_pandas when creating both the first and second DataFrame. This ensures that both the left and right operands are tested with different query compilers for interoperability. This flipping would also be required in cases mentioned in 1 and 2 after dataframes are separated. Upon reviewing the @YarShev @anmyachev @devin-petersohn, could you please provide suggestions on how to approach testing the interoperability? |
What do these changes do?
flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
git commit -s
docs/development/architecture.rst
is up-to-date