Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enh/group by.transform should accept similar arguments to group by.agg #58773

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

abeltavares
Copy link
Contributor

@abeltavares abeltavares force-pushed the ENH/GroupBy.transform-should-accept-similar-arguments-to-GroupBy.agg branch 10 times, most recently from f49fa6d to e8ae75f Compare May 20, 2024 13:11

@staticmethod
def _named_agg_to_dict(*named_aggs: tuple[str, NamedAgg]) -> dict[str, NamedAgg]:
valid_items = [
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you inline this function since it's only use once in this file?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, simple is best, will Inline it.

# Convert named aggregations to dictionary format
transformed_func = self._named_agg_to_dict(*kwargs.items())
kwargs = {}
if isinstance(transformed_func, dict):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this if statement needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this check is redundant, will remove it.

)
results.append(result)
col_names.extend([(col, name) for col in result.columns])
output = concat(results, axis=1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
output = concat(results, axis=1)
output = concat(results, ignore_index=True, axis=1)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't the concatenation of results using concat with axis=1 align the columns correctly without needing to reset the index. The original index of the DataFrame will be preserved right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The resulting .index should be preserved, but the since you're overwriting the .columns 2 lines down, we don't need concat to come up with the merged .column result and do less work by just returning a RangeIndex

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, makes sense, understood.

results.append(result)
col_names.extend([(col, name) for col in result.columns])
output = concat(results, axis=1)
output.columns = MultiIndex.from_tuples(col_names)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to use from_arrays or use the MultiIndex constructor instead? from_tuples is slow and needs to do inference on the data types

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice to know, will go with from_arrays.

col_names.extend([(col, name) for col in result.columns])
output = concat(results, axis=1)
output.columns = MultiIndex.from_tuples(col_names)
output.sort_index(axis=1, level=[0], sort_remaining=False, inplace=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you avoid inplace=False?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, i will just return a new sorted DataFrame.

@mroeschke mroeschke added Groupby Apply Apply, Aggregate, Transform labels May 20, 2024
@abeltavares abeltavares force-pushed the ENH/GroupBy.transform-should-accept-similar-arguments-to-GroupBy.agg branch 3 times, most recently from e63be2e to e340efb Compare May 20, 2024 20:24
Comment on lines 1915 to 1919
transformed_func = {
name: aggfunc
for name, aggfunc in kwargs.items()
if not isinstance(aggfunc[1], DataFrame)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In what case is this a DataFrame?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, i guess aggfunc should always be a callable or a string, so that check is not needed.

for name, aggfunc in kwargs.items()
if not isinstance(aggfunc[1], DataFrame)
}
kwargs = {}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can just not pass kwargs below.

Comment on lines 1948 to 1954
if isinstance(func, dict):
for name, named_agg in func.items():
column_name = named_agg.column
agg_func = named_agg.aggfunc
result = self._transform_single_column(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also support, e.g. .transform({"a": "sum", "b": "mean"}).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to clarify @rhshadrach
that would do the same thing as .transform(["sum", "min"])?
but with columns names "a" and "b" instead of "sum" and "min"?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, .transform(["sum", "min"]) would also act on other columns if there are any; .transform({"a": "sum", "b": "mean"}) will only act on columns a and b.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood, added the functionality including tests.
Also some changes regarding with list-like option.
The behavior when used with more columns was wrong:

image

was doing the above instead of:
image

It's ok now, added also to the tests

col_names = []
columns = [com.get_callable_name(f) or f for f in func]
func_pairs = zip(columns, func)
for idx, (name, func_item) in enumerate(func_pairs):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

idx is unused?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, actually, i will remove it

output = concat(results, ignore_index=True, axis=1)
arrays = [list(x) for x in zip(*col_names)]
output.columns = MultiIndex.from_arrays(arrays)
output = output.sort_index(axis=1, level=[0], sort_remaining=False)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we want the order to reflect that of the input. Why sort?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess not actually needed, will remove that logic.

Comment on lines 1993 to 1995
data = self._obj_with_exclusions[column_name]
groupings = self._grouper.groupings
result = data.groupby(groupings).transform(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you use self._gotitem instead of recreating the groupby object. Also, I think this function will return a DataFrame when there are duplicate column names. Can you add tests for the duplicate column name cases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will go with self._gotitem then instead and added the additional test.

@abeltavares abeltavares force-pushed the ENH/GroupBy.transform-should-accept-similar-arguments-to-GroupBy.agg branch from e340efb to f2abbdf Compare May 21, 2024 07:37
Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@abeltavares - force pushing to this branch makes GitHub lose track of the changes that have been reviewed, so the reviewer has to review everything again. Please don't force push if at all possible.

@abeltavares abeltavares force-pushed the ENH/GroupBy.transform-should-accept-similar-arguments-to-GroupBy.agg branch from 5d02596 to 78b6e90 Compare May 23, 2024 19:48
Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two other cases we will need to handle to achieve feature parity with the agg equivalent:

  • dict of lists, e.g. {"a": ["sum", "min"], "b": "max"}
  • SeriesGroupBy.transform with a list

While it'd be great to implement here, I'd be okay with raising a NotImplementedError for these cases - but we should have a clear error message that these are intended to be implemented in the future.

Comment on lines +1968 to +1971
keys_list = list(self.keys) if isinstance(self.keys, list) else [self.keys]
for column in self.obj.columns:
if column in keys_list:
continue
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Iterate over self._obj_with_exclusions instead. Then you don't need keys_list.

engine_kwargs: dict | None = None,
**kwargs,
) -> Series:
data = self._gotitem(column_name, ndim=1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will fail if the input has duplicate column names. I would be okay with not supporting duplicate columns here (and raising a clear error message), but this would need further support from other maintainers. The other option is to make this work on duplicate column names.

Comment on lines +88 to +89
def test_transform_with_list_like():
df = DataFrame({"col": list("aab"), "val": range(3), "another": range(3)})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add references to each test, e.g.

def test_transform_with_list_like():
    # GH#58318

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Apply Apply, Aggregate, Transform Groupby
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: GroupBy.transform should accept similar arguments to GroupBy.agg
3 participants