Enh/group by.transform should accept similar arguments to group by.agg #58773

abeltavares · 2024-05-18T17:00:18Z

closes ENH: GroupBy.transform should accept similar arguments to GroupBy.agg #58318
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

mroeschke · 2024-05-20T17:01:29Z

pandas/core/groupby/generic.py

+
+    @staticmethod
+    def _named_agg_to_dict(*named_aggs: tuple[str, NamedAgg]) -> dict[str, NamedAgg]:
+        valid_items = [


Could you inline this function since it's only used once in this file?

Yeah, simple is best, will inline it.

mroeschke · 2024-05-20T17:02:06Z

pandas/core/groupby/generic.py

+            # Convert named aggregations to dictionary format
+            transformed_func = self._named_agg_to_dict(*kwargs.items())
+            kwargs = {}
+            if isinstance(transformed_func, dict):


Why is this if statement needed?

I guess this check is redundant, will remove it.

mroeschke · 2024-05-20T17:02:50Z

pandas/core/groupby/generic.py

+                )
+                results.append(result)
+                col_names.extend([(col, name) for col in result.columns])
+            output = concat(results, axis=1)


Suggested change

-            output = concat(results, axis=1)
+            output = concat(results, ignore_index=True, axis=1)

Doesn't concatenating the results using concat with axis=1 align the columns correctly without needing to reset the index? The original index of the DataFrame will be preserved, right?

The resulting .index should be preserved, but since you're overwriting .columns two lines down, we don't need concat to come up with the merged .columns result; it can do less work by just returning a RangeIndex.

Yeah, makes sense, understood.
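
For readers following the thread, here is a minimal sketch of the point above, using made-up data rather than the PR's code. With axis=1, ignore_index affects the column labels only; the row index is carried through unchanged.

```python
import pandas as pd

# Two per-function results that share the same row index (hypothetical data).
idx = pd.Index([10, 11, 12], name="row")
left = pd.DataFrame({"val": [1, 1, 2]}, index=idx)
right = pd.DataFrame({"val": [0, 0, 2]}, index=idx)

# ignore_index applies to the concatenation axis. With axis=1 that is the
# columns, so the column labels become 0, 1, ... and the row index is kept.
output = pd.concat([left, right], ignore_index=True, axis=1)

print(list(output.columns))
print(output.index.equals(idx))
```

Since the columns are reassigned right afterwards, the positional labels are never seen by the caller.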

mroeschke · 2024-05-20T17:03:50Z

pandas/core/groupby/generic.py

+                results.append(result)
+                col_names.extend([(col, name) for col in result.columns])
+            output = concat(results, axis=1)
+            output.columns = MultiIndex.from_tuples(col_names)


Is it possible to use from_arrays or use the MultiIndex constructor instead? from_tuples is slow and needs to do inference on the data types

Nice to know, will go with from_arrays.
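
A small sketch of the swap being discussed, assuming col_names is a non-empty list of (column, function name) tuples as in the diff; the values here are invented for illustration. Transposing the tuples with zip gives one list per level, which from_arrays accepts directly.

```python
import pandas as pd

# Hypothetical (column, function name) pairs in input order.
col_names = [("val", "sum"), ("val", "min"), ("another", "sum")]

from_tuples = pd.MultiIndex.from_tuples(col_names)

# zip(*col_names) transposes the list of tuples into one sequence per level.
arrays = [list(level) for level in zip(*col_names)]
from_arrays = pd.MultiIndex.from_arrays(arrays)

# Both constructors describe the same labels, in the same (unsorted) order.
print(from_arrays.equals(from_tuples))
print(list(from_arrays) == col_names)
```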

mroeschke · 2024-05-20T17:04:55Z

pandas/core/groupby/generic.py

+                col_names.extend([(col, name) for col in result.columns])
+            output = concat(results, axis=1)
+            output.columns = MultiIndex.from_tuples(col_names)
+            output.sort_index(axis=1, level=[0], sort_remaining=False, inplace=True)


Can you avoid inplace=True?

Yeah, I will just return a new sorted DataFrame.

rhshadrach · 2024-05-20T20:15:14Z

pandas/core/groupby/generic.py

+            transformed_func = {
+                name: aggfunc
+                for name, aggfunc in kwargs.items()
+                if not isinstance(aggfunc[1], DataFrame)
+            }


In what case is this a DataFrame?

Yeah, I guess aggfunc should always be a callable or a string, so that check is not needed.

rhshadrach · 2024-05-20T20:16:50Z

pandas/core/groupby/generic.py

+                for name, aggfunc in kwargs.items()
+                if not isinstance(aggfunc[1], DataFrame)
+            }
+            kwargs = {}


nit: can just not pass kwargs below.

rhshadrach · 2024-05-20T20:18:51Z

pandas/core/groupby/generic.py

+        if isinstance(func, dict):
+            for name, named_agg in func.items():
+                column_name = named_agg.column
+                agg_func = named_agg.aggfunc
+                result = self._transform_single_column(


Can we also support, e.g., .transform({"a": "sum", "b": "mean"})?

Just to clarify, @rhshadrach: would that do the same thing as .transform(["sum", "min"]), but with column names "a" and "b" instead of "sum" and "min"?

No, .transform(["sum", "min"]) would also act on other columns if there are any; .transform({"a": "sum", "b": "mean"}) will only act on columns a and b.

Understood, added the functionality, including tests.
I also made some changes to the list-like option.
The behavior when used with more columns was wrong:

it was doing the above instead of:

It's OK now; I also added this case to the tests.
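
To make the list versus dict distinction concrete, here is a sketch that relies only on the existing agg behaviour and on per-column transform calls; it does not use the list or dict forms of transform that this PR proposes. The frame and the names gb, by_list, by_dict, and emulated are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"key": list("aab"), "a": [1, 2, 3], "b": [4.0, 5.0, 6.0], "c": [7, 8, 9]}
)
gb = df.groupby("key")

# A list applies every function to every non-key column.
by_list = gb.agg(["sum", "min"])
print(list(by_list.columns))

# A dict applies each function only to the column named by its key.
by_dict = gb.agg({"a": "sum", "b": "mean"})
print(list(by_dict.columns))

# Emulating the requested dict form of transform with per-column calls:
# one transformed Series per key, glued together column-wise.
spec = {"a": "sum", "b": "mean"}
emulated = pd.concat(
    {col: gb[col].transform(func) for col, func in spec.items()}, axis=1
)
print(emulated)
```

Column c is untouched by the dict form, which is the difference from the list form described above.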

rhshadrach · 2024-05-20T20:23:17Z

pandas/core/groupby/generic.py

+            col_names = []
+            columns = [com.get_callable_name(f) or f for f in func]
+            func_pairs = zip(columns, func)
+            for idx, (name, func_item) in enumerate(func_pairs):


idx is unused?

Yeah, actually, I will remove it.

rhshadrach · 2024-05-20T20:25:45Z

pandas/core/groupby/generic.py

+            output = concat(results, ignore_index=True, axis=1)
+            arrays = [list(x) for x in zip(*col_names)]
+            output.columns = MultiIndex.from_arrays(arrays)
+            output = output.sort_index(axis=1, level=[0], sort_remaining=False)


I think we want the order to reflect that of the input. Why sort?

I guess it's not actually needed; I will remove that logic.

rhshadrach · 2024-05-20T20:34:43Z

pandas/core/groupby/generic.py

+        data = self._obj_with_exclusions[column_name]
+        groupings = self._grouper.groupings
+        result = data.groupby(groupings).transform(


Can you use self._gotitem instead of recreating the groupby object? Also, I think this function will return a DataFrame when there are duplicate column names. Can you add tests for the duplicate column name cases?

I will go with self._gotitem instead, and I have added the additional test.

rhshadrach

@abeltavares - force pushing to this branch makes GitHub lose track of the changes that have been reviewed, so the reviewer has to review everything again. Please don't force push if at all possible.

rhshadrach

There are two other cases we will need to handle to achieve feature parity with the agg equivalent:

dict of lists, e.g. {"a": ["sum", "min"], "b": "max"}
SeriesGroupBy.transform with a list

While it'd be great to implement here, I'd be okay with raising a NotImplementedError for these cases - but we should have a clear error message that these are intended to be implemented in the future.
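
One possible shape for such a guard, as a sketch only: check_transform_spec and its is_series_groupby flag are hypothetical names, and the message wording is an assumption, not text from the PR.

```python
def check_transform_spec(func, is_series_groupby=False):
    """Hypothetical helper: reject argument shapes not handled yet."""
    # SeriesGroupBy.transform with a list of functions.
    if is_series_groupby and isinstance(func, list):
        raise NotImplementedError(
            "SeriesGroupBy.transform with a list of functions is not yet "
            "supported; this is intended to be implemented in the future."
        )
    # Dict of lists, e.g. {"a": ["sum", "min"], "b": "max"}.
    if isinstance(func, dict):
        for column, spec in func.items():
            if isinstance(spec, (list, tuple)):
                raise NotImplementedError(
                    f"transform with a list of functions for column "
                    f"{column!r} is not yet supported; this is intended "
                    f"to be implemented in the future."
                )
    return func


# Accepted: one function per column.
check_transform_spec({"a": "sum", "b": "mean"})

# Rejected with a clear message: dict of lists.
try:
    check_transform_spec({"a": ["sum", "min"], "b": "max"})
except NotImplementedError as err:
    print(err)
```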

rhshadrach · 2024-05-28T22:01:01Z

pandas/core/groupby/generic.py

+            keys_list = list(self.keys) if isinstance(self.keys, list) else [self.keys]
+            for column in self.obj.columns:
+                if column in keys_list:
+                    continue


Iterate over self._obj_with_exclusions instead. Then you don't need keys_list.

rhshadrach · 2024-05-28T22:08:05Z

pandas/core/groupby/generic.py

+        engine_kwargs: dict | None = None,
+        **kwargs,
+    ) -> Series:
+        data = self._gotitem(column_name, ndim=1)


This will fail if the input has duplicate column names. I would be okay with not supporting duplicate columns here (and raising a clear error message), but this would need further support from other maintainers. The other option is to make this work on duplicate column names.
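
The underlying issue can be reproduced with plain column selection, independent of the PR's code. The guard function below, select_single_column, is a hypothetical illustration of the "raise a clear error" option, not something taken from the diff.

```python
import pandas as pd

# A frame with a duplicated column label (made-up data).
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["key", "val", "val"])

# Selecting a duplicated label yields a DataFrame, not a Series.
selected = df["val"]
print(type(selected).__name__, selected.shape)


def select_single_column(frame, column_name):
    """Hypothetical guard: refuse frames with duplicate column labels."""
    if not frame.columns.is_unique:
        raise ValueError(
            "transform with per-column functions does not support "
            "duplicate column names"
        )
    return frame[column_name]


try:
    select_single_column(df, "val")
except ValueError as err:
    print(err)
```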

rhshadrach · 2024-05-28T22:11:36Z

pandas/tests/groupby/transform/test_transform.py

+def test_transform_with_list_like():
+    df = DataFrame({"col": list("aab"), "val": range(3), "another": range(3)})


Can you add references to each test, e.g.

def test_transform_with_list_like(): # GH#58318
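
As an illustration of the requested convention, here is a sketch of a test carrying the issue reference. It exercises the existing string form of transform rather than the new list-like behaviour, and the test name is hypothetical.

```python
import pandas as pd


def test_transform_with_string_func():  # GH#58318
    df = pd.DataFrame({"col": list("aab"), "val": range(3), "another": range(3)})

    result = df.groupby("col").transform("sum")

    # Group "a" holds rows 0 and 1 (0 + 1 = 1); group "b" holds row 2.
    expected = pd.DataFrame({"val": [1, 1, 2], "another": [1, 1, 2]})
    pd.testing.assert_frame_equal(result, expected)


test_transform_with_string_func()
```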

abeltavares requested a review from rhshadrach as a code owner May 18, 2024 17:00

abeltavares force-pushed the ENH/GroupBy.transform-should-accept-similar-arguments-to-GroupBy.agg branch 10 times, most recently from f49fa6d to e8ae75f Compare May 20, 2024 13:11

mroeschke reviewed May 20, 2024

mroeschke added Groupby Apply Apply, Aggregate, Transform labels May 20, 2024

abeltavares force-pushed the ENH/GroupBy.transform-should-accept-similar-arguments-to-GroupBy.agg branch 3 times, most recently from e63be2e to e340efb Compare May 20, 2024 20:24

rhshadrach requested changes May 20, 2024

abeltavares force-pushed the ENH/GroupBy.transform-should-accept-similar-arguments-to-GroupBy.agg branch from e340efb to f2abbdf Compare May 21, 2024 07:37

abeltavares requested a review from rhshadrach May 21, 2024 07:38

rhshadrach reviewed May 22, 2024

abeltavares added 4 commits May 23, 2024 20:48

Add Enhancement to latest vX.X.X.rst

ec778ce

Implementation

e39d8a4

Tests

a1b1d3d

add dictionary option and fix list issue

78b6e90

abeltavares force-pushed the ENH/GroupBy.transform-should-accept-similar-arguments-to-GroupBy.agg branch from 5d02596 to 78b6e90 Compare May 23, 2024 19:48

abeltavares added 2 commits May 23, 2024 21:58

fix mypy

f90a3e5

ignore_index

324f38c

abeltavares requested review from rhshadrach and mroeschke May 23, 2024 22:31

remove accidental file

888e2ea

rhshadrach requested changes May 28, 2024
