
ENH: sum() should default to numeric_only=True #58234

Closed
1 of 3 tasks
travisturenne opened this issue Apr 12, 2024 · 6 comments
Labels
API Design · Closing Candidate (May be closeable, needs more eyeballs) · Enhancement · Needs Discussion (Requires discussion from core team before further action)

Comments

@travisturenne

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

For exploratory analysis, having .sum() apply by default to every field coming out of a groupby (including text fields) is super annoying for much of the core community. The current behavior concatenates all string fields outside of the groupby. Not only is this an edge case (under what circumstances would you want to concatenate every single string value in a text field with no separator?), it is also computationally expensive. Many IDEs hosting Jupyter notebooks crash when .sum() is used this way, forcing the user to restart their Jupyter notebook instance.
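The behavior being complained about can be reproduced with a small frame (a minimal sketch with invented column names; this reflects the default `numeric_only=False` behavior as of pandas 2.x):

```python
import pandas as pd

# Hypothetical example: one grouping key, one text field, one numeric field.
df = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-01", "2024-01-02"],
    "city": ["NY", "LA", "SF"],
    "sales": [10, 20, 30],
})

# With the current default, the string column is concatenated with no
# separator instead of being dropped: 'city' for 2024-01-01 becomes "NYLA".
out = df.groupby("date").sum()
print(out)
```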

Is it more "pythonic"? Arguably. Sure, the + operator concatenates strings in core Python, but "sum" is a human word that means

the aggregate of two or more numbers, magnitudes, quantities, or particulars as determined by or as if by the mathematical process of addition

A sum is not two strings concatenated.

Perhaps this could be reverted to help out the majority of analysts/data scientists who just want a convenient tool that works as humans expect, rather than an idealistic pure-Python implementation that exists just to satisfy a textbook.

Feature Description

.sum(numeric_only=True) is the default setting for .sum()
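Under the proposed default, non-numeric columns would simply be excluded, which is what passing numeric_only=True already does today (a sketch with made-up data):

```python
import pandas as pd

df = pd.DataFrame({"city": ["NY", "LA"], "sales": [10, 20]})

# Only the numeric column survives; the string column is excluded
# from the result entirely.
out = df.sum(numeric_only=True)
print(out)
```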

Alternative Solutions

remove numeric_only as an option for .sum() and create a string_concat option, or require users to use a lambda function for string concatenation.

e.g. df.groupby(['date']).string_concat(separator=',')
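Note that string_concat is hypothetical; no such method exists in pandas. Roughly the same separator-aware result can already be had explicitly through agg (a sketch with invented data):

```python
import pandas as pd

df = pd.DataFrame({"date": ["d1", "d1", "d2"], "city": ["NY", "LA", "SF"]})

# Explicit, opt-in string concatenation per group, with a chosen separator.
out = df.groupby("date")["city"].agg(",".join)
print(out)
# d1    NY,LA
# d2    SF
```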

Additional Context

No response

@travisturenne travisturenne added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 12, 2024
@asishm
Contributor

asishm commented Apr 12, 2024

You may want to read through #46072

Do you have any data that backs this up?

Perhaps this could be reverted to help out the majority of analysts/data scientists that just want a convenient tool that works as expected by humans and not an idealistic pure python implementation that exists just to satisfy a textbook.

@travisturenne
Author

travisturenne commented Apr 12, 2024

You may want to read through #46072

Wouldn't it have made more sense to treat timedelta as numeric rather than appending every string in a dataframe upon invoking sum()?

@rhshadrach
Member

rhshadrach commented Apr 12, 2024

Thanks for raising this issue!

Many IDEs for Jupyter notebooks crash when .sum() is used, which requires the user to restart their Jupyter notebook instance.

Can you provide an example that causes your notebook to crash?

When users see code:

df[["A", "B"]].sum()

I think it's reasonable for them to expect the result contains two rows indexed by "A" and "B". This would not be the case if we used numeric_only=True as the default: data would be dropped without any indication from the code. On the other hand:

df[["A", "B"]].sum(numeric_only=True)

makes it very clear that columns may be dropped.

In general, I am against the silent dropping of data, and I think that applies here.
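To illustrate the silent-dropping concern (a sketch with invented column names):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": ["x", "y"]})

full = df.sum()                      # index has both "A" and "B" ("B" is "xy")
trimmed = df.sum(numeric_only=True)  # "B" is dropped with no warning

# Code written against the full result breaks silently under the
# proposed default: trimmed["B"] would raise a KeyError.
print(full)
print(trimmed)
```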

@rhshadrach rhshadrach added API Design Needs Discussion Requires discussion from core team before further action and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 12, 2024
@travisturenne
Author

Concatenating strings is a very time-consuming process, and in VS Code, terminating these processes often causes problems with the kernel.

In general, I am against the silent dropping of data, and I think that applies here.

This is probably a good general principle, but other than in the bug thread this issue helps resolve (which could be resolved more efficiently by treating timedeltas as numeric), what is the exact use case for concatenating all strings in a DataFrame? Who would ever want that? Imagine you're looking at a DataFrame with a million rows. When would I ever need to concatenate all the strings in the 'City', 'Employee', or 'Department' field to create a multimillion-character string with no separator? Whereas in the past .sum() would have quickly reported the total of the summable values of such a DataFrame, it now attempts an unexpected concatenation that can take days to execute.

Why not just return a type error here?
ValueError: could not convert string to float: 'useless_string'

@rhshadrach
Member

what is the exact use case of concatenating all strings in a dataframe

I've used it to condense a DataFrame into a summary, e.g.

import pandas as pd

df = pd.DataFrame(
    {
        "route1": ["A", "B", "C"],
        "route2": ["A", "D", "B"],
    }
)
# Append an arrow to every cell, concatenate down each column,
# then trim the trailing " -> ".
print((df + " -> ").sum().str[:-4])
# route1    A -> B -> C
# route2    A -> D -> B
# dtype: object

Why not just return a type error here?

Currently by default, pandas stores strings as a NumPy object dtype. I do not think we should differ in behavior from NumPy here.

import numpy as np

arr = np.array(["a", "b", "c"], dtype=object)
print(arr.sum())
# abc

In addition, I believe there is no way for us to know that a column of dtype object contains all strings without inspecting every value.
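That value-by-value inspection is exactly what pandas.api.types.infer_dtype does, which is why determining "this object column holds only strings" cannot be a free metadata check (a small sketch):

```python
import pandas as pd
from pandas.api.types import infer_dtype

# An object-dtype column carries no guarantee about its contents;
# infer_dtype must scan the elements to conclude they are all strings.
s = pd.Series(["a", "b", "c"], dtype=object)
print(infer_dtype(s))  # 'string'
```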

On the other hand, with string[python] or string[pyarrow], you do indeed get a TypeError:

ser = pd.Series(["a", "b", "c"], dtype="string[python]")
ser.sum()  # raises TypeError

# or

ser = pd.Series(["a", "b", "c"], dtype="string[pyarrow]")
ser.sum()  # raises TypeError

@rhshadrach rhshadrach added the Closing Candidate May be closeable, needs more eyeballs label Apr 19, 2024
@mroeschke
Member

Yeah, agreed that a default that automatically drops columns would be surprising and not a great user experience. Thanks for the suggestion, but closing.
