
ENH: sum() should default to numeric_only=True #58234

Closed
1 of 3 tasks
travisturenne opened this issue Apr 12, 2024 · 6 comments
Labels
API Design · Closing Candidate (May be closeable, needs more eyeballs) · Enhancement · Needs Discussion (Requires discussion from core team before further action)

Comments

@travisturenne

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

For exploratory analysis, having .sum() apply by default to every field coming out of a groupby (including text fields) is super annoying for much of the core community. The current behavior concatenates all string fields outside of the groupby. Not only is this an edge case (under what circumstances would you want to concatenate every single string value in a text field with no separator?), it is also computationally expensive. Many IDEs hosting Jupyter notebooks crash when .sum() is used this way, forcing the user to restart their Jupyter notebook instance.
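The behavior being complained about can be reproduced with a small frame (a minimal sketch with invented column names; this reflects the default `numeric_only=False` behavior as of pandas 2.x):

```python
import pandas as pd

# Hypothetical example: one grouping key, one text field, one numeric field.
df = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-01", "2024-01-02"],
    "city": ["NY", "LA", "SF"],
    "sales": [10, 20, 30],
})

# With the current default, the string column is concatenated with no
# separator instead of being dropped: 'city' for 2024-01-01 becomes "NYLA".
out = df.groupby("date").sum()
print(out)
```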

Is it more "pythonic"? Arguably. Sure, the + operator concatenates strings in core Python, but "sum" is a human word that means

the aggregate of two or more numbers, magnitudes, quantities, or particulars as determined by or as if by the mathematical process of addition

A sum is not two strings concatenated.

Perhaps this could be reverted to help out the majority of analysts/data scientists who just want a convenient tool that works as humans expect, rather than an idealistic pure-Python implementation that exists just to satisfy a textbook.

Feature Description

.sum(numeric_only=True) is the default setting for .sum()
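Under the proposed default, non-numeric columns would simply be excluded, which is what passing numeric_only=True already does today (a sketch with made-up data):

```python
import pandas as pd

df = pd.DataFrame({"city": ["NY", "LA"], "sales": [10, 20]})

# Only the numeric column survives; the string column is excluded
# from the result entirely.
out = df.sum(numeric_only=True)
print(out)
```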

Alternative Solutions

remove numeric_only as an option for .sum() and create a string_concat option, or require users to use a lambda function for string concatenation.

e.g. df.groupby(['date']).string_concat(separator=',')
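Note that string_concat is hypothetical; no such method exists in pandas. Roughly the same separator-aware result can already be had explicitly through agg (a sketch with invented data):

```python
import pandas as pd

df = pd.DataFrame({"date": ["d1", "d1", "d2"], "city": ["NY", "LA", "SF"]})

# Explicit, opt-in string concatenation per group, with a chosen separator.
out = df.groupby("date")["city"].agg(",".join)
print(out)
# d1    NY,LA
# d2    SF
```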

Additional Context

No response

@travisturenne travisturenne added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 12, 2024
@asishm
Contributor

asishm commented Apr 12, 2024

You may want to read through #46072

Do you have any data that backs this up?

Perhaps this could be reverted to help out the majority of analysts/data scientists that just want a convenient tool that works as expected by humans and not an idealistic pure python implementation that exists just to satisfy a textbook.

@travisturenne
Author

travisturenne commented Apr 12, 2024

You may want to read through #46072

Wouldn't it have made more sense to treat timedelta as numeric rather than appending every string in a dataframe upon invoking sum()?

@rhshadrach
Member

rhshadrach commented Apr 12, 2024

Thanks for raising this issue!

Many IDEs for Jupyter notebooks crash when .sum() is used, which requires the user to restart their Jupyter notebook instance.

Can you provide an example that causes your notebook to crash?

When users see code:

df[["A", "B"]].sum()

I think it's reasonable for them to expect the result contains two rows indexed by "A" and "B". This would not be the case if we used numeric_only=True as the default: data would be dropped without any indication from the code. On the other hand:

df[["A", "B"]].sum(numeric_only=True)

makes it very clear that columns may be dropped.

In general, I am against the silent dropping of data, and I think that applies here.
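To illustrate the silent-dropping concern (a sketch with invented column names):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": ["x", "y"]})

full = df.sum()                      # index has both "A" and "B" ("B" is "xy")
trimmed = df.sum(numeric_only=True)  # "B" is dropped with no warning

# Code written against the full result breaks silently under the
# proposed default: trimmed["B"] would raise a KeyError.
print(full)
print(trimmed)
```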

@rhshadrach rhshadrach added API Design Needs Discussion Requires discussion from core team before further action and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 12, 2024
@travisturenne
Author

Concatenating strings is a very time-consuming process, and in VS Code, terminating these processes often causes problems with the kernel.

In general, I am against the silent dropping of data, and I think that applies here.

This is probably a good general principle, but other than in the bug thread this issue helps resolve (which could be resolved more efficiently by treating timedeltas as numeric), what is the exact use case for concatenating all strings in a DataFrame? Who would ever want that? Imagine you're looking at a DataFrame with a million rows. When would I ever need to concatenate all the strings in the 'City', 'Employee', or 'Department' field to create a multimillion-character string with no separator? Whereas in the past .sum() would have quickly reported the total of the summable values of such a DataFrame, it now attempts an unexpected concatenation that can take days to execute.

Why not just return a type error here?
ValueError: could not convert string to float: 'useless_string'

@rhshadrach
Member

what is the exact use case of concatenating all strings in a dataframe

I've used it to condense a DataFrame into a summary, e.g.

import pandas as pd

df = pd.DataFrame(
    {
        "route1": ["A", "B", "C"],
        "route2": ["A", "D", "B"],
    }
)
# Append an arrow to every cell, concatenate down each column,
# then trim the trailing " -> ".
print((df + " -> ").sum().str[:-4])
# route1    A -> B -> C
# route2    A -> D -> B
# dtype: object

Why not just return a type error here?

Currently by default, pandas stores strings as a NumPy object dtype. I do not think we should differ in behavior from NumPy here.

import numpy as np

arr = np.array(["a", "b", "c"], dtype=object)
print(arr.sum())
# abc

In addition, I believe there is no way for us to know that a column of dtype object contains all strings without inspecting every value.
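That value-by-value inspection is exactly what pandas.api.types.infer_dtype does, which is why determining "this object column holds only strings" cannot be a free metadata check (a small sketch):

```python
import pandas as pd
from pandas.api.types import infer_dtype

# An object-dtype column carries no guarantee about its contents;
# infer_dtype must scan the elements to conclude they are all strings.
s = pd.Series(["a", "b", "c"], dtype=object)
print(infer_dtype(s))  # 'string'
```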

On the other hand, with string[python] or string[pyarrow], you do indeed get a TypeError:

ser = pd.Series(["a", "b", "c"], dtype="string[python]")
ser.sum()  # raises TypeError

# or

ser = pd.Series(["a", "b", "c"], dtype="string[pyarrow]")
ser.sum()  # raises TypeError

@rhshadrach rhshadrach added the Closing Candidate May be closeable, needs more eyeballs label Apr 19, 2024
@mroeschke
Member

Yeah, agreed that a default that automatically drops columns would be surprising and not a great user experience. Thanks for the suggestion, but closing.
