ENH: pd.series.sample when n > len #57584

Closed
2 of 3 tasks
marctorsoc opened this issue Feb 23, 2024 · 3 comments
Labels
Closing Candidate, Enhancement, Error Reporting, Series

Comments

@marctorsoc

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

Oftentimes, I have some code that samples from a df / series, as in

df.sample(sample_size, random_state=88)

and given new data, sample_size > len(df), getting

...Cannot take a larger sample than population...

I'd like a way to specify that if sample_size > len(df), I just want all elements back. This is already the case with .head(). So I don't really understand why the behaviour is not the same here.

Feature Description

I can see two possible solutions

  1. As in here, "If n is larger than the number of rows, this function returns all rows."

  2. If keeping back-compatibility is a must, then adding a parameter errors, with possible values ignore or raise (the default being raise, again to keep back-compatibility).

I'd lean towards (1), but I'd be content with (2)

Alternative Solutions

I proposed two solutions. Of course, one can always add a line

sample_size = min(len(df), sample_size)

before the call, but I honestly think pandas should provide support for this common case + being consistent with other methods e.g. head.
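For reference, a minimal sketch of that one-line clamp next to the failing call. The toy DataFrame and its price column are made up for illustration; only df.sample, len and min are real calls here.

```python
import pandas as pd

# Toy data, assumed for illustration only.
df = pd.DataFrame({"price": [5, 12, 30]})
sample_size = 10  # larger than len(df)

# Without clamping, sampling more rows than exist (replace=False) raises.
try:
    df.sample(sample_size, random_state=88)
    raised = False
except ValueError:
    raised = True

# Clamp n to the population size so every row comes back.
clamped = df.sample(min(len(df), sample_size), random_state=88)
```

Note that, unlike .head(), the clamped call still shuffles, so all rows come back but not necessarily in their original order.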

Additional Context

Happy to contribute with this feature, but first checking here. Just making sure owners would consider it, and nobody is working on this atm.

marctorsoc added the Enhancement and Needs Triage labels on Feb 23, 2024
@lithomas1
Member

IIRC the error comes from numpy.

This is what I get

  File "/Users/thomasli/pandas/pandas/core/sample.py", line 153, in sample
    return random_state.choice(obj_len, size=size, replace=replace, p=weights).astype(
  File "numpy/random/mtrand.pyx", line 1001, in numpy.random.mtrand.RandomState.choice
ValueError: Cannot take a larger sample than population when 'replace=False'

In that case, in my opinion, I think it's probably best for you to add an if check in your code before the call to sample to restrict the sample size to the length of the dataframe.

lithomas1 added the Error Reporting, Series, and Closing Candidate labels and removed the Needs Triage label on Mar 20, 2024
@marctorsoc
Author

yeah, for the cases of production code, of course I can add an if. But when exploring on a notebook with constructions like this (imagine this is not static code, but iterating with many filters on it):

(
         df
         .loc[lambda df: df.price.gt(10)]
         .loc[lambda df: df.date.lt("2023-04")]
         .sample(10)
)

Here you cannot do an if before because there's no before. I happen to do this very, very often. I guess one workaround would be to write a helper function and call it through pipe:

(
         df
         .loc[lambda df: df.price.gt(10)]
         .loc[lambda df: df.date.lt("2023-04")]
         .pipe(sample_pipe, n=10)
)

containing the if... but I continue thinking that pandas should support my use case 🤷‍♂️
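A rough sketch of what such a helper could look like. sample_pipe is not part of pandas; its name and signature are hypothetical (taken from the snippet above), and the toy data is made up for illustration.

```python
import pandas as pd


def sample_pipe(obj, n, **kwargs):
    """Hypothetical helper (not a pandas API): sample at most n rows.

    If n exceeds len(obj), every row is returned (shuffled) instead of raising.
    Extra keyword arguments are forwarded to .sample().
    """
    return obj.sample(n=min(n, len(obj)), **kwargs)


# Toy data, assumed for illustration only.
df = pd.DataFrame(
    {
        "price": [5, 12, 30, 8],
        "date": ["2023-01-15", "2023-02-01", "2023-05-10", "2023-03-03"],
    }
)

result = (
    df
    .loc[lambda df: df.price.gt(10)]
    .loc[lambda df: df.date.lt("2023-04")]
    .pipe(sample_pipe, n=10, random_state=88)
)
```

Here only one row survives the two filters, so asking for 10 returns that single row rather than raising.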

@mroeschke
Member

Yeah, especially since this is an error from numpy, I don't think it's appropriate for pandas to work around this, so closing.
