Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: DateTimeIndex.is_year_start unexpected behavior when constructed with freq 'MS' date_range (#57377) #57494

Merged
merged 16 commits into from
Jun 8, 2024

Conversation

mattheeter
Copy link
Contributor

@mattheeter mattheeter commented Feb 19, 2024

  • closes BUG: DatetimeIndex.is_year_start returns incorrect array if .freq == 'MS' #49606

  • Tests added and passed if fixing a bug or adding a new feature

  • All code checks passed.

  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

  • A DateTimeIndex created via date_range with the freq parameter as 'MS' would yield a wrong is_year_start (and later it was discovered is_quarter_start) accessor. In pandas/_libs/tslibs/fields.pyx, the way in which the start and end months were set was one way for freq 'MS', 'QS', and 'YS' and another way for other offsets. However, only the latter two had the necessary properties to set start and end this way. As such, removing 'MS' from the list fixed the issue.

Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Mar 25, 2024
@mroeschke
Copy link
Member

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

@mroeschke mroeschke closed this Mar 25, 2024
@MarcoGorelli
Copy link
Member

wait sorry, I think this is my fault for not having got round to reviewing it

reopening for now, hoping to get to it at some point before 3.0

@natmokval fancy taking a look too?

@MarcoGorelli MarcoGorelli reopened this Mar 25, 2024
@natmokval
Copy link
Contributor

natmokval commented Mar 25, 2024

Thank you for working on this @mattheeter.

It looks good to me. I think the suggested changes will fix the bug in is_year_start for MonthBegin. We will get the correct result for freq="MS":

dr = pd.date_range("2017-12-01", periods=2, freq="MS")
print(dr) # DatetimeIndex(['2017-12-01', '2018-01-01'], dtype='datetime64[ns]', freq='MS')
dr_attr = dr.is_year_start
print(dr_attr) # [False  True]

These changes won't fix a problem with some other frequencies such as: freq="QS-NOV":

dr = pd.date_range("2017-07-01", periods=2, freq="QS-NOV")
print(dr) # DatetimeIndex(['2017-08-01', '2017-11-01'], dtype='datetime64[ns]', freq='QS-NOV')
dr_comp = [dt.is_year_start for dt in dr]
dr_attr = dr.is_year_start
print(dr_attr) # [False  True]
print(dr_comp) # [False, False]

It may not be necessary to correct this in the current pull request.

@mattheeter
Copy link
Contributor Author

Of course! @natmokval

Yes, the original issue did not have the QS freq mentioned. I did point it out on the issue, and went ahead with what I thought would fix it there. I think that when I tested with other types of QS (e.g. November) I assumed that these should set the year start attribute to that of the quarter start, making me miss the fact that it was incorrect.

Should I continue to work on the QS bug, or just remove any changes that attempted to fix that one since the issue didn't reference it?

@natmokval
Copy link
Contributor

I think it's okay to leave the fix for "MS" in this PR and open a new PR to work on the problem for frequencies such as "QS-NOV".

This PR is still Stale. It can take a couple of days to change its status to active.

@MarcoGorelli MarcoGorelli removed the Stale label Apr 9, 2024
@mattheeter
Copy link
Contributor Author

Hi @natmokval, I was wondering if you knew anything about the Numpy Dev tests that are failing? I know when I submitted this PR a while ago there were no issues with unit tests, but I just got some time to continue working on this but haven't made any code changes since.

@mroeschke mroeschke added the Frequency DateOffsets label Apr 23, 2024
Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for your PR - I think a more general fix is needed here, rather than papering over already-broken logic

EDIT: I've made a separate issue for the other problem: #58523

Comment on lines 256 to 258
if (freqstr[0:2] in ["MS", "QS", "YS"]) or (
freqstr[1:3] in ["MS", "QS", "YS"]):
# According to the above comment, the first conditional is for QS and YS only
if (freqstr[0:2] in ["QS", "YS"]) or (
freqstr[1:3] in ["QS", "YS"]):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this logic is already broken on main - it seems to assume that n is only a single digit

and so it produces this kind of nonsense:

In [39]: dr = pd.date_range("2017-01-01", periods=2, freq="7YS")

In [40]: dr
Out[40]: DatetimeIndex(['2017-01-01', '2024-01-01'], dtype='datetime64[ns]', freq='7YS-JAN')

In [41]: dr.is_year_start
Out[41]: array([ True,  True])

In [42]: dr = pd.date_range("2017-01-01", periods=2, freq="10YS")

In [43]: dr
Out[43]: DatetimeIndex(['2017-01-01', '2027-01-01'], dtype='datetime64[ns]', freq='10YS-JAN')

In [44]: dr.is_year_start
Out[44]: array([False, False])

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wow, looks like this broken logic was introduced 13 years ago... 😭

https://github.com/pandas-dev/pandas/pull/4823/files

Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mattheeter for your PR, thanks for your patience here, and sorry that this took a while

Looks good to me. I've added a hypothesis test too to make sure

@MarcoGorelli MarcoGorelli added this to the 3.0 milestone Jun 6, 2024
@MarcoGorelli MarcoGorelli merged commit 81a44fa into pandas-dev:main Jun 8, 2024
47 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Frequency DateOffsets
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: DatetimeIndex.is_year_start returns incorrect array if .freq == 'MS'
4 participants