
BUG: Fix bad datetime64[ns] array to str dtype conversion in Series ctor which causes odd read_csv behaviour #57937

Open · wants to merge 8 commits into base: main

Conversation

ruimamaral
Contributor

@ruimamaral ruimamaral commented Mar 20, 2024

Dates were being converted to nanosecond timestamps by numpy's ndarray.astype when converting an M8[ns] array to object during ensure_string_array. By wrapping the array with a DatetimeArray EA, we make sure that the dates get nicely converted in ensure_string_array.

Also added a check to ensure_string_array, since it was converting NA values while disregarding the convert_na_value argument, which caused missing dates to come out as NaN instead of NaT.
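A minimal sketch of the difference the wrapping makes (illustrative only; the real conversion happens inside ensure_string_array, not in user code):

```python
import numpy as np
import pandas as pd

arr = np.array(["2024-03-17", "2024-03-18"], dtype="M8[ns]")

# Without wrapping, the ns-precision ndarray falls through to
# np.asarray(..., dtype=object), which yields plain integer timestamps.
unwrapped = np.asarray(arr, dtype=object)

# Wrapped as a DatetimeArray EA, astype(str) stringifies each value,
# mirroring the EA branch of ensure_string_array.
wrapped = pd.array(arr).astype(str).astype(object)

print(unwrapped[0])  # a plain int (nanoseconds since the epoch)
print(wrapped[0])    # a date string beginning "2024-03-17"
```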

@ruimamaral ruimamaral force-pushed the #57512-bad-datetime-str-conversion-in-series-ctor branch from f9d4161 to 9bde51b on March 20, 2024 21:52
@mroeschke mroeschke added the Dtype Conversions Unexpected or buggy dtype conversions label Apr 9, 2024
@ruimamaral ruimamaral force-pushed the #57512-bad-datetime-str-conversion-in-series-ctor branch from acc0370 to 654714d on April 15, 2024 21:26
@@ -795,6 +795,7 @@ def _try_cast(
     shape = arr.shape
     if arr.ndim > 1:
         arr = arr.ravel()
+    arr = ensure_wrapped_if_datetimelike(arr)
Member

why is this necessary? seems weird to do this cast before calling ensure_string_array

Member

does the test added in this PR address the comment on L792?

Contributor Author

ensure_string_array treats EAs specially, and in this case both checks on lines 748 and 750 would pass, and the array would get converted to object by two consecutive astype calls: first to string and finally to object, which works as expected.

However, if we do not cast it to an EA, the check on line 748 (in ensure_string_array) will fail, which leads to the array being cast directly to object dtype by np.asarray on line 760:

pandas/pandas/_libs/lib.pyx

Lines 748 to 760 in bb0fcc2

if hasattr(arr, "to_numpy"):
    if hasattr(arr, "dtype") and arr.dtype.kind in "mM":
        # dtype check to exclude DataFrame
        # GH#41409 TODO: not a great place for this
        out = arr.astype(str).astype(object)
        out[arr.isna()] = na_value
        return out
    arr = arr.to_numpy(dtype=object)
elif not util.is_array(arr):
    arr = np.array(arr, dtype="object")
result = np.asarray(arr, dtype="object")

This is especially problematic for M8[ns] arrays since numpy converts the dates into nanosecond timestamps (strangely this doesn't seem to happen with any other M8 dtype array, just the nanosecond one).

The resulting Series will then contain these timestamps instead of the usual YYYY-MM-DD HH:MM:SS strings, and they cannot easily be converted back to dt64.

I thought casting it to DatetimeArray was a good way of preventing this from happening, but you know better than me so I'm open to suggestions.
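The divergence between units can be seen with plain NumPy, independent of pandas (a minimal illustration; only ns-precision values lose their datetime representation when converted to object):

```python
import datetime
import numpy as np

ms = np.array(["2024-03-17"], dtype="M8[ms]")
ns = np.array(["2024-03-17"], dtype="M8[ns]")

# Coarser units convert to datetime.datetime objects...
obj_ms = np.asarray(ms, dtype=object)
# ...but nanosecond values exceed datetime.datetime's precision,
# so NumPy hands back plain integers (nanoseconds since the epoch).
obj_ns = np.asarray(ns, dtype=object)

print(type(obj_ms[0]))  # <class 'datetime.datetime'>
print(type(obj_ns[0]))  # <class 'int'>
```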

Contributor Author

> does the test added in this PR address the comment on L792?

I guess it does, would you like me to remove it?

Contributor Author

@jbrockmendel Hi, sorry to be a bother, but would it be possible to get some feedback?
Thanks!


Member

Where does the nanosecond behavior diverge from other units for M8 arrays? I see you hinting at that but I'm not sure if it has been identified already.

I do agree it is strange for only nanoseconds to diverge from the other units

Contributor Author

Yeah so, currently, whenever we cast an M8 array to string using the Series constructor we get a Series with object dtype containing stringified dates. We can recover the original timestamps from this Series easily.

Here is the described behaviour using an M8[ms] array (there is no particular reason for picking milliseconds; from what I have seen, this behaviour is observed with any precision other than nanoseconds):

>>> arr = np.array([
...     ('2024-03-17T00:00:00.000000000'),
...     ('2024-03-18T00:00:00.000000000'),
... ], dtype='M8[ms]')
>>> arr
array(['2024-03-17T00:00:00.000', '2024-03-18T00:00:00.000'],
      dtype='datetime64[ms]')
>>> s1 = pd.Series(arr, dtype=str)
>>> s1
0    2024-03-17 00:00:00
1    2024-03-18 00:00:00
dtype: object
>>> s2 = pd.Series(s1, dtype='M8[ns]')
>>> s2
0   2024-03-17
1   2024-03-18
dtype: datetime64[ns]

However, if (and only if) the array has nanosecond precision, we get a Series with object dtype containing raw nanosecond Unix timestamps instead of the usual stringified dates. The original timestamps cannot be easily recovered from the resulting Series:

>>> arr = np.array([
...     ('2024-03-17T00:00:00.000000000'),
...     ('2024-03-18T00:00:00.000000000'),
... ], dtype='M8[ns]')
>>> arr
array(['2024-03-17T00:00:00.000000000', '2024-03-18T00:00:00.000000000'],
      dtype='datetime64[ns]')
>>> s1 = pd.Series(arr, dtype=str)
>>> s1
0    1710633600000000000
1    1710720000000000000
dtype: object
>>> s2 = pd.Series(s1, dtype='M8[ns]')
>>> s2
0    1710633600000000000
1    1710720000000000000
dtype: object

This behaviour seems to stem from numpy's conversions, as I mentioned in another comment.
I am not sure whether this is intentional, but it seems a bit odd to me, and I feel like the user would not expect such a divergence.
This PR addresses this quirk and makes nanoseconds be handled the same way as the other M8 types.

    ],
    dtype=dtype,
)
result = Series(Series(dt_arr, dtype=str), dtype=dtype)
Member

I am a little unsure about this change and the issue from the OP - what is the reason for trying to have these losslessly convert between strings and datetimes? If you remove the string conversions, things work fine, right?

Contributor Author

Thanks for the feedback and apologies for the delayed response.

While investigating the original problem, I tried to find the origin of the timestamps we see in the OP, and I found out that the odd read_csv behaviour started because of some refactoring that was done to it.

I narrowed it down to the Series constructor in this refactored code, which ends up generating the timestamps when called with a dt64[ns] nparray and dtype=str:

if dtype is not None:
    new_col_dict = {}
    for k, v in col_dict.items():
        d = (
            dtype[k]
            if pandas_dtype(dtype[k]) in (np.str_, np.object_)
            else None
        )
        new_col_dict[k] = Series(v, index=index, dtype=d, copy=False)

I thought this behaviour from the Series constructor was not intended since it seems to be isolated and stands out from other combinations of types:

- For example, doing the exact same ctor call with a dt64 nparray of any precision other than nanoseconds works fine, and the resulting Series can be losslessly converted back to a dt64 Series (without the fix, the test case I added already passes whenever it runs with an M8 array that is not nanosecond precision).
- The same behaviour can be observed when calling it with a dt64[ns] nparray and dtype=object instead of str.

This seemed like the standard and it made sense to me that we could recover the dates from the resulting strings.

I thought the best solution to this would be to deal with the root cause (assuming it's unintended behaviour) instead of trying to patch read_csv to work as before.

I'm not 100% sure I understood what you meant by removing the string conversions; however, if we do not call the ctor with dtype=str, everything works as expected, since the problem only arises when we try to convert dt64[ns] dates in an nparray to string dtype using the Series ctor.
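For reference, a minimal sketch of the read_csv path under discussion (the CSV data here is hypothetical; on affected versions, combining dtype=str with parse_dates surfaces the raw timestamps):

```python
import io

import pandas as pd

csv_data = "date,value\n2024-03-17,1\n2024-03-18,2\n"

# parse_dates first yields an M8[ns] column; dtype=str then routes it
# back through the Series ctor, which is where the bad conversion
# happened before this fix.
df = pd.read_csv(io.StringIO(csv_data), parse_dates=["date"], dtype=str)
print(df["date"])
```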

Member

> I'm not 100% sure if I understood what you meant by removing the string conversions, however, if we do not call the ctor with dtype=str, everything works as expected since the problem only arises when we try to convert dt64[ns] dates in an nparray to string dtype using the Series ctor.

Yeah, ultimately I am trying to figure out what a user expects to happen when specifying dtype=str alongside parse_dates= - are they trying to get back a date or a string? I understand this may have worked previously, but on the flip side, if it's ambiguous what to expect, we may decide we want to warn / deprecate when receiving conflicting arguments like that.

Contributor Author

Ah I see, and I totally agree, I feel like it would be better if the user was at least warned about the incompatible arguments they specified.
But I'd be interested in hearing your thoughts on the Series constructor behaviour by itself, since, to me, it seems a bit unintuitive how it treats M8[ns] arrays differently from other M8 type arrays.

@ruimamaral ruimamaral requested a review from WillAyd May 24, 2024 09:36
Labels
Dtype Conversions Unexpected or buggy dtype conversions

Successfully merging this pull request may close these issues.

BUG: Unexpected read_csv parse_dates behavior
4 participants