
BUG: Fix bad datetime64[ns] array to str dtype conversion in Series ctor which causes odd read_csv behaviour #57937

Open · wants to merge 8 commits into base: main

Conversation

ruimamaral
Contributor

@ruimamaral ruimamaral commented Mar 20, 2024

Dates were being converted to nanosecond timestamps by numpy's ndarray.astype when converting an M8[ns] array to object during ensure_string_array. By wrapping the array with a DatetimeArray EA, we make sure that the dates get nicely converted in ensure_string_array.

Also added a check to ensure_string_array, since it was converting NA values while disregarding the convert_na_value argument, which caused missing dates to come out as NaN instead of NaT.
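A minimal sketch of the difference the wrapping makes (illustrative only; the real conversion happens inside ensure_string_array, not in user code):

```python
import numpy as np
import pandas as pd

arr = np.array(["2024-03-17", "2024-03-18"], dtype="M8[ns]")

# Without wrapping, the ns-precision ndarray falls through to
# np.asarray(..., dtype=object), which yields plain integer timestamps.
unwrapped = np.asarray(arr, dtype=object)

# Wrapped as a DatetimeArray EA, astype(str) stringifies each value,
# mirroring the EA branch of ensure_string_array.
wrapped = pd.array(arr).astype(str).astype(object)

print(unwrapped[0])  # a plain int (nanoseconds since the epoch)
print(wrapped[0])    # a date string beginning "2024-03-17"
```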

@ruimamaral ruimamaral force-pushed the #57512-bad-datetime-str-conversion-in-series-ctor branch from f9d4161 to 9bde51b on March 20, 2024 21:52
@mroeschke mroeschke added the Dtype Conversions Unexpected or buggy dtype conversions label Apr 9, 2024
@ruimamaral ruimamaral force-pushed the #57512-bad-datetime-str-conversion-in-series-ctor branch from acc0370 to 654714d on April 15, 2024 21:26
@@ -795,6 +795,7 @@ def _try_cast(
     shape = arr.shape
     if arr.ndim > 1:
         arr = arr.ravel()
+    arr = ensure_wrapped_if_datetimelike(arr)
Member

why is this necessary? seems weird to do this cast before calling ensure_string_array

Member

does the test added in this PR address the comment on L792?

Contributor Author

ensure_string_array treats EAs specially, and in this case both checks on lines 748 and 750 would pass, and the array would get converted to object by two consecutive astype calls: first to string and finally to object, which works as expected.

However, if we do not cast it to an EA, the check on line 748 (in ensure_string_array) will fail, which leads to the array being cast directly to object dtype by np.asarray on line 760:

pandas/pandas/_libs/lib.pyx

Lines 748 to 760 in bb0fcc2

if hasattr(arr, "to_numpy"):
    if hasattr(arr, "dtype") and arr.dtype.kind in "mM":
        # dtype check to exclude DataFrame
        # GH#41409 TODO: not a great place for this
        out = arr.astype(str).astype(object)
        out[arr.isna()] = na_value
        return out
    arr = arr.to_numpy(dtype=object)
elif not util.is_array(arr):
    arr = np.array(arr, dtype="object")
result = np.asarray(arr, dtype="object")

This is especially problematic for M8[ns] arrays since numpy converts the dates into nanosecond timestamps (strangely this doesn't seem to happen with any other M8 dtype array, just the nanosecond one).

The resulting Series will then contain these timestamps instead of the usual YYYY-MM-DD HH:MM:SS strings, and they cannot easily be converted back to dt64.

I thought casting it to DatetimeArray was a good way of preventing this from happening, but you know better than me so I'm open to suggestions.
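The divergence between units can be seen with plain NumPy, independent of pandas (a minimal illustration; only ns-precision values lose their datetime representation when converted to object):

```python
import datetime
import numpy as np

ms = np.array(["2024-03-17"], dtype="M8[ms]")
ns = np.array(["2024-03-17"], dtype="M8[ns]")

# Coarser units convert to datetime.datetime objects...
obj_ms = np.asarray(ms, dtype=object)
# ...but nanosecond values exceed datetime.datetime's precision,
# so NumPy hands back plain integers (nanoseconds since the epoch).
obj_ns = np.asarray(ns, dtype=object)

print(type(obj_ms[0]))  # <class 'datetime.datetime'>
print(type(obj_ns[0]))  # <class 'int'>
```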

Contributor Author

> does the test added in this PR address the comment on L792?

I guess it does, would you like me to remove it?

Contributor Author

@jbrockmendel Hi, sorry to be a bother, but would it be possible to get some feedback?
Thanks!


Member

Where does the nanosecond behavior diverge from other units for M8 arrays? I see you hinting at that but I'm not sure if it has been identified already.

I do agree it is strange for only nanoseconds to diverge from the other units

Contributor Author

Yeah so, currently, whenever we cast an M8 array to string using the Series constructor we get a Series with object dtype containing stringified dates. We can recover the original timestamps from this Series easily.

Here is the described behaviour using an M8[ms] array (there is no particular reason for picking milliseconds; from what I have seen, this behaviour is observed with any precision other than nanoseconds):

>>> arr = np.array([
...     ('2024-03-17T00:00:00.000000000'),
...     ('2024-03-18T00:00:00.000000000'),
... ], dtype='M8[ms]')
>>> arr
array(['2024-03-17T00:00:00.000', '2024-03-18T00:00:00.000'],
      dtype='datetime64[ms]')
>>> s1 = pd.Series(arr, dtype=str)
>>> s1
0    2024-03-17 00:00:00
1    2024-03-18 00:00:00
dtype: object
>>> s2 = pd.Series(s1, dtype='M8[ns]')
>>> s2
0   2024-03-17
1   2024-03-18
dtype: datetime64[ns]

However, if (and only if) the array has nanosecond precision, we get a Series with object dtype containing raw nanosecond Unix timestamps instead of the usual stringified dates. The original timestamps cannot be easily recovered from the resulting Series:

>>> arr = np.array([
...     ('2024-03-17T00:00:00.000000000'),
...     ('2024-03-18T00:00:00.000000000'),
... ], dtype='M8[ns]')
>>> arr
array(['2024-03-17T00:00:00.000000000', '2024-03-18T00:00:00.000000000'],
      dtype='datetime64[ns]')
>>> s1 = pd.Series(arr, dtype=str)
>>> s1
0    1710633600000000000
1    1710720000000000000
dtype: object
>>> s2 = pd.Series(s1, dtype='M8[ns]')
>>> s2
0    1710633600000000000
1    1710720000000000000
dtype: object

This behaviour seems to stem from numpy's conversions, as I mentioned in another comment.
I am not sure whether this is intentional, but it seems a bit odd to me, and I feel like the user would not expect such a divergence.
This PR addresses this quirk and makes nanoseconds be handled the same way as the other M8 types.

    ],
    dtype=dtype,
)
result = Series(Series(dt_arr, dtype=str), dtype=dtype)
Member

I am a little unsure about this change and the issue from the OP - what is the reason for trying to have these losslessly convert between strings and datetimes? If you remove the string conversions, things work fine, right?

Contributor Author

Thanks for the feedback and apologies for the delayed response.

While investigating the original problem, I tried to find the origin of the timestamps we see in the OP, and I found out that the odd read_csv behaviour started because of some refactoring that was done to it.

I narrowed it down to the Series constructor in this refactored code, which ends up generating the timestamps when called with a dt64[ns] nparray and dtype=str:

if dtype is not None:
    new_col_dict = {}
    for k, v in col_dict.items():
        d = (
            dtype[k]
            if pandas_dtype(dtype[k]) in (np.str_, np.object_)
            else None
        )
        new_col_dict[k] = Series(v, index=index, dtype=d, copy=False)

I thought this behaviour from the Series constructor was not intended since it seems to be isolated and stands out from other combinations of types:

- For example, doing the exact same ctor call with a dt64 nparray of any precision other than nanoseconds works fine, and the resulting Series can be losslessly converted back to a dt64 Series (without the fix, the test case I added already passes whenever it runs with an M8 array that is not nanosecond precision).
- The same behaviour can be observed when calling it with a dt64[ns] nparray and dtype=object instead of str.

This seemed like the standard and it made sense to me that we could recover the dates from the resulting strings.

I thought the best solution to this would be to deal with the root cause (assuming it's unintended behaviour) instead of trying to patch read_csv to work as before.

I'm not 100% sure I understood what you meant by removing the string conversions; however, if we do not call the ctor with dtype=str, everything works as expected, since the problem only arises when we try to convert dt64[ns] dates in an nparray to string dtype using the Series ctor.
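For reference, a minimal sketch of the read_csv path under discussion (the CSV data here is hypothetical; on affected versions, combining dtype=str with parse_dates surfaces the raw timestamps):

```python
import io

import pandas as pd

csv_data = "date,value\n2024-03-17,1\n2024-03-18,2\n"

# parse_dates first yields an M8[ns] column; dtype=str then routes it
# back through the Series ctor, which is where the bad conversion
# happened before this fix.
df = pd.read_csv(io.StringIO(csv_data), parse_dates=["date"], dtype=str)
print(df["date"])
```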

Member

> I'm not 100% sure if I understood what you meant by removing the string conversions, however, if we do not call the ctor with dtype=str, everything works as expected since the problem only arises when we try to convert dt64[ns] dates in an nparray to string dtype using the Series ctor.

Yeah, ultimately I am trying to figure out what a user expects to happen when specifying dtype=str alongside parse_dates= - are they trying to get back a date or a string? I understand this may have worked previously, but on the flip side, if it's ambiguous what to expect, we may decide we want to warn / deprecate when receiving conflicting arguments like that.

Contributor Author

Ah I see, and I totally agree, I feel like it would be better if the user was at least warned about the incompatible arguments they specified.
But I'd be interested in hearing your thoughts on the Series constructor behaviour by itself, since, to me, it seems a bit unintuitive how it treats M8[ns] arrays differently from other M8 type arrays.

@ruimamaral ruimamaral requested a review from WillAyd May 24, 2024 09:36
Labels
Dtype Conversions Unexpected or buggy dtype conversions

Successfully merging this pull request may close these issues.

BUG: Unexpected read_csv parse_dates behavior
4 participants