PDEP-14: Dedicated string data type for pandas 3.0 #58551

Open: 21 commits to merge into base `main`. Changes shown from 12 commits.

Commits:
fbeb69d
PDEP: Dedicated string data type for pandas 3.0
jorisvandenbossche May 3, 2024
f03f54d
small textual edits and typos
jorisvandenbossche May 3, 2024
561de87
address part of the feedback
jorisvandenbossche May 5, 2024
86f4e51
Update web/pandas/pdeps/00xx-string-dtype.md
jorisvandenbossche May 5, 2024
30c7b43
rename file
jorisvandenbossche May 13, 2024
54a43b3
expand Missing value semantics section
jorisvandenbossche May 13, 2024
5b5835b
expand Naming subsection with storage+na_value proposal
jorisvandenbossche May 13, 2024
9ede2e6
Expand Backward compatibility section + add proposal for deprecation
jorisvandenbossche May 13, 2024
f5faf4e
update timeline
jorisvandenbossche May 13, 2024
f554909
Apply suggestions from code review
jorisvandenbossche May 13, 2024
ac2d21a
Apply suggestions from code review
jorisvandenbossche May 13, 2024
82027d2
reflow after online edits
jorisvandenbossche May 13, 2024
5b24c24
Update web/pandas/pdeps/0014-string-dtype.md
jorisvandenbossche May 13, 2024
f9c55f4
Apply suggestions from code review
jorisvandenbossche May 13, 2024
2c58c4c
Fixup table (#2)
rhshadrach May 14, 2024
0a68504
Merge remote-tracking branch 'upstream/main' into pdep-string-dtype
jorisvandenbossche May 20, 2024
8974c5b
next round of updates (small text updates, add capitalized String alias)
jorisvandenbossche May 20, 2024
cca3a7f
use capitalized alias in the overview table
jorisvandenbossche May 20, 2024
d24a80a
Merge remote-tracking branch 'upstream/main' into pdep-string-dtype
jorisvandenbossche Jun 10, 2024
9c5342a
New revision: keep back compat for 'string', introduce 'str' for the …
jorisvandenbossche Jun 10, 2024
b5663cc
Apply suggestions from code review
jorisvandenbossche Jun 11, 2024
388 changes: 388 additions & 0 deletions web/pandas/pdeps/0014-string-dtype.md
@@ -0,0 +1,388 @@
# PDEP-14: Dedicated string data type for pandas 3.0

- Created: May 3, 2024
- Status: Under discussion
- Discussion: https://github.com/pandas-dev/pandas/pull/58551
- Author: [Joris Van den Bossche](https://github.com/jorisvandenbossche)
- Revision: 1

## Abstract

This PDEP proposes to introduce a dedicated string dtype that will be used by
default in pandas 3.0:

* In pandas 3.0, enable a "string" dtype by default, using PyArrow if available
or otherwise the numpy object-dtype alternative.
Contributor

For clarification, isn't the alternative not the "numpy object-dtype alternative", but rather an extension array using numpy objects as strings, with np.nan missing value semantics? You're not proposing that you still get a numpy backed array with object dtype, right?

Member Author

> You're not proposing that you still get a numpy backed array with object dtype, right?

Right, definitely not proposing that. I meant the alternative ExtensionArray using numpy object-dtype under the hood. Will need to clarify that.

* The default string dtype will use missing value semantics (using NaN) consistent
with the other default data types.

This will give users a long-awaited proper string dtype for 3.0, while 1) not
(yet) making PyArrow a _hard_ dependency, but only a dependency used by default,
and 2) leaving room for future improvements (different missing value semantics,
using NumPy 2.0 strings, etc).

## Background

Currently, pandas by default stores text data in an `object`-dtype NumPy array.
The current implementation has two primary drawbacks. First, `object` dtype is
not specific to strings: any Python object can be stored in an `object`-dtype
array, not just strings, and seeing `object` as the dtype for a column with
strings is confusing for users. Second, it is not efficient (all string
methods on a Series eventually call Python methods on the individual
string objects).

To solve the first issue, a dedicated extension dtype for string data has
already been
[added in pandas 1.0](https://pandas.pydata.org/docs/whatsnew/v1.0.0.html#dedicated-string-data-type).
This has been opt-in so far, requiring users to explicitly request the
dtype (with `dtype="string"` or `dtype=pd.StringDtype()`). The array backing
this string dtype was initially almost the same as the default implementation,
i.e. an `object`-dtype NumPy array of Python strings.

To solve the second issue (performance), pandas contributed to the development
of string kernels in the PyArrow package, and a variant of the string dtype
backed by PyArrow was
[added in pandas 1.3](https://pandas.pydata.org/docs/whatsnew/v1.3.0.html#pyarrow-backed-string-data-type).
This could be specified with the `storage` keyword in the opt-in string dtype
(`pd.StringDtype(storage="pyarrow")`).

Since its introduction, the `StringDtype` has always been opt-in, and has used
the experimental `pd.NA` sentinel for missing values (which was also [introduced
in pandas 1.0](https://pandas.pydata.org/docs/whatsnew/v1.0.0.html#experimental-na-scalar-to-denote-missing-values)).
However, to date, pandas has not yet taken the step to use `pd.NA` by
default, and thus the `StringDtype` deviates in missing value behaviour compared
to the default data types.

In 2023, [PDEP-10](https://pandas.pydata.org/pdeps/0010-required-pyarrow-dependency.html)
proposed to start using a PyArrow-backed string dtype by default in pandas 3.0
(i.e. infer this type for string data instead of object dtype). To ensure we
could use the variant of `StringDtype` backed by PyArrow instead of Python
objects (for better performance), it proposed to make `pyarrow` a new required
runtime dependency of pandas.

In the meantime, NumPy has also been working on a native variable-width string
data type, which will be available [starting with NumPy
2.0](https://numpy.org/devdocs/release/2.0.0-notes.html#stringdtype-has-been-added-to-numpy).
This can provide a potential alternative to PyArrow for implementing a string
data type in pandas that is not backed by Python objects.

After acceptance of PDEP-10, two aspects of the proposal have been under
reconsideration:

- Based on user feedback (mostly around installation complexity and size),
relaxing the new `pyarrow` requirement so that it is not a _hard_ runtime
dependency has been under consideration. In addition, NumPy 2.0 could in the
future potentially reduce the need to make PyArrow a required dependency
specifically for a dedicated pandas string dtype.
- PDEP-10 did not consider the usage of the experimental `pd.NA` as a
consequence of adopting one of the existing implementations of the
`StringDtype`.

For the second aspect, another variant of the `StringDtype` was
[introduced in pandas 2.1](https://pandas.pydata.org/docs/whatsnew/v2.1.0.html#whatsnew-210-enhancements-infer-strings)
that is still backed by PyArrow but follows the default missing value semantics
pandas uses for all other default data types (using `NaN` as the missing
value sentinel) ([GH-54792](https://github.com/pandas-dev/pandas/issues/54792)).
At the time, the `storage` option for this new variant was called
`"pyarrow_numpy"` to disambiguate from the existing `"pyarrow"` option using
`pd.NA` (but this PDEP proposes a better naming scheme, see the "Naming"
subsection below).

This last dtype variant is what users currently (pandas 2.2) get for string data
when enabling the `future.infer_string` option (to enable the behaviour which
is intended to become the default in pandas 3.0).

## Proposal

To be able to move forward with a string data type in pandas 3.0, this PDEP proposes:

1. For pandas 3.0, a "string" dtype is enabled by default, which will use PyArrow
if installed, and otherwise fall back to an in-house, functionally equivalent
(but slower) version.
2. This default "string" dtype will follow the same behaviour for missing values
as other default data types, and use `NaN` as the missing value sentinel.
3. The version that is not backed by PyArrow can reuse (with minor code
additions) the existing numpy object-dtype backed StringArray for its
implementation.
4. Installation guidelines are updated to clearly encourage users to install
pyarrow for the default user experience.

Those string dtypes enabled by default will then no longer be considered as
experimental.
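
As a rough illustration of item 1, the "PyArrow if installed, otherwise fall back" rule could be sketched as below. The helper name `resolve_default_storage` and its `pyarrow_available` parameter are hypothetical and not part of pandas; the sketch only assumes that availability is decided by whether the `pyarrow` package can be found.

```python
import importlib.util


def resolve_default_storage(pyarrow_available=None):
    """Return "pyarrow" if PyArrow can be found, else "python".

    Hypothetical helper for illustration only; this is not a pandas API.
    """
    if pyarrow_available is None:
        # find_spec looks the package up without importing it.
        pyarrow_available = importlib.util.find_spec("pyarrow") is not None
    return "pyarrow" if pyarrow_available else "python"


# The result depends on the environment the snippet runs in.
print(resolve_default_storage())
```

Passing `pyarrow_available` explicitly makes both branches easy to exercise in tests, regardless of what is installed.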

### Default inference of a string dtype
Contributor

I think this is item 5 of the "Proposal" above, with details in this section

Member Author

Sorry, can you clarify? Do you think an item 5 should be added to the enumeration above? (For me, this falls under item 1, i.e. what it means to have "string dtype enabled by default": it means we use it in inference and in IO.)

Contributor

I didn't interpret "is enabled by default" that way, but that's probably because the word "default" is used in many contexts in this PDEP. So maybe change item 1 to be more explicit about the meaning of "by default", i.e., change:
"For pandas 3.0, a "str" string dtype is enabled by default" to "For pandas 3.0, a "str" string dtype is used as the default dtype for all text data, in both inference and I/O operations", or something like that.


By default, pandas will infer this new string dtype instead of object dtype for
string data (when creating pandas objects, such as in constructors or IO
functions).

The existing `future.infer_string` option can be used to opt in to the future
default behaviour:

```python
>>> import pandas as pd
>>> pd.options.future.infer_string = True
>>> pd.Series(["a", "b", None])
0      a
1      b
2    NaN
dtype: string
```

Right now (pandas 2.2), the existing option only enables the PyArrow-based
future dtype. For the remaining 2.x releases, this option will be expanded to
also work when PyArrow is not installed to enable the object-dtype fallback in
that case.

### Missing value semantics

As mentioned in the background section, the original `StringDtype` has used
the experimental `pd.NA` sentinel for missing values. In addition to using
`pd.NA` as the scalar for a missing value, this essentially means
that:

- String columns follow ["NA-semantics"](https://pandas.pydata.org/docs/user_guide/missing_data.html#na-semantics)
for missing values, where `NA` propagates in boolean operations such as
comparisons or predicates.
- Operations on the string column that give a numeric or boolean result use the
nullable Integer/Float/Boolean data types (e.g. `ser.str.len()` returns the
nullable `"Int64"` / `pd.Int64Dtype()` dtype instead of the numpy `int64`
dtype (or `float64` in case of missing values)).

However, to date, all other default data types still use `NaN` semantics
for missing values. Therefore, this proposal states that the new default string
dtype should also use the same default missing value semantics and return
default data types when doing operations on the string column, to be consistent
with the other default dtypes at this point.

In practice, this means that the default `"string"` dtype will use `NaN` as
Member

I'm still -1 on changing this behavior; I do not want to revert "string" back to NumPy nullability semantics; that is a breaking change for anyone that has been using our extension type system to "solve" this issue for the past 5-6 years

Member

I think the PDEP should also be clear about long term expectations. I still think right now we are assuming:

  • 2.x release - dtype="string" uses pd.NA as a missing value marker
  • 3.x release - dtype="string" uses np.nan as a missing value marker by default, user setting to change to pd.NA
  • 4.x release - dtype="string" changes back to the 2.x behavior

Member Author

> I'm still -1 on changing this behavior; I do not want to revert "string" back to NumPy nullability semantics

For clarification, do you mean "-1 on using NaN semantics for the default string dtype, regardless of how we name it", or only "-1 on using NaN semantics for the dtype created as dtype="string"" ?

Because it is only the latter that causes the breaking change for anyone already using the nullable string dtype. Assume we would use a different name or different string alias than "string", we could still have a default string dtype (which everyone that was not yet using the nullable StringDtype would get by default) that uses the proposed NaN semantics, while not causing a breaking change for the existing users of dtype="string" / dtype=pd.StringDtype().

(it's another question whether there is enough support for using a different name, I personally think "string" is the best choice which we should reserve for the default dtype, but first to get a good understanding of your position)

Member

> For clarification, do you mean "-1 on using NaN semantics for the default string dtype, regardless of how we name it", or only "-1 on using NaN semantics for the dtype created as dtype="string"" ?

Definitely the latter, maybe the former. My expectation with PDEP-10 was that the default pyarrow string would be using pd.NA. If that is too difficult then yea there probably is a compromise on the former, but I do not want to take away the dtype="string" functionality from users that has been working all of this time.

Not that it is ideal, but we already have dtype=str today and dtype="string"; maybe the former becomes the new name for what is being proposed here instead of string[pyarrow_numpy] and only change dtype="string" to be pyarrow backed without changing nullability semantics?

That doesn't solve the str/"string" discrepancy but I don't think introduces any new problems either

Member Author

> I do not want to take away the dtype="string" functionality from users that has been working all of this time.

We are not "taking away" that functionality, the current revision of the PDEP only asks users to use dtype=pd.StringDtype(na_value=pd.NA) instead to continue using the same functionality, and Irv's suggestion would minimize the required code change to use dtype="String"

> Not that it is ideal, but we already have dtype=str today and dtype="string"; maybe the former becomes the new name for what is being proposed here instead of string[pyarrow_numpy]

Then what would you propose to show in the df.dtypes output? (i.e. the string repr of the dtype) Also "str" instead of "string"?
That would be an option. In that case, we could also use a separate StrDtype() class for those NaN-variants (which also solves the back compat issue for dtype=pd.StringDtype()).

Member (@WillAyd, May 16, 2024)

This is something we can discuss further in PDEP-13 but the current revision of it proposes consistently having the format of <TYPE>Dtype(na_marker=pd.NA|"legacy"), the idea being that long term pd.NA can be used consistently, but we will have a compatibility period of "legacy" where you get the mix of np.nan / pd.NaT for NumPy-based types (and still probably pd.NA for any new types like ListDtype).

So StringDtype(na_marker=np.nan) is slightly different from that. Maybe asking users to explicitly say np.nan instead of "legacy" has some downsides from a UI perspective, but my gut feeling is that we can solve that over time

Member

I do want to be wary though of users expecting complete control over the na_marker field. I don't see a value-add in trying to support DatetimeDtype(na_marker=np.nan) alongside DatetimeDtype(na_marker=pd.NaT) nor do I think there would ever be value in ListDtype(na_marker=np.nan)

Member Author

> I do want to be wary though of users expecting complete control over the na_marker field.

I definitely agree users shouldn't expect complete control over the na_marker or na_value field. This is not a custom value that can be anything, and we indeed don't want to generalize that to other dtypes in the future.

Because of that, I have been thinking to not actually allow a user to specify StringDtype(na_marker=np.nan) explicitly, but only allow the implicit default of StringDtype() (using NaN) or the explicit choice to not have the default with StringDtype(na_value=pd.NA).
Just to avoid users actually doing StringDtype(na_marker=np.nan) (while that is not necessary, when it is possible users will still do it), and thinking this will generalize to other dtypes.

Member Author

> I think the PDEP should also be clear about long term expectations.

(on your original comment in this thread) I will mention something about this changing again in the future, but I don't want to make it that explicit because at this point we don't know for sure that this will happen and what the timeline would be.

Member

I am still overall -1 on changing the default from pd.NA back to np.nan. The latter is not generalizable (not even in our current design) so I don't see how we can consider that the long term solution

the missing value sentinel, and:
Contributor

Interestingly, I don't think the current object behavior always uses NaN:

>>> sn = pd.Series(["abc", "defg", "hijkl"])
>>> sn
0      abc
1     defg
2    hijkl
dtype: object
>>> sn.shift(1)
0    None
1     abc
2    defg
dtype: object
>>> sn.shift(1).iloc[0] is None

So at least with the shift() operation, the "missing value" is None, not np.nan

Member Author

Yeah, we are not really consistent with this. When created, we will consider any null-like (None, NaN, NaT, NA) as a missing value in an object-dtype column, but for methods introducing them I would have hoped we were consistent. But apparently not:

>>> ser = pd.Series(["a", "b", "c"], dtype=object)
>>> ser.shift(1)
0    None
1       a
2       b
dtype: object
>>> ser.reindex([1, 2, 3])
1      b
2      c
3    NaN
dtype: object

And to be clear, this will all be NaN with the future string dtype (whether converted to NaN upon construction, or ensuring we use NaN for missing values introduced in methods)

Contributor

The fact that we're not consistent today does suggest that with this proposal (using np.nan as the default missing value indicator for strings), that there will be a behavior change for people using dtype=object today with strings, because we'd replace None with np.nan in shift() (and maybe elsewhere)

Member Author

Mentioned this in the backwards compatibility section


- String columns will follow NaN-semantics for missing values, where `NaN` gives
False in boolean operations such as comparisons or predicates.
- Operations on the string column that give a numeric or boolean result will use
the default data types (i.e. numpy `int64`/`float64`/`bool`).

Because the original `StringDtype` implementations already use `pd.NA` and
return masked integer and boolean arrays in operations, a new variant of the
existing dtypes that uses `NaN` and default data types was needed. The original
variant of `StringDtype` using `pd.NA` will still be available for those who
want to keep using it (see below in the "Naming" subsection for how to specify
this).
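
To make the two sets of rules concrete, the sketch below contrasts them on a pandas version that is already released. It rests on two assumptions: that object dtype is a fair stand-in for the NaN-semantics side (it already follows the default rules the proposal wants to match), and that `dtype="string"` still selects the existing NA-variant, as it does in pandas 2.x. The variable names are illustrative only.

```python
import pandas as pd

values = ["a", "bb", None]

# Stand-in for NaN semantics: object dtype follows today's default rules.
nan_style = pd.Series(values, dtype=object)
# Existing opt-in dtype (pandas 1.0+), which follows NA semantics.
na_style = pd.Series(values, dtype="string")

# Comparisons: the missing entry compares as False under NaN semantics,
# while it propagates as a missing value under NA semantics.
nan_eq = nan_style == "a"   # numpy bool result
na_eq = na_style == "a"     # nullable "boolean" result

# Numeric results: default float64 (NaN for the missing entry) versus
# the nullable Int64 dtype.
nan_len = nan_style.str.len()
na_len = na_style.str.len()

print(nan_eq.dtype, na_eq.dtype)
print(nan_len.dtype, na_len.dtype)
```

The proposed default string dtype is described as behaving like the first column in each pair, which keeps its results in the same family of dtypes as the rest of a default DataFrame.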

### Object-dtype "fallback" implementation

To avoid a hard dependency on PyArrow for pandas 3.0, this PDEP proposes to keep
a "fallback" option in case PyArrow is not installed. The original `StringDtype`
backed by a numpy object-dtype array of Python strings can be mostly reused for
this (adding a new variant of the dtype) and a new `StringArray` subclass only
needs minor changes to follow the above-mentioned missing value semantics
([GH-58451](https://github.com/pandas-dev/pandas/pull/58451)).

For pandas 3.0, this is the most realistic option given this implementation has
already been available for a long time. Beyond 3.0, further improvements such as
using NumPy 2.0 ([GH-58503](https://github.com/pandas-dev/pandas/issues/58503))
or nanoarrow ([GH-58552](https://github.com/pandas-dev/pandas/issues/58552)) can
still be explored, but at that point that is an implementation detail that
should not have a direct impact on users (except for performance).

### Naming

Given the long history of this topic, the naming of the dtypes is a difficult
topic.

In the first place, it should be acknowledged that most users should not need to
use storage-specific options. Users are expected to specify `pd.StringDtype()`
or `"string"`, and that will give them their default string dtype (which
depends on whether PyArrow is installed or not).

But for testing purposes and advanced use cases that want control over this, we
need some way to specify this and distinguish them from the other string dtypes.
In addition, users that want to continue using the original NA-variant of the
dtype need a way to specify this.

Currently (pandas 2.2), `StringDtype(storage="pyarrow_numpy")` is used, where
the `"pyarrow_numpy"` storage was used to disambiguate from the existing
`"pyarrow"` option using `pd.NA`. However, "pyarrow_numpy" is a rather confusing
option and doesn't generalize well. Therefore, this PDEP proposes a new naming
scheme as outlined below, and "pyarrow_numpy" will be deprecated and removed
before pandas 3.0.

The `storage` keyword of `StringDtype` is kept to disambiguate the underlying
storage of the string data (using pyarrow or python objects), but an additional
`na_value` is introduced to disambiguate the variants using NA semantics
Member

The only thing with na_value I wonder about is whether it's worth specifying na_value=np.nan or instead using something more generic like na_handling="legacy"

The advantage to the latter is it would work with any type, especially as we think about moving towards the logical type system. pd.DatetimeType(na_value=np.nan) probably should not exist, so a user would either have to opt in to pd.DatetimeType(na_value=pd.NaT) explicitly or gloss over that and just have them use pd.XXXType(na_handling="legacy") when they want historical pandas behavior

Member Author

While it might be a bit more future looking, I personally wouldn't want to call the default behaviour "legacy". That seems like a strange message, because it will potentially only be legacy in 4.0 (or in later 3.x releases if we provide an option to globally opt in to using pd.NA). But for most users for 3.0 it is simply the default missing value handling.

Member

Fair point. Not tied to the name. Maybe "hybrid" or "mixed"?

and NaN semantics.

Overview of the different ways to specify a dtype and the resulting concrete
dtype of the data:

| User specification | Concrete dtype | String alias | Note |
|----------------------------------------|---------------------------------------------------|-------------------------|------|
| Unspecified (inference) | `StringDtype(storage="pyarrow"\|"python", na_value=np.nan)` | "string" | (1) |
| `StringDtype()` or `"string"` | `StringDtype(storage="pyarrow"\|"python", na_value=np.nan)` | "string" | (1), (2) |
| `StringDtype("pyarrow")` | `StringDtype(storage="pyarrow", na_value=np.nan)` | "string" | (2) |
| `StringDtype("python")` | `StringDtype(storage="python", na_value=np.nan)` | "string" | (2) |
| `StringDtype("pyarrow", na_value=pd.NA)` | `StringDtype(storage="pyarrow", na_value=pd.NA)` | "string[pyarrow]" | |
| `StringDtype("python", na_value=pd.NA)` | `StringDtype(storage="python", na_value=pd.NA)` | "string[python]" | |
| `StringDtype(na_value=pd.NA)` | `StringDtype(storage="pyarrow"\|"python", na_value=pd.NA)` | "string[pyarrow]" or "string[python]" | (1) |
| `StringDtype("pyarrow_numpy")` | `StringDtype(storage="pyarrow", na_value=np.nan)` | "string[pyarrow_numpy]" | (3) |

Notes:
Member

Maybe one more note to add is around dtype_backend="numpy_nullable" which will now return a pyarrow-backed string array with pd.NA semantics when pyarrow is installed.

I consider fixing that out of scope for this PDEP, but I think should at least recognize it as an issue for the future

Member Author

I did add a paragraph about changing the default storage for StringDtype() with NA in the section above:

> For the original variant of StringDtype using pd.NA, currently the default
> storage is "python" (the object-dtype based implementation). Also for this
> variant, it is proposed to follow the same logic for determining the default
> storage, i.e. default to "pyarrow" if available, and otherwise
> fall back to "python".

That should cover this as well? Or do you want to explicitly make the connection with the dtype_backend keyword?

At the moment, if you set pyarrow as the default storage (pd.options.mode.string_storage = "pyarrow") and then specify dtype_backend="numpy_nullable" in an IO method, you already get the pyarrow backed StringDtype. So I would consider this covered by the current behaviour when changing the default storage.

Member

Ah I didn't realize that behavior existed today as well - I assumed numpy_nullable always just gave back StringDtype()

Not a blocker, just figure it would be nice to add a note that we recognize this inconsistency in the API today as something we want to fix, but not in the scope of this PDEP. Maybe even a note like:

> I/O methods often contain a `dtype_backend=` parameter which accepts an argument of `numpy_nullable`.
> Conceptually `numpy_nullable` means "NA-backed data types". The word "numpy" is a misnomer,
> as even today and going forward with this PDEP, the StringDtype returned with that argument may
> not use NumPy at all. This is potentially confusing, but fixing it is out of scope for this PDEP


- (1) You get "pyarrow" or "python" depending on pyarrow being installed.
- (2) Those three rows are backwards incompatible (i.e. they work now but give
the NA-variant), see the "Backward compatibility" section below.
- (3) "pyarrow_numpy" is kept temporarily because this is already in a released
version, but we can deprecate it in 2.x and have it removed for 3.0.
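
The resolution rules in the table and notes can be modelled in a few lines of plain Python. This is a toy model under stated assumptions, not pandas code: `ResolvedStringDtype`, `resolve_string_dtype`, and the `NAN`/`NA` labels are hypothetical names, missing value sentinels are represented as plain strings, and the deprecated `"pyarrow_numpy"` spelling from note (3) is left out.

```python
import importlib.util
from dataclasses import dataclass

NAN = "NaN"  # stand-in label for np.nan (the proposed default)
NA = "NA"    # stand-in label for pd.NA (the original variant)


@dataclass(frozen=True)
class ResolvedStringDtype:
    """Toy model of a concrete dtype; not a pandas class."""

    storage: str
    na_value: str

    @property
    def alias(self):
        # NaN-variants share the plain "string" alias; NA-variants
        # spell out their storage.
        if self.na_value == NAN:
            return "string"
        return f"string[{self.storage}]"


def resolve_string_dtype(storage=None, na_value=NAN, pyarrow_installed=None):
    """Map a user specification to a concrete dtype, per the table."""
    if pyarrow_installed is None:
        pyarrow_installed = importlib.util.find_spec("pyarrow") is not None
    if storage is None:
        # Note (1): the default storage depends on PyArrow being installed.
        storage = "pyarrow" if pyarrow_installed else "python"
    if storage not in ("pyarrow", "python"):
        raise ValueError(f"unknown storage: {storage!r}")
    if na_value not in (NAN, NA):
        raise ValueError(f"unknown na_value: {na_value!r}")
    return ResolvedStringDtype(storage, na_value)


print(resolve_string_dtype("python", na_value=NA).alias)
```

Keeping `na_value` restricted to two allowed values mirrors the point made in the review discussion that it is a choice between two variants, not a free-form sentinel.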

For the new default string dtype, only the `"string"` alias can be used to
specify the dtype as a string, i.e. no way is provided to make the
underlying storage (pyarrow or python) explicit through the string alias. This
string alias is only a convenience shortcut and for most users `"string"` is
sufficient (they don't need to specify the storage), and the explicit
`pd.StringDtype(...)` is still available for more fine-grained control.
Contributor

I think a user should be able to specify dtype="String", and then they get the equivalent of StringDtype(na_value=pd.NA)

Member Author

I updated the proposal to mention the addition of a "String" string alias for the NA-variant (it's mentioned below in the backwards compatibility section)


## Alternatives

### Why not delay introducing a default string dtype?

To avoid introducing a new string dtype while other discussions and changes are
in flux (eventually making pyarrow a required dependency? adopting `pd.NA` as
the default missing value sentinel? using the new NumPy 2.0 capabilities?
overhauling all our dtypes to use a logical data type system?), introducing a
default string dtype could also be delayed until there is more clarity in those
other discussions.

However:

1. Delaying has a cost: it further postpones introducing a dedicated string
dtype that has massive benefits for users, both in usability and (for the
significant part of the user base that has PyArrow installed) in performance.

Contributor

I don't know if you can say "significant" yet. I would delete that word.

Member Author

> I don't know if you can say "significant" yet. I would delete that word.

Deleted it.

2. In case pandas eventually transitions to use `pd.NA` as the default missing value
sentinel, a migration path for _all_ our data types will be needed, and thus
the challenges around this will not be unique to the string dtype and
therefore not a reason to delay this.

Making this change now for 3.0 will benefit the majority of users, while coming
at a cost for the part of the user base that already started using the
`"string"` or `pd.StringDtype()` dtype (they will have to update their code to
continue using the variant with `pd.NA`, see the "Backward compatibility"
section below).

### Why not use the existing StringDtype with `pd.NA`?

Wouldn't adding even more variants of the string dtype make things only more
confusing? Indeed, this proposal unfortunately introduces more variants of the
string dtype. However, the reason for this is to ensure the actual default user
experience is _less_ confusing, and the new string dtype fits better with the
other default data types.

If the new default string data type used `pd.NA`, then after some
operations, a user could easily end up with a DataFrame that mixes columns using
`NaN` semantics and columns using `NA` semantics (and thus a DataFrame that
could have columns with two different int64, two different float64, two different
bool, etc. dtypes). This would lead to a very confusing default experience.

The proposed new variant of the StringDtype ensures that, for the
_default_ experience, a user will see only one kind of integer dtype, only
one kind of bool dtype, etc. For now, a user should only get columns using
`pd.NA` when explicitly opting into this.
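
The mixing problem can be reproduced with released pandas versions. The sketch below is only an illustration: it uses object dtype as a stand-in for the current default string handling, and assumes that `pd.StringDtype(storage="python")` yields the `pd.NA` variant, as it does in the released versions this was checked against.

```python
import pandas as pd

values = ["a", "bb", None]

# Default-style strings (object dtype): missing values follow NaN semantics.
as_object = pd.Series(values, dtype=object)

# Opt-in string dtype using pd.NA (assumed to be the NA-variant here).
as_na_string = pd.Series(values, dtype=pd.StringDtype(storage="python"))

# The same operation yields two different "integer-like" result dtypes.
len_from_object = as_object.str.len()
len_from_na = as_na_string.str.len()

print(len_from_object.dtype)  # float64 (NaN forces a float result)
print(len_from_na.dtype)      # Int64 (nullable integer using pd.NA)
```

A DataFrame holding both results would contain two flavours of numeric column, which is the confusion the paragraphs above describe.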

### Naming alternatives

This PDEP now keeps the `pd.StringDtype` class constructor with the existing
`storage` keyword and with an additional `na_value` keyword.

During the discussion, several alternatives were brought up, both
alternative keyword names and using a different constructor. This PDEP opted to
keep using the existing `pd.StringDtype()` for now to keep the changes as
minimal as possible, leaving a larger overhaul of the dtype system (potentially
including different constructor functions or namespace) for a future discussion.
See [GH-58613](https://github.com/pandas-dev/pandas/issues/58613) for the full
discussion.

## Backward compatibility

The most visible backwards incompatible change will be that columns with string
data will no longer have an `object` dtype. Therefore, code that assumes
`object` dtype (such as `ser.dtype == object`) will need to be updated. This
change is done as a hard break in a major release, as warning in advance for the
changed inference is deemed too noisy.

To allow testing code in advance, the
`pd.options.future.infer_string = True` option is available for users.
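
To make the kind of code change concrete, here is a minimal sketch of a check that does not depend on strings being stored as `object`. Using `pd.api.types.is_string_dtype` for this purpose is an assumption of this example, not something the PDEP prescribes, and it is deliberately permissive (it also accepts plain `object` dtype).

```python
import pandas as pd

ser = pd.Series(["a", "b"])

# Fragile: only True while string data is inferred as object dtype.
is_object = ser.dtype == object

# More tolerant: accepts object dtype as well as a dedicated string dtype.
looks_like_strings = pd.api.types.is_string_dtype(ser.dtype)

print(is_object, looks_like_strings)
```

Running the same script once with and once without `pd.options.future.infer_string = True` is one way to see which checks in a code base depend on `object` dtype.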

Otherwise, the actual string-specific functionality (such as the `.str` accessor
methods) should generally all keep working as is. By preserving the current
missing value semantics, this proposal is also backwards compatible on this
aspect.

### For existing users of `StringDtype`

Users of the existing `StringDtype` will see more backwards incompatible
changes, though. In pandas 3.0, calling `pd.StringDtype()` (or specifying
`dtype="string"`) will start returning the new default string dtype using `NaN`,
while up to now this returned the string dtype using `pd.NA` introduced in
pandas 1.0.

For example, this code snippet returned the NA-variant of `StringDtype` with
pandas 1.x and 2.x:

```python
>>> pd.Series(["a", "b", None], dtype="string")
0       a
1       b
2    <NA>
dtype: string
```

but will start returning the new default NaN-variant of `StringDtype` with
pandas 3.0. This means that the missing value sentinel will change from `pd.NA`
to `NaN`, and that operations will no longer return nullable dtypes but default
numpy dtypes (see the "Missing value semantics" section above).

While this change will be transparent in many cases (e.g. checking for missing
values with `isna()`/`dropna()`/`fillna()` or filtering rows with the result of
a string predicate method keeps working regardless of the sentinel), this can be
a breaking change if users relied on the exact sentinel or resulting dtype. Since
pandas 1.0, the string dtype has been promoted quite a bit, and so we expect
that many users already have started using this dtype, even though it is
still officially labeled as "experimental".
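
The distinction between sentinel-agnostic and sentinel-dependent code can be sketched as follows. The example only uses long-standing pandas methods; the final check is the kind of code that would need attention, and its result depends on which variant `dtype="string"` resolves to in the installed version.

```python
import pandas as pd

ser = pd.Series(["a", "b", None], dtype="string")

# Sentinel-agnostic: these behave the same for pd.NA and NaN.
n_missing = int(ser.isna().sum())
without_missing = ser.dropna()
filled = ser.fillna("")

# Filtering with a string predicate also works for both variants:
# the missing entry is not selected.
starts_with_a = ser[ser.str.startswith("a")]

# Sentinel-dependent: an identity check against pd.NA is only True
# for the NA-variant of the dtype.
relies_on_sentinel = ser.iloc[2] is pd.NA
```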

To smooth the upgrade experience for those users, it is proposed to add a
deprecation warning before 3.0 when such a dtype is created, giving them two
options:

- If the user just wants to have a dedicated "string" dtype (or the better
performance when using pyarrow) but is fine with using the default NaN
semantics, they can add `pd.options.future.infer_string = True` to their code
to suppress the warning and already opt-in to the future behaviour of pandas
3.0.
- If the user specifically wants the variant of the string dtype that uses
`pd.NA` (and returns nullable numeric/boolean dtypes in operations), they will
have to update their dtype specification from `"string"` / `pd.StringDtype()`
to `pd.StringDtype(na_value=pd.NA)` to suppress the warning and further keep
their code running as is.
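
For the second option, a sketch of the updated spelling is shown below. The `na_value` keyword is the one proposed in this PDEP; the helper name `na_string_dtype` and the `TypeError` fallback are hypothetical additions for this example, so that the code also runs on pandas versions whose `StringDtype` does not accept `na_value` yet.

```python
import pandas as pd


def na_string_dtype():
    """Return the pd.NA variant of StringDtype (hypothetical helper)."""
    try:
        # Explicit spelling proposed by this PDEP.
        return pd.StringDtype(na_value=pd.NA)
    except TypeError:
        # Assumption: versions without the keyword already use pd.NA here.
        return pd.StringDtype()


ser = pd.Series(["a", None], dtype=na_string_dtype())
print(ser.iloc[1] is pd.NA)
```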

## Timeline

The future PyArrow-backed string dtype was already made available behind a feature
flag in pandas 2.1 (enabled by `pd.options.future.infer_string = True`).

Some small enhancements or fixes might still be needed and can continue to be
backported to pandas 2.2.x.

Contributor

I think those fixes should be in a 2.3

Member Author

While it seems we haven't had any fixes yet in 2.2.x, we merged several fixes for the future default string dtype mode in 2.1.x (after the initial 2.1.0 release). I would think we can continue doing that for fixes, but can also just leave out this sentence if there is disagreement.

(I think the general rule of this being discussed on a PR basis whether it should be backported or not, depending on how critical the fix is, would apply here, and so that maybe doesn't require explicit mentioning)

Member Author

> I think those fixes should be in a 2.3

Removed this sentence.


The variant using numpy object-dtype can also be backported to the 2.2.x branch
to allow easier testing. It is proposed to release this as 2.3.0 (created from
the 2.2.x branch, given that the main branch already includes many other changes
targeted for 3.0), together with the deprecation warning when creating a dtype
from `"string"` / `pd.StringDtype()`.

The 2.3.0 release would then have all future string functionality available
(both the pyarrow and object-dtype based variants of the default string dtype),
and warn existing users of the `StringDtype` in advance of 3.0 about how to
update their code.

For pandas 3.0, the `future.infer_string` flag becomes enabled by default.

## PDEP-XX History

- 3 May 2024: Initial version