
Default string dtype (PDEP-14): naming convention to distinguish the dtype variants #58613

Open
jorisvandenbossche opened this issue May 7, 2024 · 27 comments
Labels
API Design · Needs Discussion (Requires discussion from core team before further action) · Strings (String extension data type and string data)

Comments

@jorisvandenbossche
Member

jorisvandenbossche commented May 7, 2024

Context: the future string dtype for 3.0 (currently enabled with pd.options.future.infer_string = True) is being formalized in a PDEP in #58551, and one of the discussion points is how to name the different variants of the StringDtype that will exist with the PDEP (whether using pyarrow or numpy object dtype for the data under the hood, and whether using NA or NaN as the missing value sentinel).

As explained in #54792 (comment), we introduced the NaN-variant of the dtype for 3.0 as pd.StringDtype(storage="pyarrow_numpy") because we wanted to reuse the storage keyword but "pyarrow" is already taken (by the dtype using pd.NA introduced in pandas 1.3), and because we couldn't think of a better name at the time. But as also mentioned back then, that is far from a great name.

But as mentioned by @jbrockmendel in #58551 (comment), we don't necessarily need to reuse just the storage keyword, but we could also add new keywords to distinguish the dtype variants.

That got me thinking and listing some possible options here:

  • Add an extra keyword that distinguishes the NA sentinel (and with that implicitly the type of missing value semantics):
    • Possible names for pd.StringDtype(storage="python"|"pyarrow", <something>):
      • semantics="numpy" (and the other would then be "nullable" or "arrow" or ..?)
      • na_value=np.nan
      • na_marker=np.nan
      • missing=np.nan
      • nullable=False (although we have used "nullable dtypes" in the past to denote the dtypes using NA, it's also confusing here because the False variant does support missing values as well)
    • One drawback here is that I don't think users should ever explicitly do pd.StringDtype(storage="pyarrow", na_value=np.nan), as that is not future proof. But defaulting to na_value=np.nan (to avoid requiring users to specify it) is then not backwards compatible with the current pd.StringDtype(storage="pyarrow").
  • Add a new keyword separate from storage to determine the storage/backend that only controls the new variants with NaN.
    • Given we are using storage right now, but speak about "backend" in other places, we could add for example a backend keyword, where StringDtype(storage="python"|"pyarrow") keeps resulting in the dtypes using NA (backwards compatible), while doing StringDtype(backend="python"|"pyarrow") gives you the new dtypes using NaN (and specifying both then obviously errors)
    • It is not great API design to have two mutually exclusive keywords that essentially control the same thing, but it does avoid having to specify two keywords (or using the confusing names).
    • One question is which keyword name to use. backend has prior use in the "dtypes_backend" terminology. Irv suggested nature below.
  • For completeness, we can also still come up with a better storage name than "pyarrow_numpy" and stick to that single existing keyword. Suggestions from the PDEP PR:
    • "pyarrow_nan"
    • "pyarrow_legacy" (I wouldn't go with this one, because for users it is not "legacy" right now, rather it would be the default. It will only become legacy later if we decide on switching to NA later)

After writing this down, I think my current preference would go to StringDtype(backend="python"|"pyarrow"), as that seems the simplest for most users (it's a bit confusing for those who already explicitly used storage, but most users have never done that)
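
To make these options concrete, below is a minimal sketch of the spellings involved; the na_value and backend keywords are only the ideas listed above and are not existing pandas API:

import numpy as np
import pandas as pd

# Existing spellings (pandas 2.1/2.2): everything goes through `storage`
pd.StringDtype(storage="python")         # object-backed, pd.NA sentinel
pd.StringDtype(storage="pyarrow")        # pyarrow-backed, pd.NA sentinel
pd.StringDtype(storage="pyarrow_numpy")  # pyarrow-backed, np.nan sentinel

# Hypothetical spellings sketched in the options above (not implemented):
# pd.StringDtype(storage="pyarrow", na_value=np.nan)  # extra keyword for the sentinel
# pd.StringDtype(backend="pyarrow")                    # separate keyword for the NaN-variants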

@jorisvandenbossche added the API Design, Strings (String extension data type and string data) and Needs Discussion (Requires discussion from core team before further action) labels on May 7, 2024
@jorisvandenbossche
Member Author

FWIW, there is also a fourth option: not have any keyword for this, and don't give users a way to control this through StringDtype().
Calling the default pd.StringDtype() will under the hood still create a dtype instance either backed by numpy object or pyarrow depending on whether pyarrow is installed, but that choice would then always be done automatically. And we can still have a private API to create the dtype with one of the specific backends for testing.

@simonjayhawkins
Member

I guess I was hinting at something similar for the semantics keyword #58551 (comment). i.e. we don't make that public on the dtype constructor (only expose it as a dtype property), and perhaps control it at the DataFrame (and maybe Series) level, so that a nullable array assigned to a DataFrame would coerce to NumPy semantics.

@Dr-Irv
Contributor

Dr-Irv commented May 7, 2024

It seems to me that there are 2 choices that a user could make, along with options for those choices

  • How strings are stored and manipulated:
    • pyarrow
    • numpy objects (current 2.x behavior)
    • numpy strings (requires numpy 2.0)
  • How missing values are handled:
    • Use np.nan
    • Use pd.NA

Aside from the numpy 2.0 option, I think we have implementations of all combinations of storage and missing values available, and it seems to me that when we implement support for numpy strings (which depends on numpy 2.0), we'd want to support both np.nan and pd.NA for missing values.

There are then a few questions to address:

  1. How should a user choose the storage/manipulation/backend, and what should the keyword be
  2. How should a user choose the missing value behavior
  3. What should the defaults of each be
  4. What do we do to handle compatibility from version 2.2 to 3.0
  5. What is the class name for specifying these dtypes
  6. What is the string representation of these dtypes

I'd like to suggest the word nature as the keyword to represent the storage/backend and a keyword missing for the missing value. We deprecate the word storage in pd.StringDtype(), and document how the current storage argument for pd.StringDtype() maps to nature and missing. That will address questions 1, 2 and 4. The typed signature in the future (without the storage keyword) would be:

class StringDtype(StorageExtensionDtype):
    def __init__(self, nature: Literal["pyarrow", "numpy.object", "numpy.string"] = "pyarrow",
                 missing: np.nan | pd.NA = pd.NA) -> None: ...

Why nature? When you choose one of the 3 options, you are describing not just the storage backend, but also the implementation of the string methods like str.len(), str.split, etc. So it is the "nature" of the storage and the behavior that is being specified.

I'd also like to address question 6: we could use a new nomenclature for the string aliases that represent these dtypes. Let's NOT use the word "numpy" to represent using np.nan for missing values, but be explicit in using the strings "pd.NA" and "np.nan". The resulting strings representing all 6 combinations could then be (with my best guess for the equivalences of today's behavior):

  • "pyarrow_np.nan" (equivalent to "pyarrow_numpy")
  • "pyarrow_pd.NA" (equivalent to storage="pyarrow")
  • "numpy.object_np.nan" (equivalent to dtype="object")
  • "numpy.object_pd.NA" (equivalent to pd.StringDtype() in 2.x ??)
  • "numpy.string_pd.NA" (New)
  • "numpy.string_np.nan" (New)

For the missing value indicators, let's be explicit about whether np.nan or pd.NA is being used whenever we refer to them, because the community at large is going to have to be educated about the difference in the future and using some other "code word" to represent the 2 possibilities is just masking the issue even more (no pun intended).

@simonjayhawkins
Member

After writing this down, I think my current preference would go to StringDtype(backend="python"|"pyarrow"), as that seems the simplest for most users (it's a bit confusing for those who already explicitly used storage, but most users have never done that)

I'd like to suggest the word nature as the keyword to represent the storage/backend

some more suggestions: memory, memory_layout or layout

  • How missing values are handled:

For the missing value indicators, let's be explicit about whether np.nan or pd.NA is being used whenever we refer to them

This is maybe not explicit about the behavior, in the sense that the nullable string dtypes' methods that return numeric output will always return a nullable integer dtype, rather than either an int or a float dtype depending on the presence of NA values. That behavior gives a more consistent return type, and terms such as missing or na_value do not convey this.
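
As a concrete example of that point, this is what the existing NA-variant already does in pandas 2.x (a small sketch, output abbreviated):

import pandas as pd

s = pd.Series(["a", None], dtype="string")  # existing NA-variant (pd.NA)
s.str.len()
# 0       1
# 1    <NA>
# dtype: Int64   <- nullable Int64, whether or not missing values are present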

@jbrockmendel
Member

backend="python"|"pyarrow"

is "python" an option for the dtype_backend keyword? The keywords are similar enough that it will cause confusion if they don't have matching behavior.

nullable=False (although we have used "nullable dtypes" in the past to denote the dtypes using NA, it's also confusing here because the False variant does support missing values as well)

Yah I think there is confusion as to what "nullable" means depending on the writer/context.

propagation=... would be more explicit than semantics in describing what it controls. The downside is I can never remember whether to spell it "propa" or "propo".

I lean towards storage+na_value.

@jorisvandenbossche
Member Author

It seems to me that there are 2 choices that a user could make, along with options for those choices

For the PDEP / pandas 3.0, I personally explicitly do not want typical users to make this choice. In each context (pyarrow installed or not), there is just one default string dtype, and that is all that most users should worry about.

So while we still need some (public or internal) way to create the different variants explicitly (which is what we are discussing here, so thanks for your comments!), in the context of the PDEP I would like to hide that as much as possible. For that reason, I am personally not really a fan of adding an explicit keyword to choose the missing value semantics (like na_value or missing), or of the elaborate string representations that give a lot of details. Users that didn't opt in to any of the experimental options should just see string, and IMO we could even disallow creating the new NaN-variants of the string dtype with an alias other than "string" (i.e. not support "string[pyarrow_numpy]", to use the current naming in the main branch).

@jorisvandenbossche
Member Author

backend="python"|"pyarrow"

is "python" an option for the dtype_backend keyword? The keywords are similar enough that it will cause confusion if they don't have matching behavior.

No, but currently you also can't choose the "default" dtypes through dtype_backend; that keyword is only used for opting in to a set of non-default dtypes.
Now, it is a good point that backend="python" wouldn't translate to other dtypes like numeric or datetime dtypes, where we would never have the option to store such data as python objects (although in theory that could be an option for dates, not that we actually want to add it, I think).
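
For reference, a quick sketch of how the existing dtype_backend keyword is used today (pandas >= 2.0) in e.g. the IO readers; "data.csv" is just a placeholder file name, and only the non-default backends can be selected:

import pandas as pd

df_masked = pd.read_csv("data.csv", dtype_backend="numpy_nullable")  # masked/nullable dtypes
df_arrow = pd.read_csv("data.csv", dtype_backend="pyarrow")          # ArrowDtype-backed columns
# there is no dtype_backend value that selects the current default (NumPy) dtypes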

@Dr-Irv
Contributor

Dr-Irv commented May 7, 2024

For the PDEP / pandas 3.0, I personally explicitly do not want typical users to make this choice. In each context (pyarrow installed or not), there is just one default string dtype, and that is all that most users should worry about.

But at the top of this issue, you wrote:

one of the discussion points is how to name the different variants of the StringDtype that will exist with the PDEP (whether using pyarrow or numpy object dtype for the data under the hood, and whether using NA or NaN as missing value sentinel).

I agree that typical users should not make that choice, but if we use the nature and missing concept, and have default values for those parameters, then they don't have to make the choice, unless they choose to do so.

@lithomas1
Member

FWIW, there is also a fourth option: not have any keyword for this, and don't give users a way to control this through StringDtype(). Calling the default pd.StringDtype() will under the hood still create a dtype instance either backed by numpy object or pyarrow depending on whether pyarrow is installed, but that choice would then always be done automatically. And we can still have a private API to create the dtype with one of the specific backends for testing.

Big +1 on this.

I don't think there's a good way to resolve the ambiguity in a name like "pyarrow_numpy", and since pyarrow_numpy and the python fallback are going to be the default string dtype anyway, it's not going to be a common use case to convert to it.

So, I would be fine making users manually specify whatever keywords we decide on to create the pyarrow/python backed string array with np.nan as the missing value, and having no way to create e.g. a pyarrow_numpy array with a string alias.

I lean towards storage+na_value.

+1 on this.

There's a precedent for storage for string arrays, and na_value has history in things like read_csv.

I would be against adding a new keyword that clashes with "storage" (or changing "storage"), since it makes for messy handling internally. I also don't think it's worth the churn.

I'm less opinionated about na_value.

@WillAyd
Member

WillAyd commented May 7, 2024

I realize this is for PDEP 14 which we want as a fast mover, but PDEP 13 proposed the following structure for a data type:

class BaseType:

    @property
    def dtype_backend(self) -> Literal["pandas", "numpy", "pyarrow"]:
        """
        Library is responsible for the array implementation
        """
        ...

    @property
    def physical_type(self):
        """
        How does the backend physically implement this logical type? i.e. our
        logical type may be a "string" and we are using pyarrow underneath -
        is it a pa.string(), pa.large_string(), pa.string_view() or something else?
        """
        ...

    @property
    def missing_value_marker(self) -> pd.NA | np.nan:
        """
        Sentinel used to denote missing values
        """
        ...

Which may be of interest here too (though feedback so far is that missing_value_marker is better called na_value).

It might be useful to separate the idea of the backend / provider (e.g. "pyarrow") from how that backend / provider is actually implementing things (e.g. "string", "large_string", "string_view"). @Dr-Irv I think you are going down that path with your "nature" suggestion, but rather than trying to cram all that metadata into a single string, having two properties might be more future proof.

@Dr-Irv
Contributor

Dr-Irv commented May 7, 2024

It might be useful to separate the idea of the backend / provider (e.g. "pyarrow") from how that backend / provider is actually implementing things (e.g. "string", "large_string", "string_view"). @Dr-Irv I think you are going down that path with your "nature" suggestion, but rather than trying to cram all that metadata into a single string, having two properties might be more future proof.

Yes, the fact that a pyarrow backend could use a different physical type adds another aspect. So backend and physical_type might make sense to have instead of "nature"
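
A loose illustration of that separation, building on the PDEP-13 sketch above; the backend and physical_type properties are hypothetical (only storage and na_value exist on StringDtype today):

import pandas as pd

dtype = pd.StringDtype("pyarrow")
# Hypothetical properties, if backend and physical type were separated:
# dtype.backend        -> "pyarrow"
# dtype.physical_type  -> "large_string" (could also be "string" or "string_view")
# dtype.na_value       -> pd.NA (this attribute does exist today)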

@jorisvandenbossche
Member Author

jorisvandenbossche commented May 8, 2024

Thanks for the discussion here. I would like to do a next iteration of my proposal, based on:

  • I would like to keep the changes as minimal as possible for 3.0. For example, we already have a storage keyword, so let's just stick to that for 3.0.
    In the logical dtypes discussion, we can still decide to generalize that for all dtypes, potentially with a different name like backend or nature, and at that point we can just alias or eventually deprecate storage for StringDtype (something we would otherwise have to do now as well, so let's just leave that for the logical dtypes discussion)
  • The easiest way to have a minimal distinction between the NA and NaN variants is to add one keyword for that (I agree now that my suggestion of having both storage and backend keywords, the one for the NA variants and the other for NaN, would be quite confusing).
    And given we already use na_value as an attribute on the dtypes right now, that seems the most obvious choice as others have also argued.
  • We don't yet have different physical types for one backend (with StringDtype("pyarrow"), you always get pyarrow's large_string), so again that is something we can leave for the logical dtypes discussion, and there is no need to already add a keyword for that right now.

That leads me to the following table (the first column is how the user can create a dtype, the second column the concrete dtype instance they would get, and the third column the string alias they see in displays and can use as a dtype specification in addition to the first column):

| User specification | Concrete dtype | String alias | Note |
| --- | --- | --- | --- |
| Unspecified (inference) | StringDtype("pyarrow" or "python", na_value=np.nan) | "string" | (1) |
| StringDtype() or "string" | StringDtype("pyarrow" or "python", na_value=np.nan) | "string" | (1), (2) |
| StringDtype("pyarrow") | StringDtype("pyarrow", na_value=np.nan) | "string" | (2) |
| StringDtype("python") | StringDtype("python", na_value=np.nan) | "string" | (2) |
| StringDtype("pyarrow", na_value=pd.NA) | StringDtype("pyarrow", na_value=pd.NA) | "string[pyarrow]" | |
| StringDtype("python", na_value=pd.NA) | StringDtype("python", na_value=pd.NA) | "string[python]" | |
| StringDtype("pyarrow_numpy") | StringDtype("pyarrow", na_value=np.nan) | "string[pyarrow_numpy]" | (3) |

(1) You get "pyarrow" or "python" depending on pyarrow being installed.
(2) Those three rows are backwards incompatible (i.e. they work now but give you the NA-variant), but we could still do a deprecation warning about that in advance of changing it.
(3) Keep "pyarrow_numpy" temporarily because this is in main, but deprecate in 2.2.x and have removed for 3.0

Additional notes on the string aliases:

  • I would explicitly not allow using "string[pyarrow]" and "string[python]" string aliases to create the new default NaN-variants, but only allow "string".
    Reasons:
    • 1) users should never specify it but rely on the default selection of backend (if you do specify it, that could make your code non-portable to an environment without pyarrow), and there is still the StringDtype(..) way to be explicit in case you need it.
    • 2) that would also allow keeping the existing string aliases like "string[pyarrow]" backwards compatible (although it could also be confusing that this gives the NA-variant, so we might still want to deprecate that regardless).
    • Finally, this also avoids the need to encode the NA/NaN value in the string alias in order to have a distinct alias for every variant.
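
As an illustration, a hedged sketch of what the table above would mean in practice; this is the proposed 3.0 behaviour, not what released pandas versions do, and the na_value keyword is part of the proposal:

import numpy as np
import pandas as pd

# Default / inference and the plain alias give the NaN-variant,
# picking pyarrow or python storage automatically:
pd.Series(["a", "b"])                      # inferred -> "string" (NaN-variant)
pd.Series(["a", "b"], dtype="string")      # -> "string" (NaN-variant)
pd.StringDtype("pyarrow")                  # -> StringDtype("pyarrow", na_value=np.nan)

# Opting in to the NA-variant requires being explicit:
pd.StringDtype("pyarrow", na_value=pd.NA)  # -> "string[pyarrow]"
pd.StringDtype("python", na_value=pd.NA)   # -> "string[python]"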

@Dr-Irv
Contributor

Dr-Irv commented May 8, 2024

That leads me to the following table (the first column is how the user can create a dtype, the second column the concrete dtype instance they would get, and the third column the string alias they see in displays and can use as a dtype specification in addition to the first column):

I think the rows annotated with footnote (2) are problematic, because it is a change in behavior for people currently using StringDtype(), or StringDtype("pyarrow") or StringDtype("python"), i.e., you are now using np.nan as the default NA value, rather than pd.NA, which is there today.

Here's another idea. What if we deprecate the top-level namespace specifications for dtype, and move all of them to a pandas.dtype package, i.e., pandas.dtype.string, pandas.dtype.int64, etc. If you then use StringDtype(), you get the current behavior (pd.NA for missing values). But if you use pd.dtype.string, you get what you propose above. Then the only people affected by behavior changes are ones who specified dtype="string", because that would now use np.nan rather than pd.NA.

@jorisvandenbossche
Member Author

I think the rows annotated with footnote (2) are problematic, because it is a change in behavior for people currently using StringDtype(), or StringDtype("pyarrow") or StringDtype("python")

Yes, but as discussed on the PDEP, we could still add a deprecation warning for it. I know that doesn't change that it still is a behaviour change (and users will only see the deprecation warning for a short time), but at least we could do it with some warning in advance.

Further, my guess is that "string" will be used more often than StringDtype (that guess is based on the usage in our own documentation, and on a quick search of StackOverflow questions labeled with pandas for "StringDtype"), and certainly more often than the versions with an explicit keyword.

So if we are eventually on board with changing the behaviour for "string", I think changing StringDtype() as well is OK.

Further, I think quite some users that now do StringDtype don't necessarily need the NA-variant, but would be perfectly fine with the NaN-variant. And so that would also avoid them having to change their code.
(i.e. only those who explicitly want to work with the nullable NA dtypes will have to make their dtype specification more explicit)

@jorisvandenbossche
Member Author

Here's another idea

And to be clear, I think that is a good idea long-term, but I would personally keep that for when we are ready to do that for all dtypes, instead of now only having a string dtype in such a namespace (while other default dtypes like categorical, datetimetz, etc still live top-level)

@Dr-Irv
Contributor

Dr-Irv commented May 8, 2024

Further, I think quite some users that now do StringDtype don't necessarily need the NA-variant, but would be perfectly fine with the NaN-variant. And so that would also avoid them having to change their code.
(i.e. only those who explicitly want to work with the nullable NA dtypes will have to make their dtype specification more explicit)

Given what @WillAyd wrote here: #58551 (comment) I think we need to be careful about this.

@Dr-Irv
Contributor

Dr-Irv commented May 8, 2024

And to be clear, I think that is a good idea long-term, but I would personally keep that for when we are ready to do that for all dtypes, instead of now only having a string dtype in such a namespace (while other default dtypes like categorical, datetimetz, etc still live top-level)

But if we did this for all dtypes as part of the change for strings (and deprecated the top-level dtypes), then we accomplish both goals, which includes a better migration path for strings (maybe).

@jorisvandenbossche
Member Author

Further, I think quite some users that now do StringDtype don't necessarily need the NA-variant, ....

Given what @WillAyd wrote here: #58551 (comment) I think we need to be careful about this.

I think many of the posts about the StringDtype are not necessarily about how NA is different from NaN, but just about how cool it is to have an actual string dtype instead of the confusing catch-all object dtype, and (in case it's about the pyarrow variant) the performance improvements.
I can't actually check the Medium blog post (it's member-only), but at least the SO question is just about converting to strings. And for example https://pythonspeed.com/articles/pandas-string-dtype-memory/ is about memory improvement. https://park.is/notebooks/comparing-pandas-string-dtypes/ does an in-depth comparison but doesn't really mention the difference in missing value semantics (it does mention missing values, but to show how the string dtype no longer converts missing values to their string repr, in contrast to astype(str) with object dtype). (Those were among the first hits for blog posts in a Google search for "pandas string dtype".)

@Dr-Irv
Contributor

Dr-Irv commented May 8, 2024

Consider this code (using 2.2):

>>> s = pd.Series(["a", "b", "c"], dtype="string[pyarrow_numpy]")
>>> s
0    a
1    b
2    c
dtype: string
>>> s.str.len()
0    1
1    1
2    1
dtype: int64
>>> s2 = pd.Series(["a", "b", "c"], dtype="string")
>>> s2
0    a
1    b
2    c
dtype: string
>>> s2.str.len()
0    1
1    1
2    1
dtype: Int64
>>> s.shift(1)
0    NaN
1      a
2      b
dtype: string
>>> s2.shift(1)
0    <NA>
1       a
2       b
dtype: string
>>> s2.shift(1).str.len()
0    <NA>
1       1
2       1
dtype: Int64
>>> s.shift(1).str.len()
0    NaN
1    1.0
2    1.0
dtype: float64

If we adopt your proposal, then if you have np.nan in your series of strings, and take the length, you get a float series. I've found this annoying in the past.

I'm not saying this is a reason to not adopt this proposal, but just wanted to point out this behavior.

@Dr-Irv
Contributor

Dr-Irv commented May 8, 2024

For StringDtype("pyarrow", na_value=pd.NA) and StringDtype("python", na_value=pd.NA), can we add the alias "String"

Also, for StringDtype, is the first argument required? Could you just do StringDtype(na_value=pd.NA), and then it will use pyarrow if installed, otherwise python?

@WillAyd
Member

WillAyd commented May 8, 2024

Also, for StringDtype, is the first argument required? Could you just do StringDtype(na_value=pd.NA), and then it will use pyarrow if installed, otherwise python?

I think this is going to be really problematic. While I like the spirit of having one StringDtype with potentially different underlying implementations depending on what is installed, I think this is just going to open another can of worms:

>>> ser = pd.Series(["x", "y", None], dtype=pd.StringDtype(na_value=pd.NA)
>>> ser.str.len()

What data type gets returned here if arrow is installed? "pyarrow[int64]" maybe makes sense, but then if the user for whatever reason uninstalls pyarrow does the same code now return "Int64"? What if they changed the na_value to:

>>> ser = pd.Series(["x", "y", None], dtype=pd.StringDtype())

Whether or not we have pyarrow installed I assume this returns a float (?)

I am definitely not a fan of our current "string[pyarrow_numpy]" but to its credit it is at least consistent in the data types it returns

With PDEP-13 my expectation would be that regardless of what is installed:

>>> ser = pd.Series(["x", "y", None], dtype=pd.StringDtype(na_value=<na_value>))
>>> ser.str.len()
0 1
1 1
2 <na_value>
dtype: pd.Int64()

so I think that abstracts a lot of these issues. But I don't think only partially implementing that for a StringDtype is going to get us to a better place.

@jorisvandenbossche
Member Author

I think this is going to be really problematic. ..
What data type gets returned here if arrow is installed? "pyarrow[int64]" maybe makes sense, but then if the user for whatever reason uninstalls pyarrow does the same code now return "Int64"?

@WillAyd no, this is not correct. Or at least, that's not how the existing StringDtype (with pd.NA) currently works, and we are not proposing to change anything about that right now.

If you use StringDtype(storage="python"|"pyarrow") (i.e. the versions that use pd.NA and that have already been around since 1.x), you always get back our nullable/masked arrays, and so for this example of len() you always get Int64 dtype result, regardless of using pyarrow under the hood or not.

(it's only when you use pd.ArrowDtype(pa.string()) that you would get a pyarrow backed integer dtype)

If we adopt your proposal, then if you have np.nan in your series of strings, and take the length, you get a float series. I've found this annoying in the past.

Yes, you will get a float, just as you get a float result right now with object dtype (nothing changes there compared to the status quo). And yes, this is annoying, but it is not limited to strings: it is annoying whenever an operation casts ints to floats because of the introduction of missing values, since our default integer dtypes do not support missing values.


But this is not really about the naming anymore but about behaviour, so let's continue those discussions about behaviour in the PDEP PR #58551 itself.

@jorisvandenbossche
Member Author

For StringDtype("pyarrow", na_value=pd.NA) and StringDtype("python", na_value=pd.NA), can we add the alias "String"

As mentioned on the call, if people think this would help, I am certainly fine with adding that alias. It is 1) consistent with the current string aliases of the other "nullable" (using pd.NA) Int/Float/Boolean dtypes, and 2) it makes the code edits to keep using the NA-variant of StringDtype a bit smaller (i.e. a user can change dtype="string" to dtype="String" in addition to dtype=pd.StringDtype(na_value=pd.NA)).

I think the main downside is that it makes the string aliases even more confusing. And also, the existing "string[python]" and "string[pyarrow]" aliases, which I personally would keep working as is, i.e. being aliases for the NA-variant (see the note at the bottom of the comment at #58613 (comment)), would then also be inconsistent with "String", unless we add "String[python]" and "String[pyarrow]".

Also, for StringDtype, is the first argument required? Could you just do StringDtype(na_value=pd.NA), and then it will use pyarrow if installed, otherwise python?

That sounds good to me. Currently that does not happen automatically: the default was simply storage="python", but this could be overridden with pd.options.mode.string_storage = "pyarrow" to globally use storage="pyarrow" as the default for StringDtype().
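
For reference, a small sketch of the current (pandas 1.x/2.x) behaviour described here:

import pandas as pd

pd.StringDtype()                            # default: StringDtype(storage="python"), pd.NA sentinel
pd.options.mode.string_storage = "pyarrow"  # change the global default storage
pd.StringDtype()                            # now: StringDtype(storage="pyarrow"), still pd.NA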

But I think it certainly makes sense to change this, and let it follow the "choose the default depending on whether pyarrow is installed" logic for the NaN-variant of the dtype (if the user did not set that string_storage option).
We want to make sure the user can continue to use the NA-variant of the dtype, without having to be specific about using pyarrow or not, and enabling pd.StringDtype(na_value=pd.NA) should then allow that.

@WillAyd
Member

WillAyd commented May 13, 2024

If you use StringDtype(storage="python"|"pyarrow") (i.e. the versions that use pd.NA and that have already been around since 1.x), you always get back our nullable/masked arrays, and so for this example of len() you always get Int64 dtype result, regardless of using pyarrow under the hood or not.

Ah OK - I thought this PDEP wanted to re-use the functionality of string[pyarrow_numpy], which wouldn't support pd.NA, hence my original question about the return types (I was assuming an np.nan sentinel would match the same "string[pyarrow_numpy]" types).

I am on board with the default string type continuing to return extension types regardless of what is installed (that aligns with the PDEP-13 proposal), although my remaining hang up I think would just be not changing the nullability semantics for that StringDtype by default

@WillAyd
Member

WillAyd commented May 13, 2024

I guess just to be clear, both StringDtype(na_value=np.nan) and StringDtype(na_value=pd.NA) are supposed to return Int64Dtype() for ser.str.len() in this proposal right?

@jorisvandenbossche
Member Author

jorisvandenbossche commented May 13, 2024

No, the StringDtype(na_value=np.nan) variants will return our default integer dtype, i.e. int64 (or float64 when missing values are present), and the StringDtype(na_value=pd.NA) variants will return the nullable masked ones, i.e. Int64Dtype() for len().

And in both cases, I say "variants" because this is true regardless of whether it is using pyarrow under the hood (storage="pyarrow") or not (storage="python").

So this means that the return type of operations does not depend on some library being installed or not, and only depends on the user explicitly opting in to NA-dtypes (by default, a user only gets default data types using NaN semantics).
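
A minimal sketch of those return types under the proposal (the na_value keyword is part of the PDEP, not of the released 2.2 API):

import numpy as np
import pandas as pd

s_nan = pd.Series(["a", None], dtype=pd.StringDtype("pyarrow", na_value=np.nan))
s_na = pd.Series(["a", None], dtype=pd.StringDtype("pyarrow", na_value=pd.NA))

s_nan.str.len()  # default dtypes: int64, or float64 here because of the missing value
s_na.str.len()   # nullable masked Int64, with pd.NA for the missing value
# the same holds with storage="python": the return type never depends on pyarrow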

This last item is what I tried to clarify above, because in your comment you said "What data type gets returned here if arrow is installed? "pyarrow[int64]" maybe makes sense, but then if the user for whatever reason uninstalls pyarrow does the same code now return "Int64"?", and I wanted to clarify that the return type never depends on whether it uses pyarrow under the hood or not.

(but again, this is more related to the actual proposed behaviour of the PDEP than to the naming scheme, so let's continue this discussion on the PDEP PR itself. I just pushed a commit to expand on these missing value semantics in 54a43b3)

@WillAyd
Member

WillAyd commented May 13, 2024

Ah... sorry, I was just replying to the GitHub notification. Thought this was the PDEP itself all this time... will try to port over that conversation.
