ENH: Add totality validation to merge method #58547

Open
1 of 3 tasks
z3rone opened this issue May 3, 2024 · 1 comment · May be fixed by #58600
Labels
Enhancement, Needs Triage (issue that has not been reviewed by a pandas team member)

Comments

z3rone commented May 3, 2024

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

The available validation options lack checks for (left/right) totality. I frequently encounter cases where I need to manually check that, e.g., a one-to-one merge also finds a match in the right DataFrame for every row in the left DataFrame, or vice versa.

Feature Description

Add the following to one_to_one, one_to_many and many_to_one merge validations:

  • left_total ... Each row in the left DataFrame is matched to (at least) one row in the right DataFrame
  • right_total ... Each row in the right DataFrame is matched to (at least) one row in the left DataFrame
  • total ... Both left_total and right_total must hold

It should be possible to combine a join relation with a totality constraint by joining the two with a +, e.g. one_to_one+left_total
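To make the proposed syntax concrete, here is a minimal sketch of how a combined validate string could be split into its relation and totality parts. The helper name split_validate and the two constant sets are hypothetical and not part of pandas; the short forms (1:1, 1:m, m:1) are included on the assumption that they would be accepted alongside the long forms, as they are for validate today.

```python
# Hypothetical helper, NOT part of pandas: it only illustrates how a string
# such as "one_to_one+left_total" could be split into its two parts.
RELATIONS = {"one_to_one", "1:1", "one_to_many", "1:m", "many_to_one", "m:1"}
TOTALITY = {"left_total", "right_total", "total"}


def split_validate(validate):
    """Return (relation, totality); totality is None if no '+' part is given."""
    relation, sep, totality = validate.partition("+")
    if relation not in RELATIONS:
        raise ValueError(f"unknown join relation: {relation!r}")
    if sep and totality not in TOTALITY:
        raise ValueError(f"unknown totality constraint: {totality!r}")
    return relation, (totality if sep else None)


print(split_validate("one_to_one+left_total"))
print(split_validate("m:1"))
```

Keeping the relation and the totality constraint as separate parts means the existing uniqueness checks would not need to change; the totality check could run as an additional step.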

Alternative Solutions

Currently, an outer join followed by a check for NaN values in the "foreign" columns can find unmerged rows. However, this breaks down if the original DataFrames already contain NaN values, since those cannot be distinguished from the NaN values introduced by the join.
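A small sketch of that failure mode, using made-up data (the column names key, left_val and right_val are assumptions for illustration). It also shows the existing indicator=True parameter of merge, which is not mentioned above but adds a _merge column that does not depend on NaN values:

```python
import pandas as pd

# Every key in `left` has a partner in `right`, but `right` already holds a NaN.
left = pd.DataFrame({"key": ["a", "b"], "left_val": [1, 2]})
right = pd.DataFrame({"key": ["a", "b"], "right_val": [10.0, float("nan")]})

outer = left.merge(right, on="key", how="outer", indicator=True)

# NaN-based check: flags key "b" although it was matched (false positive).
nan_flagged = outer.loc[outer["right_val"].isna(), "key"].tolist()

# Indicator-based check: relies on the _merge column, so it is not fooled.
unmatched_left = outer.loc[outer["_merge"] == "left_only", "key"].tolist()

print(nan_flagged)     # ['b']
print(unmatched_left)  # []
```

The indicator column makes a manual check possible, but it still has to be written by hand after every merge, which is what the proposed validation would avoid.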

Additional Context

No response

z3rone added the Enhancement and Needs Triage labels May 3, 2024
z3rone pushed a commit to z3rone/pandas that referenced this issue May 4, 2024
@z3rone z3rone linked a pull request May 6, 2024 that will close this issue
z3rone pushed a commit to z3rone/pandas that referenced this issue May 6, 2024
z3rone commented May 9, 2024

To add a common use case: here the goal is to add the biological class to the favorite animal of certain people:

import pandas as pd

# Create the first DataFrame with person names and favorite animals
df1_data = {
    'Person': ['John', 'Emma', 'Alex', 'Darleen'],
    'Animal': ['Dog', 'Spider', 'Snake', 'Cat']
}
df1 = pd.DataFrame(df1_data)

# Create the second DataFrame with mapping of animals to biological class
df2_data = {
    'Animal': ['Dog', 'Snake', 'Cat'],
    'Biological_Class': ['Mammal', 'Reptile', 'Mammal']
}
df2 = pd.DataFrame(df2_data)

# Merge the DataFrames on the 'Animal' column
merged_df = pd.merge(
    df1,
    df2,
    on='Animal',
    validate='m:1'
)

merged_df will lack Emma's row entirely: 'Spider' has no class defined in df2, and the default inner join silently drops rows without a match. With the proposed feature, validate could be set to m:1+left_total. This would raise an error, because not all keys from the left df1 are contained in the right df2.
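Until something like this exists, the check can be approximated by hand. The following is a minimal sketch, not the proposed implementation: merge_left_total is a hypothetical helper that assumes a single key column and uses a left join with the existing indicator=True parameter to detect unmatched left rows.

```python
import pandas as pd


def merge_left_total(left, right, on, validate=None):
    """Hypothetical helper (not part of pandas): merge on a single key column
    and raise if any row of `left` has no match in `right`."""
    merged = left.merge(right, on=on, how="left", validate=validate, indicator=True)
    missing = merged.loc[merged["_merge"] == "left_only", on]
    if not missing.empty:
        raise pd.errors.MergeError(
            f"left_total violated; unmatched keys: {missing.unique().tolist()}"
        )
    return merged.drop(columns="_merge")


df1 = pd.DataFrame({
    'Person': ['John', 'Emma', 'Alex', 'Darleen'],
    'Animal': ['Dog', 'Spider', 'Snake', 'Cat'],
})
df2 = pd.DataFrame({
    'Animal': ['Dog', 'Snake', 'Cat'],
    'Biological_Class': ['Mammal', 'Reptile', 'Mammal'],
})

try:
    merge_left_total(df1, df2, on='Animal', validate='m:1')
except pd.errors.MergeError as err:
    print(err)  # reports 'Spider' as an unmatched key
```

When every left key does have a match, the left join returns the same rows as the default inner join, so the helper can stand in for the plain merge call.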
