BUG: read_html has unexpected behavior parsing th & td with colspan attribute. #56591

Closed
3 tasks done
FawzyMokhtar opened this issue Dec 21, 2023 · 6 comments
Labels
Bug; Closing Candidate (may be closeable, needs more eyeballs); IO HTML (read_html, to_html, Styler.apply, Styler.applymap); Needs Triage (issue that has not been reviewed by a pandas team member)

Comments

@FawzyMokhtar
FawzyMokhtar commented Dec 21, 2023

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

from io import StringIO

import pandas as pd

htmlTable = """
<table>
  <thead>
    <tr>
     <th colspan="3">Header 1</th>
     <th colspan="3">Header 2</th>
     <th colspan="3">Header 3</th>
    </tr>
  </thead>
  <tbody>
   <tr>
    <td>1</td>
    <td>2</td>
    <td>3</td>
    <td>4</td>
    <td>5</td>
    <td>6</td>
    <td>7</td>
    <td>8</td>
    <td>9</td>
   </tr>
  </tbody>
</table>
"""
df = pd.read_html(StringIO(htmlTable), index_col=0, keep_default_na=False)[0]
print(df)

Issue Description

Here is the output:

Header 1   Header 1.1   Header 1.2  Header 2  Header 2.1  Header 2.2  Header 3  Header 3.1  Header 3.2
1          2            3           4         5           6           7         8           9
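
The output above is consistent with a two-step treatment of `colspan`: each header cell is repeated once per spanned column, and the repeated labels are then made unique with numeric suffixes. The following is a minimal standard-library sketch of that idea, not pandas' actual implementation; the `HeaderCells` class and `expand_and_dedupe` function are hypothetical names introduced only for illustration.

```python
from html.parser import HTMLParser


class HeaderCells(HTMLParser):
    """Collect (text, colspan) for every <th> cell in the markup."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._colspan = None  # colspan of the <th> being read, else None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "th":
            # A missing colspan attribute is treated as a span of 1.
            self._colspan = int(dict(attrs).get("colspan") or 1)
            self._text = []

    def handle_data(self, data):
        if self._colspan is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "th" and self._colspan is not None:
            self.cells.append(("".join(self._text).strip(), self._colspan))
            self._colspan = None


def expand_and_dedupe(cells):
    """Repeat each label `colspan` times, then suffix repeats with .1, .2, ..."""
    expanded = [text for text, span in cells for _ in range(span)]
    seen = {}
    labels = []
    for name in expanded:
        count = seen.get(name, 0)
        labels.append(name if count == 0 else f"{name}.{count}")
        seen[name] = count + 1
    return labels


parser = HeaderCells()
parser.feed(
    '<tr><th colspan="3">Header 1</th>'
    '<th colspan="3">Header 2</th>'
    '<th colspan="3">Header 3</th></tr>'
)
labels = expand_and_dedupe(parser.cells)
print(labels)
```

Under these assumptions the nine labels come out as `Header 1`, `Header 1.1`, `Header 1.2`, and so on, matching the column names reported in the issue.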

Expected Behavior

Header 1                            Header 2                          Header 3
1          2            3           4         5           6           7         8           9

Note:
In the real scenario I get the duplicated header names as 'Unnamed: 1,2,3'.

Example:
image

Installed Versions

python: 3.11.6.final.0
pip: 23.2.1
pandas : 2.1.1 or 2.1.4
numpy: 1.26.0
xlsxwriter: 3.1.7
lxml.etree : 4.9.3
html5lib: 1.1
bs4: 4.12.2
FawzyMokhtar added the Bug and Needs Triage labels on Dec 21, 2023
@naman8989
Please can you assign this issue to me

@FawzyMokhtar
Author

> Please can you assign this issue to me

@naman8989 This option is not visible to me.

@rhshadrach
Member

It looks like, in your expected output, you are expecting multiple columns with the label None. Is that right? Working with DataFrames that have duplicate column labels can be very difficult, so I don't think this should be the behavior of read_html.
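
To make the difficulty with duplicate labels concrete, here is a small sketch. It builds a hypothetical frame by hand rather than through `read_html`, and uses the empty string as a stand-in for the blank labels in the expected output (an assumption made for illustration only).

```python
import pandas as pd

# Hypothetical frame: one named column, two blank (duplicate) labels, one named column.
df = pd.DataFrame([[1, 2, 3, 4]], columns=["Header 1", "", "", "Header 2"])

unique = df["Header 1"]  # a unique label selects a Series
dupes = df[""]           # a duplicated label selects a DataFrame instead

print(type(unique).__name__, type(dupes).__name__, dupes.shape)
print(df.columns.is_unique)
```

Code that expects `df[label]` to return a single column has to special-case the duplicated label, which is one reason unique column names are usually preferable.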

> In the real scenario I get the duplicated header names as 'Unnamed: 1,2,3'

I'm guessing you haven't been able to find a reproducible example that does this, is that right?

rhshadrach added the IO HTML and Closing Candidate labels on Dec 22, 2023
@FawzyMokhtar
Author

My question is: why does parsing with pandas.read_html and then exporting the parsed DataFrame to Excel not respect the colspan attributes?

Can you confirm that the reproducible example, when parsed and then exported to Excel, gives the same result as if it were rendered as an HTML table, with the colspanned cells merged?

image

@rhshadrach
Member

rhshadrach commented Dec 27, 2023

> Why does parsing with pandas.read_html and then exporting the parsed DataFrame to Excel not respect the colspan attributes?

These are two separate operations, and not all users will take the result of read_html and put it into Excel. We can't modify the behavior of read_html in this way without hurting other operations pandas users might perform on the results.

> Can you confirm that the reproducible example, when parsed and then exported to Excel, gives the same result as if it were rendered as an HTML table, with the colspanned cells merged?

No - that does not give the same result. But I also don't think it should.
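
For readers who do want merged header cells in the Excel output, one possible workaround is to post-process the column labels after parsing. The sketch below is not a pandas feature; `header_spans` is a hypothetical helper that assumes repeated labels carry `.1`, `.2`, ... suffixes as in the output reported in this issue.

```python
import re


def header_spans(labels):
    """Group labels such as 'Header 1', 'Header 1.1', 'Header 1.2' into
    (base_label, first_col, last_col) spans.

    Caveat: a genuine label that happens to look like '<previous label>.<n>'
    would be grouped too, so this only suits tables known to use colspan.
    """
    spans = []
    for pos, label in enumerate(labels):
        match = re.fullmatch(r"(.*)\.(\d+)", label)
        if spans and match and match.group(1) == spans[-1][0]:
            # Suffixed repeat of the current base label: extend its span.
            base, first, _ = spans[-1]
            spans[-1] = (base, first, pos)
        else:
            spans.append((label, pos, pos))
    return spans


labels = [
    "Header 1", "Header 1.1", "Header 1.2",
    "Header 2", "Header 2.1", "Header 2.2",
    "Header 3", "Header 3.1", "Header 3.2",
]
spans = header_spans(labels)
print(spans)
```

Each span could then be written as one merged cell, for example with xlsxwriter's `worksheet.merge_range(first_row, first_col, last_row, last_col, data)`, with the data rows written underneath. Column positions would need adjusting if one column is used as the index, as with `index_col=0` in the example.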

@mroeschke
Member

Agreed that the existing behavior is intended. Thanks for the report, but closing.
