BUG: read_html has unexpected behavior parsing th & td with colspan attribute. #56591

FawzyMokhtar · 2023-12-21T20:27:42Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

from io import StringIO

import pandas as pd

htmlTable = """
<table>
  <thead>
    <tr>
     <th colspan="3">Header 1</th>
     <th colspan="3">Header 2</th>
     <th colspan="3">Header 3</th>
    </tr>
  </thead>
  <tbody>
   <tr>
    <td>1</td>
    <td>2</td>
    <td>3</td>
    <td>4</td>
    <td>5</td>
    <td>6</td>
    <td>7</td>
    <td>8</td>
    <td>9</td>
   </tr>
  </tbody>
</table>
"""
df = pd.read_html(StringIO(htmlTable), index_col=0, keep_default_na=False)[0]
print(df)

Issue Description

Here is the output:

Header 1   Header 1.1   Header 1.2  Header 2  Header 2.1  Header 2.2  Header 3  Header 3.1  Header 3.2
1          2            3           4         5           6           7         8           9

Expected Behavior

Header 1                            Header 2                          Header 3
1          2            3           4         5           6           7         8           9

Note:
In the real scenario I get the duplicated header names as 'Unnamed: 1,2,3'.

Installed Versions

python: 3.11.6.final.0
pip: 23.2.1
pandas : 2.1.1 or 2.1.4
numpy: 1.26.0
xlsxwriter: 3.1.7
lxml.etree : 4.9.3
html5lib: 1.1
bs4: 4.12.2


naman8989 · 2023-12-22T09:35:41Z

Could you please assign this issue to me?

FawzyMokhtar · 2023-12-22T12:58:19Z

> Please can you assign this issue to me

@naman8989 This option is not visible to me.

rhshadrach · 2023-12-22T13:22:52Z

It looks like in your expected output, you are expecting to have multiple columns with the label of None, is that right? Working with DataFrames that have duplicate column labels can be very difficult, so I don't think this should be the behavior of read_html.

> In the real scenario I get the duplicated header names as 'Unnamed: 1,2,3'.

I'm guessing you haven't been able to find a reproducible example that does this, is that right?
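
To illustrate the difficulty with duplicate column labels mentioned above, here is a minimal sketch. The frame is hypothetical and built by hand (it is not the output of read_html), and the empty string is used as a stand-in for the repeated blank label in the expected output.

```python
import pandas as pd

# Hypothetical frame mimicking the expected output: one named header
# followed by two columns that share the same (empty) label.
df = pd.DataFrame([[1, 2, 3]], columns=["Header 1", "", ""])

# The column labels are no longer unique.
print(df.columns.is_unique)  # False

# Selecting by the duplicated label returns a two-column DataFrame
# instead of a single Series, so downstream code cannot rely on the shape.
selected = df[""]
print(type(selected).__name__, selected.shape)  # DataFrame (1, 2)
```

Any code that assumes one label maps to one column has to special-case frames like this, which is likely part of why read_html renames the repeated headers instead.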

FawzyMokhtar · 2023-12-22T13:38:36Z

My question is: why does parsing with pandas.read_html and then exporting the parsed DataFrame to Excel not respect the colspan attributes?

Can you confirm that the reproducible example, when parsed and then exported to Excel, gives the same result as the rendered HTML table, with the colspan cells merged?

rhshadrach · 2023-12-27T15:28:25Z

> Why does parsing with pandas.read_html and then exporting the parsed DataFrame to Excel not respect the colspan attributes?

These are two separate operations; not all users will take the result of read_html and put it into Excel. We can't modify the behavior of read_html in this way without hurting other operations pandas users might do with the results.

> Can you confirm that the reproducible example, when parsed and then exported to Excel, gives the same result as the rendered HTML table, with the colspan cells merged?

No - that does not give the same result. But I also don't think it should.
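
For readers who do want grouped headers after parsing, one possible workaround is to regroup the renamed columns into a two-level MultiIndex. This is only a sketch under stated assumptions: the frame is built by hand with the column names shown in the issue's output rather than by calling read_html, the `.N` suffix is assumed to come solely from the duplicate renaming, and the level names `header` and `position` are made up for illustration.

```python
import re
from collections import Counter

import pandas as pd

# Column names copied from the output reported in the issue.
flat = pd.DataFrame(
    [[1, 2, 3, 4, 5, 6, 7, 8, 9]],
    columns=[
        "Header 1", "Header 1.1", "Header 1.2",
        "Header 2", "Header 2.1", "Header 2.2",
        "Header 3", "Header 3.1", "Header 3.2",
    ],
)

# Strip the trailing ".N" suffix to recover the original header text,
# and number the columns within each recovered header.
# Caveat: a genuine header ending in ".<digits>" would be stripped too.
seen = Counter()
pairs = []
for name in flat.columns:
    group = re.sub(r"\.\d+$", "", name)
    pairs.append((group, seen[group]))
    seen[group] += 1

grouped = flat.copy()
grouped.columns = pd.MultiIndex.from_tuples(pairs, names=["header", "position"])

# Each original header now selects its three spanned columns.
print(grouped["Header 2"])
```

DataFrame.to_excel documents a merge_cells option for writing MultiIndex headers as merged cells, so a frame shaped like this may be closer to the spreadsheet layout asked about; that step is not verified here.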

mroeschke · 2024-05-19T18:38:17Z

Agreed that the existing behavior is intended. Thanks for the report; closing.

FawzyMokhtar added the Bug and Needs Triage labels on Dec 21, 2023

rhshadrach added the IO HTML and Closing Candidate labels on Dec 22, 2023

mroeschke closed this as completed May 19, 2024
