Table Extraction from PDF
This section describes two methods for extracting tables from PDF files.
Note
To extract tables from any document, set the strategy parameter to hi_res for both methods below.
Method 1: Using partition_pdf
To extract tables from PDF files using partition_pdf, set the infer_table_structure parameter to True and the strategy parameter to hi_res.
Usage
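from unstructured.partition.pdf import partition_pdf

fname = "example-docs/layout-parser-paper.pdf"

# hi_res together with infer_table_structure=True enables table extraction
elements = partition_pdf(
    filename=fname,
    infer_table_structure=True,
    strategy="hi_res",
)

# Keep only the elements categorized as tables
tables = [el for el in elements if el.category == "Table"]

print(tables[0].text)                    # plain text of the first table
print(tables[0].metadata.text_as_html)   # HTML rendering of the same table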
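The text_as_html value printed above is an HTML table string. As a minimal sketch of one way to work with it, the snippet below converts such a string into a list of rows using only the Python standard library. The names TableRows and html_table_to_rows and the sample HTML are hypothetical and made up for illustration; they are not part of unstructured.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the cell text of an HTML table as a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows
        self._row = None    # row being built, or None outside <tr>
        self._cell = None   # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def html_table_to_rows(html):
    """Parse an HTML table string into a list of lists of cell text."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


# Made-up sample; in practice pass tables[0].metadata.text_as_html instead.
sample_html = (
    "<table><tr><th>Model</th><th>Score</th></tr>"
    "<tr><td>A</td><td>0.9</td></tr></table>"
)
rows = html_table_to_rows(sample_html)
print(rows)
```

This sketch ignores nested tables and rowspan/colspan attributes, which would need extra handling.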
Method 2: Using Auto Partition or Unstructured API
Table extraction is enabled by default for all file types. To extract tables from PDFs and images using Auto Partition or the Unstructured API, simply set the strategy parameter to hi_res.
Usage: Auto Partition
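from unstructured.partition.auto import partition

filename = "example-docs/layout-parser-paper.pdf"

# partition detects the file type; hi_res enables table extraction
elements = partition(
    filename=filename,
    strategy="hi_res",
)

# Keep only the elements categorized as tables
tables = [el for el in elements if el.category == "Table"]

print(tables[0].text)                    # plain text of the first table
print(tables[0].metadata.text_as_html)   # HTML rendering of the same table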
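Indexing tables[0] raises an IndexError when no table is detected in the document. The following is a minimal sketch of a more defensive way to collect table output. The helper name table_html_or_text is hypothetical, and the SimpleNamespace objects are stand-ins that only mimic the category, text, and metadata.text_as_html attributes used above, so the snippet runs without unstructured installed.

```python
from types import SimpleNamespace


def table_html_or_text(elements):
    """Return one string per table: HTML when available, plain text otherwise."""
    results = []
    for el in elements:
        if getattr(el, "category", None) != "Table":
            continue
        metadata = getattr(el, "metadata", None)
        html = getattr(metadata, "text_as_html", None)
        results.append(html if html else el.text)
    return results


# Stand-in elements (assumption: real elements expose the same attributes).
fake_elements = [
    SimpleNamespace(category="Title", text="Intro", metadata=SimpleNamespace()),
    SimpleNamespace(
        category="Table",
        text="a b",
        metadata=SimpleNamespace(text_as_html="<table></table>"),
    ),
    SimpleNamespace(
        category="Table",
        text="c d",
        metadata=SimpleNamespace(text_as_html=None),
    ),
]

for item in table_html_or_text(fake_elements):
    print(item)
```

An empty result list then signals that no tables were found, instead of an exception.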
Usage: API Parameters
curl -X 'POST' \
  'https://api.unstructured.io/general/v0/general' \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -F 'files=@sample-docs/layout-parser-paper-with-table.jpg' \
  -F 'strategy=hi_res' \
  | jq -C . | less -R
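The request above pipes the raw JSON response through jq for display. As a rough sketch of post-processing that response in Python, the snippet below keeps only the table elements. It assumes the response body is a JSON array of element objects with "type", "text", and "metadata" keys, with the HTML stored under metadata "text_as_html"; that shape, the helper name tables_from_response, and the sample body are assumptions for illustration and should be checked against an actual response.

```python
import json


def tables_from_response(body):
    """Extract text and HTML for every element whose type is "Table"."""
    elements = json.loads(body)
    return [
        {
            "text": el.get("text", ""),
            "html": el.get("metadata", {}).get("text_as_html"),
        }
        for el in elements
        if el.get("type") == "Table"
    ]


# Made-up response body standing in for the API output (assumed shape).
sample_body = json.dumps(
    [
        {"type": "Title", "text": "Results", "metadata": {}},
        {
            "type": "Table",
            "text": "x y",
            "metadata": {"text_as_html": "<table><tr><td>x</td><td>y</td></tr></table>"},
        },
    ]
)

for table in tables_from_response(sample_body):
    print(table["text"])
    print(table["html"])
```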