Delta Table Source Connector
Objectives
Extract text and metadata from a PDF file using the Unstructured.io Python SDK.
Process and store this data in a Databricks Delta Table.
Retrieve data from the Delta Table using the Unstructured.io Delta Table Connector.
Prerequisites
Unstructured Python SDK
Databricks account and workspace
AWS S3 for Delta Table storage
Extracting PDF Using Unstructured Python SDK
Install Unstructured Python SDK
pip install unstructured-client
Code Example
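from unstructured_client import UnstructuredClient
from unstructured_client.models import shared
from unstructured_client.models.errors import SDKError

UNSTRUCTURED_API_KEY = "YOUR_API_KEY"  # replace with your own API key
filename = "example.pdf"  # placeholder: path to the PDF to process

s = UnstructuredClient(
    security=shared.Security(
        api_key_auth=UNSTRUCTURED_API_KEY,
    ),
)

# Read the PDF as bytes and build the partition request
with open(filename, "rb") as file:
    req = shared.PartitionParameters(
        # Note that this currently only supports a single file
        files=shared.PartitionParametersFiles(
            content=file.read(),
            files=filename,
        ),
        # Other partition params
        strategy="hi_res",
        chunking_strategy="by_title",
    )

# Send the request; res.elements holds the extracted elements
try:
    res = s.general.partition(req)
except SDKError as e:
    print(e)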
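Before storing anything, it can help to check what the partition response contains. The sketch below assumes the response's `elements` value is a list of dictionaries that each carry a `type` key; the helper name `summarize_elements` and the sample data are hypothetical stand-ins for a real response.

```python
from collections import Counter


def summarize_elements(elements):
    """Count how many elements of each type were returned."""
    return Counter(element.get("type", "Unknown") for element in elements)


# Made-up sample shaped like the element dictionaries (assumption).
sample_elements = [
    {"type": "CompositeElement", "text": "Introduction ..."},
    {"type": "CompositeElement", "text": "Methods ..."},
    {"type": "Table", "text": "Col A Col B"},
]

summary = summarize_elements(sample_elements)
for element_type, count in summary.most_common():
    print(f"{element_type}: {count}")
```

With a real response, the same helper could be called as `summarize_elements(res.elements)`.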
Processing and Storing into Databricks Delta Table
Initialize PySpark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
Convert JSON Output into a DataFrame
import pyspark
dataframe = spark.createDataFrame(res.elements)
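Elements can carry nested `metadata` dictionaries whose keys differ from one element to the next, which may make schema inference less predictable. The following is a minimal sketch of one way to flatten each element into a uniform row first. The helper name `flatten_elements`, the chosen columns, and the sample data are illustrative assumptions, not part of the SDK.

```python
import json


def flatten_elements(elements):
    """Turn element dicts into flat rows that all share the same columns."""
    rows = []
    for element in elements:
        metadata = element.get("metadata") or {}
        rows.append({
            "element_id": element.get("element_id"),
            "type": element.get("type"),
            "text": element.get("text"),
            "filename": metadata.get("filename"),
            "page_number": metadata.get("page_number"),
            # Keep the full metadata as a JSON string so nothing is lost.
            "metadata_json": json.dumps(metadata, sort_keys=True),
        })
    return rows


# Hypothetical sample shaped like the API's element dictionaries.
sample = [
    {"type": "Title", "element_id": "a1", "text": "Intro",
     "metadata": {"filename": "doc.pdf", "page_number": 1}},
    {"type": "NarrativeText", "element_id": "b2", "text": "Body text.",
     "metadata": {"filename": "doc.pdf"}},
]
rows = flatten_elements(sample)
```

The flattened rows could then be passed to Spark in place of the raw elements, for example `spark.createDataFrame(flatten_elements(res.elements))`.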
Store DataFrame as Delta Table
dataframe.write.mode("overwrite").format("delta").saveAsTable("delta_table")
Extracting Delta Table Using Unstructured Connector
Install Unstructured Connector Dependency
pip install "unstructured[delta-table]"
Command Line Execution
unstructured-ingest \
  delta-table \
  --table-uri <<REPLACE WITH S3 URI>> \
  --output-dir delta-table-example \
  --storage_options "AWS_REGION=us-east-2,\
AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID,\
AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY" \
  --verbose
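After the command finishes, the results in the output directory can be loaded for further use. The sketch below assumes the connector writes JSON files, each holding a list of element dictionaries, somewhere under `delta-table-example`; the helper name `load_ingest_output` is hypothetical, and the demo uses a throwaway directory in place of real output.

```python
import json
import tempfile
from pathlib import Path


def load_ingest_output(output_dir):
    """Collect element dicts from every JSON file under output_dir."""
    elements = []
    for path in sorted(Path(output_dir).rglob("*.json")):
        with path.open(encoding="utf-8") as handle:
            data = json.load(handle)
        # Assumption: each file holds a JSON list of element dictionaries.
        if isinstance(data, list):
            elements.extend(data)
    return elements


# Demo with a temporary directory standing in for delta-table-example.
with tempfile.TemporaryDirectory() as tmp:
    sample = [{"type": "NarrativeText", "text": "Hello"}]
    (Path(tmp) / "part-0.json").write_text(json.dumps(sample), encoding="utf-8")
    loaded = load_ingest_output(tmp)

print(len(loaded))
```

Against real output, the call would be `load_ingest_output("delta-table-example")`.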
Conclusion
This guide covers the essential steps for extracting text and metadata from a PDF with the Unstructured Python SDK, storing the result in a Databricks Delta Table, and reading that data back with the Unstructured Delta Table connector.