Delta Table Source Connector

Objectives

  1. Extract text and metadata from a PDF file using the Unstructured.io Python SDK.

  2. Process and store this data in a Databricks Delta Table.

  3. Retrieve data from the Delta Table using the Unstructured.io Delta Table Connector.

Prerequisites

  • Unstructured Python SDK

  • Databricks account and workspace

  • AWS S3 for Delta Table storage

Extracting PDF Using Unstructured Python SDK

  1. Install Unstructured Python SDK

pip install unstructuredio-sdk
  1. Code Example

from unstructured_client import UnstructuredClient
from unstructured_client.models import shared
from unstructured_client.models.errors import SDKError

s = UnstructuredClient(
    security=shared.Security(
        api_key_auth=UNSTRUCTURED_API_KEY, # replace with your own API key
    ),
)

req = shared.PartitionParameters(
    # Note that this currently only supports a single file
    files=shared.PartitionParametersFiles(
        content=file.read(),
        files=filename,
    ),
    # Other partition params
    strategy="hi_res",
    chunking_strategy="by_title",
)

Processing and Storing into Databricks Delta Table

  1. Initialize PySpark

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('sparkdf').getOrCreate()
  1. Convert JSON output into Dataframe

import pyspark

dataframe = spark.createDataFrame(res.elements)
  1. Store DataFrame as Delta Table

dataframe.write.mode("overwrite").format("delta").saveAsTable("delta_table")

Extracting Delta Table Using Unstructured Connector

  1. Install Unstructured Connector Dependency

pip install "unstructured[delta-table]"
  1. Command Line Execution

unstructured-ingest \
    delta-table \
    --table-uri <<REPLACE WITH S3 URI>> \
    --output-dir delta-table-example \
    --storage_options "AWS_REGION=us-east-2, \
                       AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID, \
                       AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY" \
    --verbose

Conclusion

This documentation covers the essential steps for converting unstructured PDF data into structured data and storing it in a Databricks Delta Table. It also outlines how to extract this data for further use.