Multi-files API Processing

Introduction

This guide shows how to process multiple files stored in an S3 bucket with the Unstructured API and the S3 Connector, and then apply context-aware chunking to the results. The steps are: install the dependencies, configure the API key and bucket URL, run the connector from Python, combine the JSON output, and chunk the resulting elements by title.

Prerequisites

Ensure you have an Unstructured API key and access to an S3 bucket containing the target files.

Step-by-Step Process

Step 1: Install Unstructured and S3 Dependency

Install the unstructured package with S3 support.

pip install "unstructured[s3]"

Step 2: Import Libraries

Import the standard-library os module (used to read the API key from the environment) and the components of the unstructured package needed for S3 ingestion and chunking.

import os

from unstructured.ingest.interfaces import (
    FsspecConfig,
    PartitionConfig,
    ProcessorConfig,
    ReadConfig,
)
from unstructured.ingest.runner import S3Runner

from unstructured.chunking.title import chunk_by_title
from unstructured.staging.base import dict_to_elements

Step 3: Configuration

Set up the API key and S3 URL for accessing the data.

UNSTRUCTURED_API_KEY = os.getenv('UNSTRUCTURED_API_KEY')
S3_URL = "s3://rh-financial-reports/world-development-bank-2023/"
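Because os.getenv returns None when a variable is unset, a missing key would only surface later as a failed API call. The sketch below fails early instead. The helper name require_env is hypothetical and not part of the unstructured package.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`.

    Raises RuntimeError if the variable is unset or empty, so a missing
    key is reported before any files are sent for processing.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

With this helper, the key could be read as UNSTRUCTURED_API_KEY = require_env("UNSTRUCTURED_API_KEY").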

Step 4: Python Runner

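Configure the S3Runner and run it. The partitioned results are written as JSON to the output_dir directory. Passing anonymous=True accesses the bucket without AWS credentials, which works only for publicly readable buckets.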

runner = S3Runner(
    processor_config=ProcessorConfig(
        verbose=True,
        output_dir="Connector-Output",
        num_processes=8,
    ),
    read_config=ReadConfig(),
    partition_config=PartitionConfig(
        partition_endpoint="https://api.unstructured.io/general/v0/general",
        partition_by_api=True,
        api_key=UNSTRUCTURED_API_KEY,
        strategy="hi_res",
        hi_res_model_name="yolox",
    ),
    fsspec_config=FsspecConfig(
        remote_url=S3_URL,
    ),
)

runner.run(anonymous=True)

Step 5: Combine JSON Files from Multi-files Ingestion

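The connector writes its results as JSON files in the output directory. Combine them into a single list of element dictionaries for further processing. Note that read_and_combine_json is a user-defined helper, not a function provided by the unstructured package, so it must be defined before this call.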

combined_json_data = read_and_combine_json("Connector-Output/world-development-bank-2023")
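The guide calls read_and_combine_json without showing its definition. A minimal sketch of such a helper follows, assuming each output file sits directly in the given directory and holds a JSON list of element dictionaries. The implementation is an illustration, not code from the unstructured package.

```python
import json
from pathlib import Path


def read_and_combine_json(directory_path):
    """Load every *.json file in a directory into one combined list.

    Assumes each file holds a JSON list of element dictionaries; a file
    holding a single object is appended as one item instead.
    """
    combined = []
    # Sort the paths so the combined order is deterministic.
    for path in sorted(Path(directory_path).glob("*.json")):
        with path.open(encoding="utf-8") as f:
            data = json.load(f)
        if isinstance(data, list):
            combined.extend(data)
        else:
            combined.append(data)
    return combined
```

If the output directory contains nested subdirectories, replacing glob with rglob would pick up those files as well.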

Step 6: Convert into Unstructured Elements for Chunking

Convert the combined JSON data into Unstructured Elements and apply chunking by title.

elements = dict_to_elements(combined_json_data)
chunks = chunk_by_title(elements)
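To illustrate the idea behind chunking by title, the sketch below groups plain element dictionaries so that each chunk begins at a Title element. It is a simplified stand-in written for this guide, not the implementation of chunk_by_title, which operates on Element objects and applies further rules such as size limits. The function name group_by_title is hypothetical, and the sample data is made up for the example.

```python
def group_by_title(elements):
    """Group element dicts into chunks, starting a new chunk at each Title.

    `elements` is a list of dicts with at least a "type" key, matching
    the shape of the JSON written by the connector.
    """
    chunks = []
    current = []
    for element in elements:
        # A Title closes the chunk in progress and opens a new one.
        if element.get("type") == "Title" and current:
            chunks.append(current)
            current = []
        current.append(element)
    if current:
        chunks.append(current)
    return chunks


sample = [
    {"type": "Title", "text": "Overview"},
    {"type": "NarrativeText", "text": "First paragraph."},
    {"type": "Title", "text": "Results"},
    {"type": "NarrativeText", "text": "Second paragraph."},
]
grouped = group_by_title(sample)
```

Here grouped contains two chunks, one per title, each holding the title followed by its paragraph.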

Conclusion

Following these steps, you can process multiple files from an S3 bucket in a single run with the Unstructured S3 Connector. Chunking the combined elements by title keeps the content of each section together, so the resulting chunks follow the structure of the source documents.