SQL

NOTE: At the moment, the connector only supports PostgreSQL and SQLite. Vectors can be stored and searched in PostgreSQL with pgvector.

Batch process all your records using unstructured-ingest to store structured outputs locally on your filesystem and upload those local files to a PostgreSQL or SQLite schema.

Insert query is currently limited to append.

First you’ll need to install the sql dependencies as shown here if you are using PostgreSQL.

pip install "unstructured[postgres]"

Run Locally

The upstream connector can be any of the ones supported, but for convenience here, showing a sample command using the upstream local connector.

#!/usr/bin/env bash

unstructured-ingest \
  local \
  --input-path example-docs/fake-memo.pdf \
  --anonymous \
  --output-dir local-output-to-mongo \
  --num-processes 2 \
  --verbose \
  --strategy fast \
  sql \
  --db-type postgresql \
  --username postgres \
  --password test \
  --host localhost \
  --port 5432 \
  --database elements

For a full list of the options the CLI accepts check unstructured-ingest <upstream connector> sql --help.

NOTE: Keep in mind that you will need to have all the appropriate extras and dependencies for the file types of the documents contained in your data storage platform if you’re running this locally. You can find more information about this in the installation guide.

Sample Index Schema

To make sure the schema of the index matches the data being written to it, a sample schema json can be used.

Object description
 1CREATE TABLE elements (
 2    id UUID PRIMARY KEY,
 3    element_id VARCHAR,
 4    text TEXT,
 5    embeddings DECIMAL [],
 6    type VARCHAR,
 7    system VARCHAR,
 8    layout_width DECIMAL,
 9    layout_height DECIMAL,
10    points TEXT,
11    url TEXT,
12    version VARCHAR,
13    date_created TIMESTAMPTZ,
14    date_modified TIMESTAMPTZ,
15    date_processed TIMESTAMPTZ,
16    permissions_data TEXT,
17    record_locator TEXT,
18    category_depth INTEGER,
19    parent_id VARCHAR,
20    attached_filename VARCHAR,
21    filetype VARCHAR,
22    last_modified TIMESTAMPTZ,
23    file_directory VARCHAR,
24    filename VARCHAR,
25    languages VARCHAR [],
26    page_number VARCHAR,
27    links TEXT,
28    page_name VARCHAR,
29    link_urls VARCHAR [],
30    link_texts VARCHAR [],
31    sent_from VARCHAR [],
32    sent_to VARCHAR [],
33    subject VARCHAR,
34    section VARCHAR,
35    header_footer_type VARCHAR,
36    emphasized_text_contents VARCHAR [],
37    emphasized_text_tags VARCHAR [],
38    text_as_html TEXT,
39    regex_metadata TEXT,
40    detection_class_prob DECIMAL
41);