SQL
NOTE: At the moment, the connector only supports PostgreSQL and SQLite. Vectors can be stored and searched in PostgreSQL with pgvector.
Batch process all your records using unstructured-ingest
to store structured outputs locally on your filesystem and upload those local files to a PostgreSQL or SQLite schema.
Insert query is currently limited to append.
First you’ll need to install the sql dependencies as shown here if you are using PostgreSQL.
pip install "unstructured[postgres]"
Run Locally
The upstream connector can be any of the ones supported, but for convenience here, showing a sample command using the upstream local connector.
#!/usr/bin/env bash
unstructured-ingest \
local \
--input-path example-docs/fake-memo.pdf \
--anonymous \
--output-dir local-output-to-mongo \
--num-processes 2 \
--verbose \
--strategy fast \
sql \
--db-type postgresql \
--username postgres \
--password test \
--host localhost \
--port 5432 \
--database elements
import os
from unstructured.ingest.interfaces import PartitionConfig, ProcessorConfig, ReadConfig
from unstructured.ingest.runner import LocalRunner
if __name__ == "__main__":
runner = LocalRunner(
processor_config=ProcessorConfig(
verbose=True,
output_dir="local-output-to-postgres",
num_processes=2,
),
read_config=ReadConfig(),
partition_config=PartitionConfig(),
writer_type="sql",
writer_kwargs={
"db_type": os.getenv("DB_TYPE"),
"username": os.getenv("USERNAME"),
"password": os.getenv("DB_PASSWORD"),
"host": os.getenv("DB_HOST"),
"port": os.getenv("DB_PORT"),
"database": os.getenv("DB_DATABASE"),
},
)
runner.run(
input_path="example-docs/fake-memo.pdf",
)
For a full list of the options the CLI accepts check unstructured-ingest <upstream connector> sql --help
.
NOTE: Keep in mind that you will need to have all the appropriate extras and dependencies for the file types of the documents contained in your data storage platform if you’re running this locally. You can find more information about this in the installation guide.
Sample Index Schema
To make sure the schema of the index matches the data being written to it, a sample schema json can be used.
1CREATE TABLE elements (
2 id UUID PRIMARY KEY,
3 element_id VARCHAR,
4 text TEXT,
5 embeddings DECIMAL [],
6 type VARCHAR,
7 system VARCHAR,
8 layout_width DECIMAL,
9 layout_height DECIMAL,
10 points TEXT,
11 url TEXT,
12 version VARCHAR,
13 date_created TIMESTAMPTZ,
14 date_modified TIMESTAMPTZ,
15 date_processed TIMESTAMPTZ,
16 permissions_data TEXT,
17 record_locator TEXT,
18 category_depth INTEGER,
19 parent_id VARCHAR,
20 attached_filename VARCHAR,
21 filetype VARCHAR,
22 last_modified TIMESTAMPTZ,
23 file_directory VARCHAR,
24 filename VARCHAR,
25 languages VARCHAR [],
26 page_number VARCHAR,
27 links TEXT,
28 page_name VARCHAR,
29 link_urls VARCHAR [],
30 link_texts VARCHAR [],
31 sent_from VARCHAR [],
32 sent_to VARCHAR [],
33 subject VARCHAR,
34 section VARCHAR,
35 header_footer_type VARCHAR,
36 emphasized_text_contents VARCHAR [],
37 emphasized_text_tags VARCHAR [],
38 text_as_html TEXT,
39 regex_metadata TEXT,
40 detection_class_prob DECIMAL
41);
1CREATE EXTENSION vector;
2
3CREATE TABLE elements (
4 id UUID PRIMARY KEY,
5 element_id VARCHAR,
6 text TEXT,
7 embeddings vector(384),
8 type VARCHAR,
9 system VARCHAR,
10 layout_width DECIMAL,
11 layout_height DECIMAL,
12 points TEXT,
13 url TEXT,
14 version VARCHAR,
15 date_created TIMESTAMPTZ,
16 date_modified TIMESTAMPTZ,
17 date_processed TIMESTAMPTZ,
18 permissions_data TEXT,
19 record_locator TEXT,
20 category_depth INTEGER,
21 parent_id VARCHAR,
22 attached_filename VARCHAR,
23 filetype VARCHAR,
24 last_modified TIMESTAMPTZ,
25 file_directory VARCHAR,
26 filename VARCHAR,
27 languages VARCHAR [],
28 page_number VARCHAR,
29 links TEXT,
30 page_name VARCHAR,
31 link_urls VARCHAR [],
32 link_texts VARCHAR [],
33 sent_from VARCHAR [],
34 sent_to VARCHAR [],
35 subject VARCHAR,
36 section VARCHAR,
37 header_footer_type VARCHAR,
38 emphasized_text_contents VARCHAR [],
39 emphasized_text_tags VARCHAR [],
40 text_as_html TEXT,
41 regex_metadata TEXT,
42 detection_class_prob DECIMAL
43);
1CREATE TABLE elements (
2 id TEXT PRIMARY KEY,
3 element_id TEXT,
4 text TEXT,
5 embeddings TEXT,
6 type TEXT,
7 system TEXT,
8 layout_width REAL,
9 layout_height REAL,
10 points TEXT,
11 url TEXT,
12 version TEXT,
13 date_created TEXT,
14 date_modified TEXT,
15 date_processed TEXT,
16 permissions_data TEXT,
17 record_locator TEXT,
18 category_depth INTEGER,
19 parent_id TEXT,
20 attached_filename TEXT,
21 filetype TEXT,
22 last_modified TEXT,
23 file_directory TEXT,
24 filename TEXT,
25 languages TEXT,
26 page_number TEXT,
27 links TEXT,
28 page_name TEXT,
29 link_urls TEXT,
30 link_texts TEXT,
31 sent_from TEXT,
32 sent_to TEXT,
33 subject TEXT,
34 section TEXT,
35 header_footer_type TEXT,
36 emphasized_text_contents TEXT,
37 emphasized_text_tags TEXT,
38 text_as_html TEXT,
39 regex_metadata TEXT,
40 detection_class_prob DECIMAL
41);