Full Installation

Basic Usage

For a complete set of extras catering to every document type, use:

pip install "unstructured[all-docs]"

Installation for Specific Document Types

If you’re processing document types beyond the basics, you can install the necessary extras:

pip install "unstructured[docx,pptx]"

Available document types:

"csv", "doc", "docx", "epub", "image", "md", "msg", "odt", "org", "pdf", "ppt", "pptx", "rtf", "rst", "tsv", "xlsx"

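If you assemble the install command programmatically (for example in a setup script), the extras string can be built from a list of document types. The sketch below is illustrative only: build_requirement and DOC_TYPE_EXTRAS are hypothetical names, not part of unstructured, and the set simply mirrors the document types listed above.

```python
# Document types listed above, treated here as the set of valid extras.
DOC_TYPE_EXTRAS = {
    "csv", "doc", "docx", "epub", "image", "md", "msg", "odt",
    "org", "pdf", "ppt", "pptx", "rtf", "rst", "tsv", "xlsx",
}


def build_requirement(doc_types):
    """Return a pip requirement string such as 'unstructured[docx,pptx]'."""
    unknown = sorted(set(doc_types) - DOC_TYPE_EXTRAS)
    if unknown:
        raise ValueError(f"Unknown document types: {', '.join(unknown)}")
    # Keep the caller's order while dropping duplicates.
    ordered = list(dict.fromkeys(doc_types))
    if not ordered:
        return "unstructured"
    return f"unstructured[{','.join(ordered)}]"


print(build_requirement(["docx", "pptx"]))  # unstructured[docx,pptx]
```

The resulting string can be passed to pip install, quoted so that the shell does not interpret the square brackets.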

Installation for Specific Data Connectors

To use any of the data connectors, you must install the specific dependency:

pip install "unstructured[s3]"

Available data connectors:

"airtable", "azure", "azure-cognitive-search", "biomed", "box", "confluence", "delta-table", "discord", "dropbox", "elasticsearch", "gcs", "github", "gitlab", "google-drive", "jira", "mongodb", "notion", "opensearch", "onedrive", "outlook", "reddit", "s3", "sharepoint", "salesforce", "slack", "wikipedia"

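To see which extras your installed copy of unstructured actually declares, you can read the package metadata with the standard library. This is a minimal sketch: declared_extras is a hypothetical helper, and it only reports what the installed distribution advertises, which may differ from the list above depending on the version.

```python
from importlib.metadata import PackageNotFoundError, metadata


def declared_extras(dist_name="unstructured"):
    """Return the sorted extras an installed distribution declares.

    Returns an empty list if the distribution is not installed.
    """
    try:
        extras = metadata(dist_name).get_all("Provides-Extra")
    except PackageNotFoundError:
        return []
    return sorted(extras or [])


extras = declared_extras()
print("s3 extra declared:", "s3" in extras)
```
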
Installation with conda on Windows

You can install and run unstructured on Windows with conda, but the process involves a few extra steps. This section will help you get up and running.

  • Install Anaconda on your Windows machine.

  • Install Microsoft C++ Build Tools using the instructions in this Stack Overflow post. The C++ build tools are required for the pycocotools dependency.

  • Run conda env create -f environment.yml using the environment.yml file in the unstructured repo to create a virtual environment. The environment will be named unstructured.

  • Run conda activate unstructured to activate the virtual environment.

  • Run pip install unstructured to install the unstructured library.

Setting up unstructured for local inference

If you need to run model inference locally, a few additional steps are required. The main challenge is installing detectron2 for PDF layout parsing. detectron2 does not officially support Windows, but it can be installed there. The installation steps below are based on the instructions LayoutParser provides here.

  • Run pip install pycocotools-windows to install a Windows-compatible version of pycocotools. Alternatively, you can run pip3 install "git+https://github.com/philferriere/cocoapi.git#egg=pycocotools&subdirectory=PythonAPI" as outlined in this GitHub issue.

  • Run git clone https://github.com/ivanpp/detectron2.git, then cd detectron2, then pip install -e . to install a Windows-compatible version of the detectron2 library.

  • Install a Windows-compatible version of iopath using the instructions outlined in this GitHub issue. First, run git clone https://github.com/facebookresearch/iopath --single-branch --branch v0.1.8. Then, on line 753 of iopath/iopath/common/file_io.py, change filename = path.split("/")[-1] to filename = parsed_url.path.split("/")[-1]. After that, navigate to the iopath directory and run pip install -e . to install it.

  • Run pip install "unstructured[local-inference]". This installs the unstructured_inference dependency.

At this point, you can verify the installation by running the following from the root directory of the unstructured repo:

from unstructured.partition.pdf import partition_pdf

partition_pdf("example-docs/layout-parser-paper-fast.pdf", url=None)
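
If the check above fails with an import error, it can help to confirm which of the packages from the previous steps are importable. The sketch below uses only the standard library. The import names in LOCAL_INFERENCE_MODULES are assumptions based on the package names used in the steps above, and missing_modules is a hypothetical helper.

```python
from importlib.util import find_spec

# Import names assumed for the packages installed in the steps above.
LOCAL_INFERENCE_MODULES = [
    "pycocotools",
    "detectron2",
    "iopath",
    "unstructured_inference",
]


def missing_modules(names):
    """Return the top-level module names that cannot be located."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(LOCAL_INFERENCE_MODULES)
if missing:
    print("Not importable:", ", ".join(missing))
else:
    print("All local-inference modules were found.")
```

This only checks that each module can be located. It does not verify that the installed versions are compatible with each other.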

Installing PaddleOCR

PaddleOCR is another package that is helpful to use in conjunction with unstructured. You can use the following steps to install paddleocr in your unstructured conda environment.

  • Run conda install -c esri paddleocr

  • If you have the Windows version of detectron2 cloned and installed locally, change the name of detectron2/tools to detectron2/detectron2_tools. Otherwise, you will hit the module name conflict error described in this issue.

  • Set the environment variable KMP_DUPLICATE_LIB_OK to "TRUE". This prevents the libiomp5md.dll linking issue described in this issue on GitHub.
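
One way to set the variable from the last step is inside the Python process itself, before paddleocr is imported. This is a sketch of that approach. It affects only the current process, so setting the variable in your shell or system settings is the alternative if you want it to persist.

```python
import os

# Set this before importing paddleocr (or any other package that loads
# the OpenMP runtime); setting it after the import may have no effect.
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"

print(os.environ["KMP_DUPLICATE_LIB_OK"])  # TRUE
```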

At this point, you can verify the installation using the following commands. Choose a .jpg image that contains text.

import numpy as np
from PIL import Image
from paddleocr import PaddleOCR

filename = "path/to/my/image.jpg"
img = np.array(Image.open(filename))
ocr = PaddleOCR(lang="en", use_gpu=False, show_log=False)
result = ocr.ocr(img=img)

Logging

You can set the logging level for the package with the LOG_LEVEL environment variable. By default, the log level is set to WARNING. For debugging, consider setting the log level to INFO or DEBUG.
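
As an illustration of how a LOG_LEVEL-style variable typically maps to Python logging levels, the sketch below resolves the variable with a WARNING default. level_from_env is a hypothetical helper written for this example and is not the library's own implementation. To affect unstructured itself, set LOG_LEVEL in the environment before starting Python.

```python
import logging
import os


def level_from_env(environ=os.environ, default="WARNING"):
    """Map a LOG_LEVEL-style variable to a logging constant (illustrative)."""
    name = environ.get("LOG_LEVEL", default).upper()
    level = logging.getLevelName(name)
    # getLevelName returns a string such as "Level FOO" for unknown names.
    return level if isinstance(level, int) else logging.WARNING


print(level_from_env({"LOG_LEVEL": "debug"}) == logging.DEBUG)  # True
print(level_from_env({}) == logging.WARNING)  # True
```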

Extra Dependencies

Filetype Detection

The filetype module in unstructured uses libmagic to detect MIME types. For this to work, you’ll need libmagic installed on your computer. On a Mac, you can run:

$ brew install libmagic

On Debian, run:

$ sudo apt-get install -y libmagic-dev

If you are on Windows using conda, run:

$ conda install -c conda-forge libmagic
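
After installing, a quick way to check whether the system loader can locate libmagic is the standard library's ctypes.util.find_library. This is a rough sketch with a hypothetical helper name, libmagic_path. A result of None does not necessarily mean libmagic is absent, because find_library may not search every location (for example, some conda environment paths).

```python
from ctypes.util import find_library


def libmagic_path():
    """Return the library name or path for libmagic, or None if not found."""
    return find_library("magic")


found = libmagic_path()
if found:
    print("libmagic found:", found)
else:
    print("libmagic was not found on the default library search path.")
```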

XML/HTML Dependencies

For XML and HTML parsing, you’ll need libxml2 and libxslt installed. On a Mac, you can do that with:

$ brew install libxml2
$ brew install libxslt

Huggingface Dependencies

The transformers library requires the Rust compiler to be present on your system in order to pip install properly. If a Rust compiler is not available on your system, you can run the following command to install it:

$ curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

Additionally, some tokenizers in the transformers library require the sentencepiece library. This is not included as an unstructured dependency because it only applies to some tokenizers. See the sentencepiece install instructions for information on how to install sentencepiece if your tokenizer requires it.
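
To check both prerequisites from Python, the sketch below looks for a rustc executable on PATH and for an importable sentencepiece module. huggingface_prereqs is a hypothetical helper written for this example. It reports only whether each item can be found, not whether its version is suitable.

```python
import shutil
from importlib.util import find_spec


def huggingface_prereqs():
    """Report whether rustc is on PATH and sentencepiece is importable."""
    return {
        "rustc": shutil.which("rustc") is not None,
        "sentencepiece": find_spec("sentencepiece") is not None,
    }


for name, available in huggingface_prereqs().items():
    print(f"{name}: {'found' if available else 'not found'}")
```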

Note on Older Versions

For unstructured versions earlier than 0.9.0, the following installation pattern was recommended:

pip install "unstructured[local-inference]"

While “local-inference” remains supported in newer versions for backward compatibility, it might be deprecated in future releases. It’s advisable to transition to the “all-docs” extra for comprehensive support.