Installation
Quick Start
Use the following instructions to get up and running with unstructured and test your installation.
- Install the Python SDK with
pip install "unstructured[local-inference]"
If you do not need to process PDFs or images, you can run
pip install unstructured
- Install the following system dependencies if they are not already available on your system. Depending on what document types you’re parsing, you may not need all of these.
  - libmagic-dev (filetype detection)
  - poppler-utils (images and PDFs)
  - tesseract-ocr (images and PDFs)
  - libreoffice (MS Office docs)
  - pandoc (EPUBs, RTFs and Open Office docs)
- Follow the instructions here to install detectron2. This is required if you would like to use custom models from the LayoutParser Model Zoo.
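To see which of the command-line tools from the system dependencies above are already available, you can use a small check like the following. This is a minimal sketch: the executable names (pdftoppm for poppler-utils, soffice for libreoffice) are assumptions chosen for illustration, and the helper find_missing_tools is hypothetical, not part of unstructured. libmagic is a shared library rather than an executable, so it is not covered here.

```python
import shutil

# Representative executables for some of the packages listed above. These
# names are assumptions for illustration; your distribution may differ.
REQUIRED_TOOLS = {
    "poppler-utils": "pdftoppm",
    "tesseract-ocr": "tesseract",
    "libreoffice": "soffice",
    "pandoc": "pandoc",
}


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the package names whose executable is not found on PATH."""
    return [pkg for pkg, exe in tools.items() if shutil.which(exe) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All listed tools were found on PATH.")
```

A missing entry is only a problem if you parse the document types that need it.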
At this point, you should be able to run the following code:
from unstructured.partition.auto import partition
elements = partition(filename="example-docs/fake-email.eml")
And if you installed with local-inference, you should be able to run this as well:
from unstructured.partition.auto import partition
elements = partition("example-docs/layout-parser-paper.pdf")
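Once partition returns, a quick way to confirm that the output looks reasonable is to count the elements by type and read the first few. The helpers below are a hypothetical sketch, not part of the unstructured API. They assume only that elements is a list and that str() on an element gives its text.

```python
from collections import Counter


def count_element_types(elements):
    """Count elements by class name (for example, Title or NarrativeText)."""
    return Counter(type(element).__name__ for element in elements)


def preview(elements, limit=5):
    """Return the text of the first few elements, one string per element."""
    return [str(element) for element in elements[:limit]]


# Usage, with the elements list produced by the calls above:
#   print(count_element_types(elements))
#   print("\n".join(preview(elements)))
```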
Installation with conda on Windows
You can install and run unstructured on Windows with conda, but the process involves a few extra steps. This section will help you get up and running.
- Install Anaconda on your Windows machine.
- Install Microsoft C++ Build Tools using the instructions in this Stack Overflow post. C++ build tools are required for the pycocotools dependency.
- Run conda env create -f environment.yml using the environment.yml file in the unstructured repo to create a virtual environment. The environment will be named unstructured.
- Run conda activate unstructured to activate the virtual environment.
- Run pip install unstructured to install the unstructured library.
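If you are unsure whether the environment was activated before you ran pip, you can check from Python. This sketch relies on the CONDA_DEFAULT_ENV variable, which conda sets when an environment is activated. The helper name active_conda_env is hypothetical.

```python
import os


def active_conda_env(environ=os.environ):
    """Return the name of the active conda environment, or None if unset."""
    return environ.get("CONDA_DEFAULT_ENV")


if __name__ == "__main__":
    env = active_conda_env()
    if env == "unstructured":
        print("The unstructured environment is active.")
    else:
        print(f"Active conda environment: {env!r}. Run 'conda activate unstructured' first.")
```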
Setting up unstructured for local inference
If you need to run model inference locally, there are a few additional steps you need to take. The main challenge is installing detectron2 for PDF layout parsing. detectron2 does not officially support Windows, but it is possible to install it there. The installation instructions are based on the instructions LayoutParser provides here.
- Run pip install pycocotools-windows to install a Windows compatible version of pycocotools. Alternatively, you can run pip3 install "git+https://github.com/philferriere/cocoapi.git#egg=pycocotools&subdirectory=PythonAPI" as outlined in this GitHub issue.
- Run git clone https://github.com/ivanpp/detectron2.git, then cd detectron2, then pip install -e . to install a Windows compatible version of the detectron2 library.
- Install a Windows compatible version of iopath using the instructions outlined in this GitHub issue. First, run git clone https://github.com/facebookresearch/iopath --single-branch --branch v0.1.8. Then on line 753 in iopath/iopath/common/file_io.py change filename = path.split("/")[-1] to filename = parsed_url.path.split("/")[-1]. After that, navigate to the iopath directory and run pip install -e . to install it.
- Run pip install unstructured[local-inference]. This will install the unstructured_inference dependency.
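The iopath change in the steps above is easier to follow with a concrete value. The sketch below compares the two expressions on a hypothetical download URL. It assumes that the value being split is a full URL with a query string, which is a reading of the patch rather than something confirmed from the iopath source. A file name containing "?" is not valid on Windows, which would explain why the original expression fails there.

```python
from urllib.parse import urlparse

# Hypothetical download URL, used only to illustrate the difference.
url = "https://example.com/models/config.yml?dl=1"

# Original expression: splits the raw string, so the query string stays attached.
naive_filename = url.split("/")[-1]

# Patched expression: parse the URL first, then split only its path component.
parsed_url = urlparse(url)
patched_filename = parsed_url.path.split("/")[-1]

print(naive_filename)    # config.yml?dl=1
print(patched_filename)  # config.yml
```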
At this point, you can verify the installation by running the following from the root directory of the unstructured repo:
from unstructured.partition.pdf import partition_pdf
partition_pdf("example-docs/layout-parser-paper-fast.pdf", url=None)
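If the call above fails with an import error, it can help to check which of the packages from the previous steps Python can actually find. This is a minimal sketch. The module names are the import names of the packages installed above, and the helper missing_modules is hypothetical.

```python
from importlib.util import find_spec

# Import names of the packages installed in the steps above.
LOCAL_INFERENCE_MODULES = [
    "pycocotools",
    "detectron2",
    "iopath",
    "unstructured_inference",
]


def missing_modules(names=LOCAL_INFERENCE_MODULES):
    """Return the module names that cannot be located in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    print("Missing modules:", ", ".join(missing) if missing else "none")
```

This only confirms that the modules can be located. It does not confirm that they import and run correctly.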
Installing PaddleOCR
PaddleOCR is another package that is helpful to use in conjunction with unstructured. You can use the following steps to install paddleocr in your unstructured conda environment.
- Run conda install -c esri paddleocr
- If you have the Windows version of detectron2 cloned and installed locally, change the name of detectron2/tools to detectron2/detectron2_tools. Otherwise, you will hit the module name conflict error described in this issue.
- Set the environment variable KMP_DUPLICATE_LIB_OK to "TRUE". This prevents the libiomp5md.dll linking issue described in this issue on GitHub.
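If you would rather not set KMP_DUPLICATE_LIB_OK system-wide, one option is to set it for a single Python process. The sketch below assumes the variable has to be in place before paddleocr and its dependencies are imported, so the assignment comes first.

```python
import os

# Set the flag before importing paddleocr, so that it is already in place
# when the OpenMP runtime is loaded.
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"

# Import paddleocr only after the variable is set, for example:
# from paddleocr import PaddleOCR
```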
At this point, you can verify the installation using the following commands. Choose a .jpg image that contains text.
import numpy as np
from PIL import Image
from paddleocr import PaddleOCR
filename = "path/to/my/image.jpg"
img = np.array(Image.open(filename))
ocr = PaddleOCR(lang="en", use_gpu=False, show_log=False)
result = ocr.ocr(img=img)
Logging
You can set the logging level for the package with the LOG_LEVEL environment variable. By default, the log level is set to WARNING. For debugging, consider setting the log level to INFO or DEBUG.
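The sketch below shows how an environment variable like LOG_LEVEL can be mapped to a level from Python's standard logging module, falling back to WARNING when the variable is unset or unrecognized. It illustrates the behavior described above and is not the package's actual implementation. The helper resolve_log_level is hypothetical.

```python
import logging
import os


def resolve_log_level(environ=os.environ, default="WARNING"):
    """Map the LOG_LEVEL variable to a logging level, defaulting to WARNING."""
    name = environ.get("LOG_LEVEL", default).upper()
    level = getattr(logging, name, None)
    # Unknown names fall back to WARNING rather than raising an error.
    return level if isinstance(level, int) else logging.WARNING


if __name__ == "__main__":
    logging.basicConfig(level=resolve_log_level())
    logging.getLogger(__name__).info("Shown only when LOG_LEVEL is INFO or DEBUG")
```

To change the level for one script, you can set os.environ["LOG_LEVEL"] = "DEBUG" at the top of the script. This assumes the variable is read when unstructured is imported, so set it before the import.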
Extra Dependencies
Filetype Detection
The filetype module in unstructured uses libmagic to detect MIME types. For this to work, you’ll need libmagic installed on your computer. On a Mac, you can run:
$ brew install libmagic
On Debian, run:
$ sudo apt-get install -y libmagic-dev
If you are on Windows using conda, run:
$ conda install -c conda-forge libmagic
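After installing, you can check whether Python is able to locate the libmagic shared library. This is a rough sketch using the standard library's ctypes.util.find_library. A result of None does not always mean the library is missing, because the lookup depends on the platform and on where the library was installed (a conda environment on Windows, for example). The helper libmagic_path is hypothetical.

```python
from ctypes.util import find_library


def libmagic_path():
    """Return the name or path of the libmagic library if found, else None."""
    return find_library("magic")


if __name__ == "__main__":
    found = libmagic_path()
    print("libmagic:", found if found else "not found by find_library")
```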
XML/HTML Dependencies
For XML and HTML parsing, you’ll need libxml2 and libxslt installed. On a Mac, you can do that with:
$ brew install libxml2
$ brew install libxslt
Huggingface Dependencies
The transformers library requires the Rust compiler to be present on your system in order to pip install properly. If a Rust compiler is not available on your system, you can run the following command to install it:
$ curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
Additionally, some tokenizers in the transformers library require the sentencepiece library. This is not included as an unstructured dependency because it only applies to some tokenizers. See the sentencepiece install instructions for information on how to install sentencepiece if your tokenizer requires it.