The unstructured package from Unstructured.IO extracts clean text from raw source documents like PDFs and Word documents. This page covers how to use the unstructured ecosystem within LangChain.
Installation and Setup
If you are using a loader that runs locally, use the following steps to get unstructured and its dependencies running.
- For the smallest installation footprint and to take advantage of features not available in the open-source unstructured package, install the Python SDK with pip install unstructured-client along with pip install langchain-unstructured to use the UnstructuredLoader and partition remotely against the Unstructured API. This loader lives in a LangChain partner repo instead of the langchain-community repo, and you will need an api_key, which you can generate for free here.
  - Unstructured’s documentation for the SDK can be found here: https://docs.unstructured.io/api-reference/api-services/sdk
- To run everything locally, install the open-source Python package with pip install unstructured along with pip install langchain-community and use the same UnstructuredLoader as mentioned above.
  - You can install document-specific dependencies with extras, e.g. pip install "unstructured[docx]". Learn more about extras here.
  - To install the dependencies for all document types, use pip install "unstructured[all-docs]".
- Install the following system dependencies if they are not already available on your system, e.g. with brew install on Mac. Depending on what document types you’re parsing, you may not need all of these.
  - libmagic-dev (filetype detection)
  - poppler-utils (images and PDFs)
  - tesseract-ocr (images and PDFs)
  - qpdf (PDFs)
  - libreoffice (MS Office docs)
  - pandoc (EPUBs)
- When running locally, Unstructured also recommends using Docker by following this guide to ensure all system dependencies are installed correctly.
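As a rough way to see which of the system dependencies above are already present, the following sketch looks for one executable per package on PATH. The package-to-executable mapping is an assumption made for illustration (for example, poppler-utils is checked via pdftotext and libreoffice via soffice), and libmagic-dev is left out because it is a library rather than a command.

```python
import shutil
from typing import Callable, Dict, List, Optional

# Assumed mapping from the packages listed above to an executable that each
# one typically installs. libmagic-dev is a library, not a command, so it is
# not covered by this check.
EXECUTABLES: Dict[str, str] = {
    "poppler-utils": "pdftotext",
    "tesseract-ocr": "tesseract",
    "qpdf": "qpdf",
    "libreoffice": "soffice",
    "pandoc": "pandoc",
}


def missing_dependencies(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the package names whose executable is not found on PATH."""
    return [pkg for pkg, exe in EXECUTABLES.items() if which(exe) is None]


if __name__ == "__main__":
    missing = missing_dependencies()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All checked executables were found.")
```

The which parameter exists only so the lookup can be swapped out in tests; by default it uses shutil.which. A package reported as missing may still be unnecessary for the document types you parse.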
Data Loaders
The primary usage of Unstructured is in data loaders.
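A minimal sketch of how the two installation paths might map onto UnstructuredLoader from langchain-unstructured: with an API key, partitioning is sent to the Unstructured API, and without one the locally installed unstructured package does the work. The helper names loader_kwargs and load_documents are hypothetical and exist only for this illustration.

```python
from typing import Any, Dict, Optional


def loader_kwargs(file_path: str, api_key: Optional[str] = None) -> Dict[str, Any]:
    """Build keyword arguments for UnstructuredLoader (hypothetical helper).

    With an API key, partitioning is requested via the Unstructured API;
    without one, only the file path is passed and partitioning runs locally.
    """
    kwargs: Dict[str, Any] = {"file_path": file_path}
    if api_key:
        kwargs["api_key"] = api_key
        kwargs["partition_via_api"] = True
    return kwargs


def load_documents(file_path: str, api_key: Optional[str] = None):
    """Load a file into LangChain documents (hypothetical helper)."""
    # Imported lazily so this sketch can be imported without the package.
    from langchain_unstructured import UnstructuredLoader

    loader = UnstructuredLoader(**loader_kwargs(file_path, api_key))
    return loader.load()
```

Calling load_documents with a path to a real file returns a list of LangChain Document objects. Running it requires langchain-unstructured plus either a valid API key or a local unstructured installation.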
UnstructuredLoader
See a usage example for how to use this loader to partition both locally and remotely with the serverless Unstructured API.
UnstructuredCHMLoader
CHM means Microsoft Compiled HTML Help.
UnstructuredCSVLoader
A comma-separated values (CSV) file is a delimited text file that uses a comma to separate values. Each line of the file is a data record. Each record consists of one or more fields, separated by commas.
See a usage example.
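To make the record and field structure concrete, here is a small sketch that parses delimited text with Python's standard csv module. It does not use the Unstructured loader, and the sample data and the read_records helper are made up for illustration.

```python
import csv
import io
from typing import List

# Made-up sample: one header line followed by two data records.
SAMPLE = "name,format\nreport,pdf\nnotes,docx\n"


def read_records(text: str, delimiter: str = ",") -> List[List[str]]:
    """Parse delimited text into a list of records, each a list of fields."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


records = read_records(SAMPLE)
for record in records:
    print(record)
```

Passing delimiter="\t" parses tab-separated text the same way.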
UnstructuredEmailLoader
See a usage example.
UnstructuredEPubLoader
EPUB is an e-book file format that uses the “.epub” file extension. The term is short for electronic publication and is sometimes styled ePub. EPUB is supported by many e-readers, and compatible software is available for most smartphones, tablets, and computers.
See a usage example.
UnstructuredExcelLoader
See a usage example.
UnstructuredFileIOLoader
See a usage example.
UnstructuredHTMLLoader
See a usage example.
UnstructuredImageLoader
See a usage example.
UnstructuredMarkdownLoader
See a usage example.
UnstructuredODTLoader
The Open Document Format for Office Applications (ODF), also known as OpenDocument, is an open file format for word processing documents, spreadsheets, presentations, and graphics that uses ZIP-compressed XML files. It was developed with the aim of providing an open, XML-based file format specification for office applications.
See a usage example.
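The "ZIP-compressed XML files" description can be illustrated with Python's standard zipfile module. This sketch builds a tiny in-memory archive laid out like an ODF package and lists its members. The archive is a placeholder for illustration, not a valid ODT document, and both helper names are hypothetical.

```python
import io
import zipfile
from typing import List


def build_demo_archive() -> bytes:
    """Create a tiny in-memory ZIP laid out like an ODF package.

    The contents are placeholders, so this is not a valid document.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("mimetype", "application/vnd.oasis.opendocument.text")
        archive.writestr("content.xml", "<office:document-content/>")
    return buffer.getvalue()


def list_members(data: bytes) -> List[str]:
    """Return the file names stored inside a ZIP archive given as bytes."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return archive.namelist()


print(list_members(build_demo_archive()))
```

Passing the bytes of a real .odt file to list_members shows its XML members in the same way.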
UnstructuredOrgModeLoader
Org Mode is a document editing, formatting, and organizing mode, designed for notes, planning, and authoring within the free software text editor Emacs. See a usage example.
UnstructuredPDFLoader
See a usage example.
UnstructuredPowerPointLoader
See a usage example.
UnstructuredRSTLoader
A reStructured Text (RST) file is a file format for textual data used primarily in the Python programming language community for technical documentation.
See a usage example.
UnstructuredRTFLoader
See a usage example in the API documentation.
UnstructuredTSVLoader
A tab-separated values (TSV) file is a simple, text-based file format for storing tabular data. Records are separated by newlines, and values within a record are separated by tab characters.
See a usage example.