Amazon Textract is a machine learning (ML) service that automatically extracts text, handwriting, and data from scanned documents.
It goes beyond simple optical character recognition (OCR) to identify, understand, and extract data from forms and tables. Today, many companies extract data from scanned documents such as PDFs, images, tables, and forms manually, or through simple OCR software that requires manual configuration (which often must be updated when the form changes). To overcome these manual and expensive processes, Textract uses ML to read and process any type of document, accurately extracting text, handwriting, tables, and other data with no manual effort.
Textract supports JPEG, PNG, PDF, and TIFF file formats; more information is available in the documentation.
The following examples demonstrate the use of Amazon Textract in combination with LangChain as a DocumentLoader.
Example 1: Loading from a local file
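As a minimal sketch of the local-file case, the block below passes a file path to `AmazonTextractPDFLoader` and calls `load()`. It assumes the `langchain-community` package and its Textract dependencies are installed and that AWS credentials are configured; the function name `load_single_page_image` and the path in the usage comment are hypothetical.

```python
def load_single_page_image(path: str):
    """Send a one-page local file to Amazon Textract and return LangChain Documents."""
    # Imported inside the function so that defining it needs no third-party packages.
    from langchain_community.document_loaders import AmazonTextractPDFLoader

    loader = AmazonTextractPDFLoader(path)
    return loader.load()


# Hypothetical usage; requires AWS credentials and a real one-page file:
# documents = load_single_page_image("example_data/sample.jpeg")
# print(documents[0].page_content)
```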
The first example uses a local file, which is sent internally to the Amazon Textract synchronous API, DetectDocumentText. Local files and URL endpoints such as HTTP:// are limited to single-page documents; multi-page documents must reside on S3. This sample file is a JPEG.
Example 2: Loading from a URL
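A rough sketch of loading from a URL follows. The scheme check is an added guard for illustration, not part of the loader, and the function name `load_from_url` and the URL in the usage comment are placeholders. Running it for real assumes `langchain-community`, its Textract dependencies, and AWS credentials are available.

```python
from urllib.parse import urlparse


def load_from_url(url: str):
    """Load a single-page document from an HTTP(S) endpoint through Amazon Textract."""
    # Illustrative guard (an assumption of this sketch): accept only http/https URLs.
    scheme = urlparse(url).scheme.lower()
    if scheme not in ("http", "https"):
        raise ValueError(f"expected an http(s) URL, got: {url!r}")

    # Imported lazily so the guard above can run without third-party packages.
    from langchain_community.document_loaders import AmazonTextractPDFLoader

    return AmazonTextractPDFLoader(url).load()


# Hypothetical usage; the URL is a placeholder for a one-page image or PDF:
# documents = load_from_url("https://example.com/sample.jpg")
```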
The next example loads a file from an HTTPS endpoint. The file must be a single page, because Amazon Textract requires all multi-page documents to be stored on S3.
Example 3: Loading multi-page PDF documents
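The multi-page case might look like the sketch below: the document lives on S3, and a boto3 Textract client created in the bucket's region is handed to the loader through its `client` argument. The function name `load_multipage_from_s3`, the `s3://` prefix check, and the bucket in the usage comment are assumptions made for illustration.

```python
def load_multipage_from_s3(s3_uri: str, region_name: str):
    """Load a multi-page document from S3 with a Textract client pinned to one region."""
    # Illustrative guard (an assumption of this sketch): multi-page input must be on S3.
    if not s3_uri.startswith("s3://"):
        raise ValueError(f"multi-page documents must be on S3, got: {s3_uri!r}")

    # Imported lazily so the guard above can run without third-party packages.
    import boto3
    from langchain_community.document_loaders import AmazonTextractPDFLoader

    # The client's region should match the region of the bucket that holds the document.
    textract_client = boto3.client("textract", region_name=region_name)
    loader = AmazonTextractPDFLoader(s3_uri, client=textract_client)
    return loader.load()


# Hypothetical usage; bucket and key are placeholders:
# documents = load_multipage_from_s3("s3://my-bucket/my-document.pdf", "us-east-2")
# print(len(documents))
```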
Processing a multi-page document requires the document to be on S3. The sample document resides in a bucket in us-east-2, and Textract must be called in that same region to succeed, so we set region_name on the client and pass the client to the loader to ensure Textract is called from us-east-2. Alternatively, you could run your notebook in us-east-2, set the AWS_DEFAULT_REGION environment variable to us-east-2, or, when running in a different environment, pass in a boto3 Textract client with that region name, as this example does.
Example 4: Customizing the output format
When Amazon Textract processes a PDF, it extracts all text, including elements like headers, footers, and page numbers. This extra information can be “noisy” and reduce the effectiveness of the output. The process of converting a document’s 2D layout into a clean, one-dimensional string of text is called linearization. The AmazonTextractPDFLoader gives you precise control over this process with the linearization_config parameter. You can use it to specify which elements to exclude from the final output.
The following example shows how to hide headers, footers, and figures, resulting in a much cleaner text block. For more advanced use cases, see this AWS blog post.
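A minimal sketch of such a configuration is given below. It assumes the `amazon-textract-textractor` package, which provides `TextLinearizationConfig`, is installed alongside `langchain-community`, and that AWS credentials are configured. The names `LINEARIZATION_OPTIONS` and `load_without_page_furniture` and the path in the usage comment are hypothetical.

```python
# Layout elements to drop from the linearized text (field names of TextLinearizationConfig).
LINEARIZATION_OPTIONS = {
    "hide_header_layout": True,
    "hide_footer_layout": True,
    "hide_figure_layout": True,
}


def load_without_page_furniture(path: str):
    """Load a document while hiding headers, footers, and figures in the output text."""
    # Imported lazily so that defining this function needs no third-party packages.
    from langchain_community.document_loaders import AmazonTextractPDFLoader
    from textractor.data.text_linearization_config import TextLinearizationConfig

    loader = AmazonTextractPDFLoader(
        path,
        linearization_config=TextLinearizationConfig(**LINEARIZATION_OPTIONS),
    )
    return loader.load()


# Hypothetical usage; the S3 path is a placeholder:
# documents = load_without_page_furniture("s3://my-bucket/my-document.pdf")
# print(documents[0].page_content)
```

Keeping the options in a plain dictionary makes it easy to toggle individual elements and compare the resulting text.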