Metadata
The unstructured package tracks a variety of metadata about Elements extracted from documents.
Element metadata has a variety of uses including:
* filtering document elements based on an element metadata value, for example, elements from a given page number or an e-mail with a subject matching a regular expression.
* mapping an element to the document page where it occurred so that original page can be retrieved when that element matches search criteria.
Metadata is tracked at the element level. You can extract the metadata for a given document element with element.metadata. For a dictionary representation, use element.metadata.to_dict().
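As a minimal sketch of the filtering use case, the snippet below works on plain dictionaries shaped like the output of element.to_dict(). The sample records and the helper name elements_on_page are illustrative assumptions, not part of unstructured.

```python
from typing import Any, Dict, List

# Hand-written records shaped like element.to_dict() output (illustrative only).
elements: List[Dict[str, Any]] = [
    {"type": "Title", "text": "Quarterly Report", "metadata": {"page_number": 1}},
    {"type": "NarrativeText", "text": "Revenue grew.", "metadata": {"page_number": 2}},
    {"type": "NarrativeText", "text": "No page info.", "metadata": {}},
]


def elements_on_page(items: List[Dict[str, Any]], page_number: int) -> List[Dict[str, Any]]:
    """Keep elements whose metadata reports the given page number."""
    return [e for e in items if e.get("metadata", {}).get("page_number") == page_number]


page_two = elements_on_page(elements, 2)
print([e["text"] for e in page_two])
```

Using .get() with a default keeps the filter safe for elements whose source file did not provide a page number.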
Common Metadata Fields
All document types return the following metadata fields when the information is available from the source file:
Metadata Field Name | Short Description | Details |
---|---|---|
filename | Filename | |
file_directory | File Directory | |
last_modified | Last Modified Date | |
filetype | File Type | |
coordinates | XY Bounding Box Coordinates | See notes below for further details about the bounding box. |
parent_id | Element Hierarchy (Parent ID) | parent_id may be used to infer where an element resides within the overall hierarchy of a document. For instance, a NarrativeText element may have a Title element as a parent (a "sub-title"), which in turn may have another Title element as its parent (a "title"). |
category_depth | Element Depth relative to other elements of the same category | Category depth is the depth of an element relative to other elements of the same category. It's set by a document partitioner and enables the hierarchy post-processor to compute more accurate hierarchies. Category depth may be set using native document hierarchies, e.g. reflecting <H1>, <H2>, or <H3> tags within an HTML document or the indentation level of a bulleted list item in a Word document. |
text_as_html | HTML representation of extracted tables | Only applicable to table elements. |
languages | Document Languages | At document level or element level. List is ordered by probability of being the primary language of the text. |
emphasized_text_contents | Emphasized text (bold or italic) in the original document | |
emphasized_text_tags | Tags on text that is emphasized in the original document | |
is_continuation | True if element is a continuation of a previous element | Only relevant for chunking, if an element was divided into two due to |
detection_class_prob | Detection model class probabilities | From unstructured-inference, hi-res strategy. |
Notes on common metadata fields:
Metadata for Document Hierarchy: parent_id and category_depth
Parent ID and category depth enhance hierarchy detection so that document structure can be identified across file formats. Parent ID identifies an element's parent, while category depth measures the relative depth of an element within its category. This is especially useful in documents with native hierarchies like HTML or Word files, where elements like headings or list items inherently define structure.
Implementations:
* Parent ID: Introduction of a parent_id metadata field, identifying the parent element.
* Category Depth: Introduction of a category_depth metadata field, identifying the element's depth level.
* Rule set for hierarchy assignment: Elements are processed sequentially against a ruleset, which sets the parent IDs and category depths based on predetermined or custom rules.
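To illustrate how parent_id can be used to recover document structure, here is a minimal sketch that rebuilds an indented outline from element dictionaries. The records are hand-written for the example, and the helper name outline is hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, List, Optional

# Hand-written records shaped like element.to_dict() output (illustrative only).
elements: List[Dict[str, Any]] = [
    {"element_id": "a", "type": "Title", "text": "Guide", "metadata": {}},
    {"element_id": "b", "type": "Title", "text": "Setup", "metadata": {"parent_id": "a"}},
    {"element_id": "c", "type": "NarrativeText", "text": "Install it.", "metadata": {"parent_id": "b"}},
]

# Index elements by their parent; elements without a parent are stored under None.
children: Dict[Optional[str], List[Dict[str, Any]]] = defaultdict(list)
for el in elements:
    children[el["metadata"].get("parent_id")].append(el)


def outline(parent_id: Optional[str] = None, depth: int = 0) -> List[str]:
    """Return the subtree under parent_id as indented lines, in document order."""
    lines: List[str] = []
    for el in children.get(parent_id, []):
        lines.append("  " * depth + el["text"])
        lines.extend(outline(el["element_id"], depth + 1))
    return lines


print("\n".join(outline()))
```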
Hierarchy in DOCX Files
The process of enhancing hierarchy detection by determining category_depth includes:
* Evaluating the paragraph item for an indentation level (ilvl) xpath, commonly found in list bullet/number formats. If an indentation level is present, it is used as the category depth.
* Inspecting the paragraph style name for any indications of category depth, such as differences between 'Heading 1' and 'Heading 2' or various list bullet styles. The detected category depth is used, or it defaults to 0 if none is found.
* The paragraph's indentation level (ilvl) is assessed based on its style name. This involves a detailed lookup beyond the paragraph's metadata, as docx files have predefined ilvls for different style names. This aspect is in development, with the existing methods sufficiently addressing most scenarios.
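The first two steps above can be sketched as a small fallback chain. This is not the partitioner's actual code: the function name docx_category_depth is hypothetical, and the rule that a trailing number N in a style name maps to depth N - 1 is an assumption made for illustration.

```python
import re
from typing import Optional


def docx_category_depth(ilvl: Optional[int], style_name: str) -> int:
    """Hypothetical helper: pick a category depth for one paragraph."""
    # Step 1: an explicit indentation level, when present, is used directly.
    if ilvl is not None:
        return ilvl
    # Step 2 (assumption): a trailing number N in the style name means depth N - 1,
    # so "Heading 1" is depth 0 and "Heading 2" is depth 1.
    match = re.search(r"(\d+)$", style_name.strip())
    if match:
        return max(int(match.group(1)) - 1, 0)
    # Nothing detected: default to 0.
    return 0


print(docx_category_depth(None, "Heading 2"))
```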
Hierarchy in PPTX Files
The enhancement of hierarchy detection in determining category depth is achieved as follows:
* Examining if the paragraph within the python-pptx document has a level parameter. When present, this level is adopted as the category_depth.
* In cases where the evaluated shape is a title and the content is neither a bullet nor an email, the element is designated as a Title. The depth is assigned based on the paragraph's sequence (e.g., the first line in the title shape is depth 0, the second line is depth 1, and so on).
* For situations where the shape is not a title but the paragraph is, the depth is set to the level plus one. This ensures that paragraph titles have a minimum depth of 1, positioning them appropriately under the main slide title element.
coordinates
Some document types support location data for the elements, usually in the form of bounding boxes.
If it exists, an element's location data is available with element.metadata.coordinates.
The coordinates property of an ElementMetadata stores:
* points: These specify the corners of the bounding box starting from the top left corner and proceeding counter-clockwise. The points represent pixels, the origin is in the top left, and the y coordinate increases in the downward direction.
* system: The points have an associated coordinate system. A typical example of a coordinate system is PixelSpace, which is used for representing the coordinates of images. The coordinate system has a name, orientation, layout width, and layout height.
Information about the element's coordinates (including the coordinate system name, coordinate points, the layout width, and the layout height) can be accessed with element.to_dict()["metadata"]["coordinates"].
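As a small worked example of reading the points, the sketch below derives the width and height of a bounding box from its four corners. It uses a plain tuple rather than a real element, and the helper name bounding_box_size is hypothetical.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]


def bounding_box_size(points: Sequence[Point]) -> Tuple[float, float]:
    """Return (width, height) of the box spanned by the corner points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (max(xs) - min(xs), max(ys) - min(ys))


# Corners listed from the top left, proceeding counter-clockwise (y grows downward).
points = ((10, 10), (10, 100), (200, 100), (200, 10))
width, height = bounding_box_size(points)
print(width, height)
```

Taking the min and max of each axis avoids relying on the corner order, so the helper gives the same result if the points arrive in a different sequence.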
The coordinates of an element can be changed to a new coordinate system by using the Element.convert_coordinates_to_new_system method. If the in_place flag is True, the coordinate system and points of the element are updated in place and the new coordinates are returned. If the in_place flag is False, only the altered coordinates are returned.
```python
from unstructured.documents.coordinates import PixelSpace, RelativeCoordinateSystem
from unstructured.documents.elements import Element

# Bounding box corners, starting at the top left and proceeding counter-clockwise.
coordinates = ((10, 10), (10, 100), (200, 100), (200, 10))
coordinate_system = PixelSpace(width=850, height=1100)

element = Element(coordinates=coordinates, coordinate_system=coordinate_system)
print(element.metadata.coordinates.to_dict())
print(element.metadata.coordinates.system.orientation)
print(element.metadata.coordinates.system.width)
print(element.metadata.coordinates.system.height)

element.convert_coordinates_to_new_system(RelativeCoordinateSystem(), in_place=True)

# Should now be in terms of new coordinate system
print(element.metadata.coordinates.to_dict())
print(element.metadata.coordinates.system.orientation)
print(element.metadata.coordinates.system.width)
print(element.metadata.coordinates.system.height)
```
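The arithmetic behind such a conversion can be sketched without the library. The version below assumes the target system is a unit square with its origin in the bottom left and y increasing upward, so the y axis is flipped relative to pixel space. That orientation is an assumption of this sketch, and the helper name pixel_to_relative is hypothetical.

```python
from typing import Tuple


def pixel_to_relative(point: Tuple[float, float], width: float, height: float) -> Tuple[float, float]:
    """Scale a pixel-space point into a unit square and flip the y axis.

    Assumes pixel space has its origin in the top left (y grows downward) and
    the relative space has its origin in the bottom left (y grows upward).
    """
    x, y = point
    return (x / width, 1 - y / height)


# The centre of an 850 x 1100 page stays at the centre of the unit square.
print(pixel_to_relative((425, 550), width=850, height=1100))
```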
Additional Metadata Fields by Document Type
Field Name | Applicable Doc Types | Short Description |
---|---|---|
page_number | DOCX, PDF, PPT, XLSX | Page Number |
page_name | XLSX | Sheet Name in Excel document |
sent_from | EML | Email Sender |
sent_to | EML | Email Recipient |
subject | EML | Email Subject |
attached_to_filename | MSG | Filename that the attachment file is attached to |
header_footer_type | Word Doc | Pages a header or footer applies to: "primary", "even_only", and "first_page" |
link_urls | HTML | The URL associated with a link in a document. |
link_texts | HTML | The text associated with a link in a document. |
links | HTML | List of {"text": "<the text>", "url": "<the url>"} items. Note: this field will be removed in the near future in favor of the above link_urls and link_texts. |
section | EPUB | Book section title corresponding to table of contents |
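Because links is slated for removal, code that needs text/url pairs can build them from link_texts and link_urls. The sketch below assumes the two lists are parallel (the same length and order), and the sample metadata dictionary is hand-written for illustration.

```python
from typing import Any, Dict, List

# Hand-written metadata for an HTML element (illustrative values only).
metadata: Dict[str, Any] = {
    "link_texts": ["Docs", "Source"],
    "link_urls": ["https://example.com/docs", "https://example.com/src"],
}

# Pair each link text with its URL, assuming the two lists line up by index.
links: List[Dict[str, str]] = [
    {"text": text, "url": url}
    for text, url in zip(metadata.get("link_texts", []), metadata.get("link_urls", []))
]
print(links)
```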
Notes on additional metadata by document type:
Email
Emails will include sent_from, sent_to, and subject metadata. sent_from is a list of strings because the RFC 822 spec for emails allows for multiple sender addresses.
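Since sent_from is a list, sender checks are membership tests rather than string comparisons. The following sketch filters hand-written email records by a subject regular expression and a sender address. The records, the pattern, and the helper name is_invoice_from are all illustrative assumptions.

```python
import re
from typing import Any, Dict, List

# Hand-written records shaped like element.to_dict() output (illustrative only).
emails: List[Dict[str, Any]] = [
    {
        "text": "See attached invoice.",
        "metadata": {
            "sent_from": ["billing@example.com"],
            "sent_to": ["me@example.com"],
            "subject": "Invoice 1042",
        },
    },
    {
        "text": "Lunch?",
        "metadata": {
            "sent_from": ["friend@example.com"],
            "sent_to": ["me@example.com"],
            "subject": "Friday",
        },
    },
]

SUBJECT_PATTERN = re.compile(r"^Invoice \d+$")


def is_invoice_from(element: Dict[str, Any], sender: str) -> bool:
    """True if the subject matches the pattern and sender is among sent_from."""
    md = element.get("metadata", {})
    subject_ok = SUBJECT_PATTERN.search(md.get("subject", "")) is not None
    return subject_ok and sender in md.get("sent_from", [])


invoices = [e["text"] for e in emails if is_invoice_from(e, "billing@example.com")]
print(invoices)
```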
Microsoft Excel Documents
For Excel documents, ElementMetadata will contain a page_name field, which corresponds to the sheet name in the Excel document.
Microsoft Word Documents
Headers and footers in Word documents include a header_footer_type indicating which pages a header or footer applies to. Valid values are "primary", "even_only", and "first_page".
Data Connector Metadata Fields
Documents processed through unstructured-ingest connectors include additional document metadata. These additional fields only ever appear if the source document was processed by a connector.
Common Data Connector Metadata Fields
- Data Source metadata (on json output):
  - url
  - version
  - date created
  - date modified
  - date processed
  - record locator

Record locator is specific to each connector.
Additional Metadata Fields by Connector Type (via record locator)
- airtable
  - base id
  - table id
  - view id
- azure (from fsspec)
  - protocol
  - remote file path
- box (from fsspec)
  - protocol
  - remote file path
- confluence
  - url
  - page id
- discord
  - channel
- dropbox (from fsspec)
  - protocol
  - remote file path
- elasticsearch
  - url
  - index name
  - document id
- fsspec
  - protocol
  - remote file path
- google drive
  - drive id
  - file id
- gcs (from fsspec)
  - protocol
  - remote file path
- jira
  - base url
  - issue key
- onedrive
  - user pname
  - server relative path
- outlook
  - message id
  - user email
- s3 (from fsspec)
  - protocol
  - remote file path
- sharepoint
  - server path
  - site url
- wikipedia
  - page title
  - page url
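Because the record locator differs per connector, downstream code can treat it as an opaque key, for example to group elements by their source document. In the sketch below the records are hand-written, and the key spellings (data_source, record_locator, protocol, remote_file_path) are assumptions about the JSON layout rather than confirmed names.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple

# Hand-written records. The key spellings used here are assumptions of this sketch.
elements: List[Dict[str, Any]] = [
    {"text": "Intro", "metadata": {"data_source": {"record_locator": {"protocol": "s3", "remote_file_path": "bucket/a.pdf"}}}},
    {"text": "Body", "metadata": {"data_source": {"record_locator": {"protocol": "s3", "remote_file_path": "bucket/a.pdf"}}}},
    {"text": "Notes", "metadata": {"data_source": {"record_locator": {"protocol": "s3", "remote_file_path": "bucket/b.docx"}}}},
]


def locator_key(element: Dict[str, Any]) -> Tuple[Tuple[str, Any], ...]:
    """Turn the connector-specific record locator into a hashable key."""
    locator = element.get("metadata", {}).get("data_source", {}).get("record_locator", {})
    # Sort the items so the key does not depend on dictionary ordering.
    return tuple(sorted(locator.items()))


by_source: Dict[Tuple[Tuple[str, Any], ...], List[str]] = defaultdict(list)
for el in elements:
    by_source[locator_key(el)].append(el["text"])

for key, texts in by_source.items():
    print(dict(key), texts)
```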
Advanced Metadata Options
Extract Metadata with Regexes
unstructured allows users to extract additional metadata with regexes using the regex_metadata kwarg.
Here is an example of how to extract regex metadata:
```python
from unstructured.partition.text import partition_text

text = "SPEAKER 1: It is my turn to speak now!"
# Map each metadata key to the regex whose matches should be recorded under it.
elements = partition_text(text=text, regex_metadata={"speaker": r"SPEAKER \d{1,3}:"})
elements[0].metadata.regex_metadata
```
The result will look like:
```python
{
    'speaker': [
        {
            'text': 'SPEAKER 1:',
            'start': 0,
            'end': 10,
        }
    ]
}
```
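To make the meaning of text, start, and end concrete, the sketch below reproduces the same result shape with the standard library's re module. It is not the library's implementation, and the helper name extract_regex_metadata is hypothetical. It only shows that start and end are character offsets of each match in the element text.

```python
import re
from typing import Dict, List, Union

Match = Dict[str, Union[str, int]]


def extract_regex_metadata(text: str, patterns: Dict[str, str]) -> Dict[str, List[Match]]:
    """Collect every match of each named pattern along with its offsets."""
    return {
        name: [
            {"text": m.group(), "start": m.start(), "end": m.end()}
            for m in re.finditer(pattern, text)
        ]
        for name, pattern in patterns.items()
    }


text = "SPEAKER 1: It is my turn to speak now!"
result = extract_regex_metadata(text, {"speaker": r"SPEAKER \d{1,3}:"})
print(result)
```

As with Python slices, end is exclusive, so text[start:end] returns the matched string.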