Cleaning

When preparing data for an NLP model, you often need to clean it before passing it to the model. Unwanted content in your output, for example, can degrade the quality of your NLP model. To help with this, the unstructured library includes cleaning functions that sanitize output before it is sent to downstream applications.

Some cleaning functions apply automatically. In the example in the Partition section, the output Philadelphia Eaglesâ\x80\x99 victory automatically gets converted to Philadelphia Eagles' victory in partition_html using the replace_unicode_quotes cleaning function. You can see how that works in the code snippet below:

from unstructured.cleaners.core import replace_unicode_quotes

replace_unicode_quotes("Philadelphia Eaglesâ\x80\x99 victory")

Document elements in unstructured include an apply method that allows you to apply text cleaning to the document element without instantiating a new element. The apply method expects a callable that takes a string as input and produces another string as output. In the example below, we invoke the replace_unicode_quotes cleaning function using the apply method.

from unstructured.documents.elements import Text

element = Text("Philadelphia Eaglesâ\x80\x99 victory")
element.apply(replace_unicode_quotes)
print(element)

Since a cleaning function is just a str -> str function, users can also easily include their own cleaning functions for custom data preparation tasks. In the example below, we remove citations from a section of text.

import re

# Use a raw string so the regex escapes are not treated as string escapes
remove_citations = lambda text: re.sub(r"\[\d{1,3}\]", "", text)

element = Text("[1] Geolocated combat footage has confirmed Russian gains in the Dvorichne area northwest of Svatove.")
element.apply(remove_citations)
print(element)
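
Because every cleaner shares the same str -> str shape, several cleaners can be chained into one callable before being handed to apply. The sketch below uses only the standard library; compose_cleaners is a hypothetical helper name, not part of unstructured, and the shortened input string is made up for illustration.

```python
import re
from functools import reduce
from typing import Callable

Cleaner = Callable[[str], str]


def compose_cleaners(*cleaners: Cleaner) -> Cleaner:
    """Chain cleaners left to right into a single str -> str callable.

    Hypothetical helper for illustration; not part of unstructured.
    """
    def composed(text: str) -> str:
        return reduce(lambda acc, fn: fn(acc), cleaners, text)

    return composed


def remove_citations(text: str) -> str:
    # Drop bracketed numeric citations such as "[1]" or "[123]"
    return re.sub(r"\[\d{1,3}\]", "", text)


# Remove the citation first, then trim the whitespace it leaves behind
clean_citation_text = compose_cleaners(remove_citations, str.strip)
result = clean_citation_text("[1] Geolocated combat footage has confirmed Russian gains.")
print(result)
```

Since the composed function is itself a str -> str callable, it should be usable anywhere a single cleaning function is expected.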

See below for a full list of cleaning functions in the unstructured library.

bytes_string_to_string

Converts an output string that looks like a byte string to a string using the specified encoding. This happens sometimes in partition_html when there is a character like an emoji that isn’t expected by the HTML parser. In that case, the encoded bytes get processed.

Examples:

from unstructured.cleaners.core import bytes_string_to_string

text = "Hello ð\x9f\x98\x80"
# The output should be "Hello 😀"
bytes_string_to_string(text, encoding="utf-8")

from unstructured.cleaners.core import bytes_string_to_string
from unstructured.partition.html import partition_html

text = """\n<html charset="utf-8"><p>Hello 😀</p></html>"""
elements = partition_html(text=text)
elements[0].apply(bytes_string_to_string)
# The output should be "Hello 😀"
elements[0].text
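
To make the idea of a string that "looks like a byte string" concrete, the sketch below reproduces the conversion with the standard library only. It assumes each character in the input stands for one raw byte (code points 0 to 255); decode_byte_like_string is a hypothetical name used for illustration and may differ from how the library implements bytes_string_to_string.

```python
def decode_byte_like_string(text: str, encoding: str = "utf-8") -> str:
    """Reinterpret each character as a single byte, then decode the bytes.

    Illustrative stand-in only; assumes every code point is below 256.
    """
    raw = bytes(ord(ch) for ch in text)
    return raw.decode(encoding)


# "ð\x9f\x98\x80" corresponds to the bytes F0 9F 98 80, the UTF-8 encoding of 😀
fixed = decode_byte_like_string("Hello ð\x9f\x98\x80")
print(fixed)
```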

For more information about the bytes_string_to_string function, you can check the source code here.

clean

Cleans a section of text with options including removing bullets, extra whitespace, dashes and trailing punctuation. Optionally, you can choose to lowercase the output.

Options:

  • Applies clean_bullets if bullets=True.

  • Applies clean_extra_whitespace if extra_whitespace=True.

  • Applies clean_dashes if dashes=True.

  • Applies clean_trailing_punctuation if trailing_punctuation=True.

  • Lowercases the output if lowercase=True.

Examples:

from unstructured.cleaners.core import clean

# Returns "an excellent point!"
clean("● An excellent point!", bullets=True, lowercase=True)

# Returns "ITEM 1A: RISK FACTORS"
clean("ITEM 1A:     RISK-FACTORS", extra_whitespace=True, dashes=True)

For more information about the clean function, you can check the source code here.

clean_bullets

Removes bullets from the beginning of text. Bullets that do not appear at the beginning of the text are not removed.

Examples:

from unstructured.cleaners.core import clean_bullets

# Returns "An excellent point!"
clean_bullets("● An excellent point!")

# Returns "I love Morse Code! ●●●"
clean_bullets("I love Morse Code! ●●●")

For more information about the clean_bullets function, you can check the source code here.

clean_dashes

Removes dashes from a section of text. Also handles special characters such as \u2013.

Examples:

from unstructured.cleaners.core import clean_dashes

# Returns "ITEM 1A: RISK FACTORS"
clean_dashes("ITEM 1A: RISK-FACTORS\u2013")

For more information about the clean_dashes function, you can check the source code here.

clean_extra_whitespace

Removes extra whitespace from a section of text. Also handles special characters such as \xa0 and newlines.

Examples:

from unstructured.cleaners.core import clean_extra_whitespace

# Returns "ITEM 1A: RISK FACTORS"
clean_extra_whitespace("ITEM 1A:     RISK FACTORS\n")

For more information about the clean_extra_whitespace function, you can check the source code here.

clean_non_ascii_chars

Removes non-ascii characters from a string.

Examples:

from unstructured.cleaners.core import clean_non_ascii_chars

text = "\x88This text contains®non-ascii characters!●"

# Returns "This text containsnon-ascii characters!"
clean_non_ascii_chars(text)
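
For readers who want to see the effect without installing the library, a rough standard-library equivalent is to encode to ASCII while ignoring characters that cannot be represented. The name drop_non_ascii is hypothetical, and this is a sketch of the behavior described above rather than the library's code.

```python
def drop_non_ascii(text: str) -> str:
    """Discard every character outside the ASCII range (illustrative sketch)."""
    return text.encode("ascii", errors="ignore").decode("ascii")


cleaned = drop_non_ascii("\x88This text contains®non-ascii characters!●")
print(cleaned)
```

Note that the characters are dropped, not replaced, so words on either side of a removed character are joined together.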

For more information about the clean_non_ascii_chars function, you can check the source code here.

clean_ordered_bullets

Removes alphanumeric bullets from the beginning of text up to three “sub-section” levels.

Examples:

from unstructured.cleaners.core import clean_ordered_bullets

# Returns "This is a very important point"
clean_ordered_bullets("1.1 This is a very important point")

# Returns "This is a very important point ●"
clean_ordered_bullets("a.b This is a very important point ●")

For more information about the clean_ordered_bullets function, you can check the source code here.

clean_postfix

Removes the postfix from a string if it matches a specified pattern.

Options:

  • Ignores case if ignore_case is set to True. The default is False.

  • Strips trailing whitespace if strip is set to True. The default is True.

Examples:

from unstructured.cleaners.core import clean_postfix

text = "The end! END"

# Returns "The end!"
clean_postfix(text, r"(END|STOP)", ignore_case=True)
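
The interaction between the pattern, ignore_case, and strip can be illustrated with a small standard-library sketch. strip_postfix is a hypothetical stand-in written for this example; it anchors the pattern to the end of the string and is only an approximation of what clean_postfix does.

```python
import re


def strip_postfix(text: str, pattern: str, ignore_case: bool = False, strip: bool = True) -> str:
    """Remove a trailing match of `pattern`, optionally trimming trailing whitespace.

    Illustrative stand-in; not the library's implementation.
    """
    flags = re.IGNORECASE if ignore_case else 0
    # "$" anchors the match so only text at the very end is removed
    cleaned = re.sub(rf"(?:{pattern})$", "", text, flags=flags)
    return cleaned.rstrip() if strip else cleaned


with_strip = strip_postfix("The end! END", r"(END|STOP)", ignore_case=True)
without_strip = strip_postfix("The end! END", r"(END|STOP)", strip=False)
no_match = strip_postfix("The end! end", r"(END|STOP)")
print(repr(with_strip), repr(without_strip), repr(no_match))
```

The same idea applies to a prefix cleaner by anchoring the pattern at the start of the string and trimming leading whitespace instead.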

For more information about the clean_postfix function, you can check the source code here.

clean_prefix

Removes the prefix from a string if it matches a specified pattern.

Options:

  • Ignores case if ignore_case is set to True. The default is False.

  • Strips leading whitespace if strip is set to True. The default is True.

Examples:

from unstructured.cleaners.core import clean_prefix

text = "SUMMARY: This is the best summary of all time!"

# Returns "This is the best summary of all time!"
clean_prefix(text, r"(SUMMARY|DESCRIPTION):", ignore_case=True)

For more information about the clean_prefix function, you can check the source code here.

clean_trailing_punctuation

Removes trailing punctuation from a section of text.

Examples:

from unstructured.cleaners.core import clean_trailing_punctuation

# Returns "ITEM 1A: RISK FACTORS"
clean_trailing_punctuation("ITEM 1A: RISK FACTORS.")

For more information about the clean_trailing_punctuation function, you can check the source code here.

group_broken_paragraphs

Groups together paragraphs that are broken up with line breaks for visual or formatting purposes. This is common in .txt files. By default, group_broken_paragraphs groups together lines split by \n. You can change that behavior with the line_split kwarg. The function considers \n\n to be a paragraph break by default. You can change that behavior with the paragraph_split kwarg.

Examples:

from unstructured.cleaners.core import group_broken_paragraphs

text = """The big brown fox
was walking down the lane.

At the end of the lane, the
fox met a bear."""

group_broken_paragraphs(text)

import re
from unstructured.cleaners.core import group_broken_paragraphs

para_split_re = re.compile(r"(\s*\n\s*){3}")

text = """The big brown fox

was walking down the lane.


At the end of the lane, the

fox met a bear."""

group_broken_paragraphs(text, paragraph_split=para_split_re)
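
The examples above do not show their output, so the following standard-library sketch illustrates the general idea: split the text on the paragraph pattern, then replace the remaining line breaks inside each paragraph with a space. The function group_lines and its default patterns are assumptions made for this illustration; the library function may apply additional rules that this sketch does not reproduce.

```python
import re


def group_lines(
    text: str,
    line_split: str = r"\s*\n\s*",
    paragraph_split: str = r"(?:\s*\n\s*){2}",
) -> str:
    """Join lines within a paragraph while keeping paragraph breaks.

    Simplified illustration; not the library's implementation.
    """
    paragraphs = re.split(paragraph_split, text)
    # Skip whitespace-only pieces (these can appear if the pattern has a capturing group)
    joined = [re.sub(line_split, " ", p).strip() for p in paragraphs if p.strip()]
    return "\n\n".join(joined)


single_spaced = """The big brown fox
was walking down the lane.

At the end of the lane, the
fox met a bear."""

double_spaced = """The big brown fox

was walking down the lane.


At the end of the lane, the

fox met a bear."""

grouped = group_lines(single_spaced)
# With double-spaced lines, treat three consecutive breaks as the paragraph boundary
grouped_double = group_lines(double_spaced, paragraph_split=r"(\s*\n\s*){3}")
print(grouped)
```

In this sketch both inputs reduce to the same two paragraphs, each on a single line and separated by a blank line.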

For more information about the group_broken_paragraphs function, you can check the source code here.

remove_punctuation

Removes ASCII and unicode punctuation from a string.

Examples:

from unstructured.cleaners.core import remove_punctuation

# Returns "A lovely quote"
remove_punctuation("“A lovely quote!”")
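
One way to cover both ASCII and unicode punctuation is to filter on the Unicode general category, where every punctuation class starts with "P". The sketch below takes that approach with the standard library; strip_punctuation is a hypothetical name, and the approach is an assumption about how such a cleaner could work, not a statement of the library's implementation.

```python
import unicodedata


def strip_punctuation(text: str) -> str:
    """Drop characters whose Unicode category is a punctuation class (P*).

    Illustrative sketch only.
    """
    return "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))


cleaned = strip_punctuation("“A lovely quote!”")
print(cleaned)
```

Under this definition, characters such as "$" and "+" are kept, because Unicode classifies them as symbols and not as punctuation.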

For more information about the remove_punctuation function, you can check the source code here.

replace_unicode_quotes

Replaces unicode quote characters such as \x91 in strings.

Examples:

from unstructured.cleaners.core import replace_unicode_quotes

# Returns "“A lovely quote!”"
replace_unicode_quotes("\x93A lovely quote!\x94")

# Returns "‘A lovely quote!’"
replace_unicode_quotes("\x91A lovely quote!\x92")

For more information about the replace_unicode_quotes function, you can check the source code here.

translate_text

The translate_text cleaning function translates text between languages. translate_text uses the Helsinki NLP MT models from transformers for machine translation. It works for Russian, Chinese, Arabic, and many other languages.

Parameters:

  • text: the input string to translate.

  • source_lang: the two-letter language code for the source language of the text. If source_lang is not specified, the language will be detected using langdetect.

  • target_lang: the two-letter language code for the target language for translation. Defaults to "en".
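
As a rough illustration of how the two language codes relate to a translation model, the Helsinki-NLP models published on the Hugging Face Hub follow the naming scheme Helsinki-NLP/opus-mt-{source}-{target}. The helper below is hypothetical and only builds that name; whether translate_text resolves models in exactly this way is an assumption, and not every language pair has a published model.

```python
def opus_mt_model_name(source_lang: str, target_lang: str = "en") -> str:
    """Build a Helsinki-NLP OPUS-MT model identifier from two language codes.

    Hypothetical helper for illustration; it does not check that the model exists.
    """
    return f"Helsinki-NLP/opus-mt-{source_lang}-{target_lang}"


print(opus_mt_model_name("ru"))
print(opus_mt_model_name("de", "fr"))
```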

Examples:

from unstructured.cleaners.translate import translate_text

# Output is "I'm a Berliner!"
translate_text("Ich bin ein Berliner!")

# Output is "I can also translate Russian!"
translate_text("Я тоже можно переводать русский язык!", "ru", "en")

For more information about the translate_text function, you can check the source code here.