proomping 2

Kilian Schuettler 2023-08-08 10:57:20 +02:00
parent b2621fb4b3
commit d0126d36a7

@ -16,7 +16,7 @@ The Section, Table, TableCell, Paragraph, and Headline implement a common interf
- TableCells may have any child, except TableCells.
- Paragraphs and Headlines have no children.
- Sections may have any child except TableCells. If a Section contains Paragraphs as well as Tables, it is split into a Section with multiple child Sections, each of which contains either only Tables or only Paragraphs.
The first Headline remains in the Parent Section, while all others are put into the child section they belong to.
Further, if the first SemanticNode is a Headline it remains the first child in the Parent Section, before any subsections.
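The splitting rule above can be illustrated with a minimal sketch. It is not the project's implementation: `SketchNode`, `SectionSplitSketch`, and `buildSection` are hypothetical names, node types are plain strings, and the sketch assumes that consecutive nodes of the same kind share one child Section and that a later Headline opens the child Section it belongs to.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-in for a SemanticNode; only the type and the children are modeled.
final class SketchNode {
    final String type; // "Section", "Headline", "Paragraph" or "Table"
    final List<SketchNode> children = new ArrayList<>();

    SketchNode(String type) {
        this.type = type;
    }

    @Override
    public String toString() {
        if (children.isEmpty()) {
            return type;
        }
        return type + children.stream()
                .map(SketchNode::toString)
                .collect(Collectors.joining(", ", "[", "]"));
    }
}

final class SectionSplitSketch {

    // Builds a Section from a flat list of node types, splitting it when
    // Paragraphs and Tables are mixed.
    static SketchNode buildSection(List<String> content) {
        SketchNode section = new SketchNode("Section");
        boolean mixed = content.contains("Paragraph") && content.contains("Table");
        if (!mixed) {
            // No split needed: every node becomes a direct child.
            for (String type : content) {
                section.children.add(new SketchNode(type));
            }
            return section;
        }
        int start = 0;
        if (content.get(0).equals("Headline")) {
            // A leading Headline stays the first child of the parent Section.
            section.children.add(new SketchNode("Headline"));
            start = 1;
        }
        SketchNode current = null;
        String currentKind = null;
        for (int i = start; i < content.size(); i++) {
            String type = content.get(i);
            if (type.equals("Headline")) {
                // Assumption: a later Headline opens the child Section it belongs to.
                current = new SketchNode("Section");
                section.children.add(current);
                currentKind = null;
            } else {
                if (current == null || (currentKind != null && !currentKind.equals(type))) {
                    // Switching between Paragraphs and Tables starts a new child Section.
                    current = new SketchNode("Section");
                    section.children.add(current);
                }
                currentKind = type;
            }
            current.children.add(new SketchNode(type));
        }
        return section;
    }
}
```

Under these assumptions, `buildSection(List.of("Headline", "Paragraph", "Paragraph", "Table", "Paragraph"))` yields a parent Section whose children are the Headline followed by three child Sections: one with two Paragraphs, one with the Table, and one with the last Paragraph.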
----------------------------------------------------------------
The relevant functions for SemanticNode:
@ -49,10 +49,12 @@ Set<Page> getPages()
* @param pageNumber The page number to check.
* @return True if this node is found on the specified page number, false otherwise.
*/
boolean isOnPage(int pageNumber)
boolean onPage(int pageNumber)
/**
* Returns the closest Headline associated with this SemanticNode.
* For Sections, it searches its children and returns the first Headline.
* For Paragraphs, Tables, and TableCells, it returns getHeadline() of getParent().
* For a Headline, it returns itself; for Headers and Footers, it returns an empty dummy Headline.
*
* @return First Headline found.
*/
@ -60,7 +62,7 @@ Headline getHeadline()
/**
* @return The SemanticNode representing the Parent in the DocumentTree
* throws NotFoundException, when no parent is present
* When no parent is present, the Document is returned. For the Document itself, it throws an UnsupportedOperationException.
*/
SemanticNode getParent()
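The lookup behavior described for getHeadline() and getParent() can be sketched as follows. All `Demo*` classes are hypothetical stand-ins, not the real SemanticNode types. Two cases are not specified above and are assumptions here: a Section without any Headline child returns an empty dummy Headline, and so does the Document.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical base class standing in for SemanticNode.
abstract class DemoNode {
    final DemoDocument document; // null only for the Document itself
    DemoNode parent;             // null until the node is attached to a parent
    final List<DemoNode> children = new ArrayList<>();

    DemoNode(DemoDocument document) {
        this.document = document;
    }

    DemoNode add(DemoNode child) {
        child.parent = this;
        children.add(child);
        return this;
    }

    // When no parent is present, the Document is returned.
    DemoNode getParent() {
        return parent != null ? parent : document;
    }

    // Default for Paragraphs, Tables and TableCells: delegate to the parent.
    DemoHeadline getHeadline() {
        return getParent().getHeadline();
    }
}

final class DemoDocument extends DemoNode {
    DemoDocument() {
        super(null);
    }

    @Override
    DemoNode getParent() {
        throw new UnsupportedOperationException("The Document has no parent");
    }

    @Override
    DemoHeadline getHeadline() {
        // Assumption: the Document answers with an empty dummy Headline.
        return new DemoHeadline(this, "");
    }
}

final class DemoHeadline extends DemoNode {
    final String text;

    DemoHeadline(DemoDocument document, String text) {
        super(document);
        this.text = text;
    }

    @Override
    DemoHeadline getHeadline() {
        return this; // a Headline returns itself
    }
}

final class DemoSection extends DemoNode {
    DemoSection(DemoDocument document) {
        super(document);
    }

    @Override
    DemoHeadline getHeadline() {
        // Search the direct children and return the first Headline.
        for (DemoNode child : children) {
            if (child instanceof DemoHeadline) {
                return (DemoHeadline) child;
            }
        }
        // Assumption: fall back to an empty dummy Headline when none is found.
        return new DemoHeadline(document, "");
    }
}

final class DemoParagraph extends DemoNode {
    DemoParagraph(DemoDocument document) {
        super(document);
    }
}
```

With a Section that holds a Headline and a Paragraph, calling getHeadline() on the Paragraph delegates to the Section, which returns that Headline. A Section that was never attached to a parent reports the Document from getParent().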
@ -330,6 +332,10 @@ The Page Object has the following functions:
*/
public TextBlock getMainBodyTextBlock()
/**
* @return All SemanticNodes that occur on the page, except Header and Footer
*/
public List<SemanticNode> getMainBody()
/**
* Gets all Entities located on the page
* @return Set of all Entities associated with this Page
*/
@ -341,8 +347,8 @@ Set<RedactionEntity> getEntities();
*/
Integer getPageNumber();
----------------------------------------------------------------
The goal of the Rules is to find pieces of Text that we want to redact.
There are two different types of rules, during one you create new Entities and in the other you change or remove existing Entities.
The goal of the Rules is to find pieces of Text that we want to redact. These pieces of text are represented as RedactionEntities.
There are two different types of rules: in one you create new Entities, and in the other you change or remove existing Entities.
An Entity is any piece of text, uniquely identified in the Document by its Boundary, its Type and its EntityType. The Boundary consists of a start and stop index in the text of the document.
The Type is a String like "PII", which stands for
The goal is to find entities that fulfill certain conditions. Each SemanticNode has its own set of entities, but these sets may have intersections.
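To make the identity rule concrete, here is a minimal sketch of an Entity that is equal to another exactly when Boundary, Type, and EntityType match, together with two node-level entity sets that overlap. `DemoBoundary`, `DemoEntity`, `DemoEntityType`, and `EntitySetSketch` are hypothetical names, and the enum constants are placeholders, since the real EntityType values are not listed here.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// Placeholder values; the real EntityType constants are not shown in this document.
enum DemoEntityType { ENTITY, RECOMMENDATION }

// Start and stop index in the text of the document.
final class DemoBoundary {
    final int start;
    final int end;

    DemoBoundary(int start, int end) {
        this.start = start;
        this.end = end;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof DemoBoundary)) {
            return false;
        }
        DemoBoundary other = (DemoBoundary) o;
        return start == other.start && end == other.end;
    }

    @Override
    public int hashCode() {
        return Objects.hash(start, end);
    }
}

// Identity is defined by Boundary, Type and EntityType together.
final class DemoEntity {
    final DemoBoundary boundary;
    final String type; // e.g. "PII"
    final DemoEntityType entityType;

    DemoEntity(DemoBoundary boundary, String type, DemoEntityType entityType) {
        this.boundary = boundary;
        this.type = type;
        this.entityType = entityType;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof DemoEntity)) {
            return false;
        }
        DemoEntity other = (DemoEntity) o;
        return boundary.equals(other.boundary)
                && type.equals(other.type)
                && entityType == other.entityType;
    }

    @Override
    public int hashCode() {
        return Objects.hash(boundary, type, entityType);
    }
}

final class EntitySetSketch {
    // Returns the entities that two nodes have in common without modifying either set.
    static Set<DemoEntity> shared(Set<DemoEntity> first, Set<DemoEntity> second) {
        Set<DemoEntity> result = new HashSet<>(first);
        result.retainAll(second);
        return result;
    }
}
```

Because equality covers all three fields, two entities with the same Boundary and Type but a different EntityType count as distinct, and an entity found in a Paragraph also appears in the set of the Section that contains it.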