Commit 81649c3

Add levels (#4)
* Added 'level' attribute to track the level of a heading
* Document possible categories for textual content

---------

Co-authored-by: Andrea Ponti <ponti.andrea97@gmail.com>
1 parent cbd2eb7 commit 81649c3

3 files changed: 65 additions and 4 deletions

README.md

Lines changed: 29 additions & 3 deletions
@@ -92,12 +92,37 @@ Represents a page in the document:
 
 This node represent a paragraph, a heading or any text within the document.
 
-- `category`: The type `"doc"`.
+- `category`: The classification of the text within the document.
 - `content`: A string representing the textual content.
 - `marks`: List of [marks](#marks) applied to the text, such as bold, italic, etc.
 - `attributes`: Can contain metadata like the bounding box representing where this portion of text is located in the page.
 
-
+### Category
+Below are the various categories of text that may be found within a document:
+
+**Category Type**
+- `page-header`: Represents the header of the page.
+- `footer`: Represents the footer of the page.
+- `heading`: Any heading within the document.
+- `figure`: Represents a figure or an image.
+- `other`: Any other unclassified text.
+- `appendix`: Text within an appendix.
+- `keywords`: List of keywords.
+- `acknowledgments`: Section acknowledging contributors.
+- `caption`: Caption associated with a figure or table.
+- `toc`: Table of contents.
+- `abstract`: The abstract of the document.
+- `footnote`: Text at the bottom of the page providing additional information.
+- `body`: Main body text of the document.
+- `itemize-item`: Item in a list or bullet point.
+- `title`: The title of the document.
+- `reference`: References or citations within the document.
+- `affiliation`: Author's institutional affiliation.
+- `general-terms`: General terms section.
+- `formula`: Mathematical formula or equation.
+- `categories`: Categories or topics listed in the document.
+- `table`: Represents a table.
+- `authors`: List of authors.
 
 ### Marks
 
@@ -119,8 +144,9 @@ Attributes are optional fields that can store additional information for each no
 
 - `DocumentAttributes`: General attributes for the document (currently reserved for the future).
 - `PageAttributes`: Specific page related attributes, such as the page number.
-- `TextAttributes`: Text related attributes, such as bounding boxes.
+- `TextAttributes`: Text related attributes, such as bounding boxes or level.
 - `BoundingBox`: A box that specifies the position of a text in the page.
+- `Level`: The specific level of the text within a document, for example, for headings.
 
 
 ## Getting started
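
To make the README additions concrete, here is a minimal sketch of how a consumer might combine `category` with the new `level` attribute to build a heading outline. The node layout (plain dicts with `category`, `content`, `marks`, and an `attributes` dict holding `level`) and the helper name `heading_outline` are assumptions made for illustration; they are not part of the package's API.

```python
from typing import Any, Optional

# Hypothetical text nodes as plain dicts. The real package uses pydantic
# models, so this layout is an assumption for illustration only.
nodes: list[dict[str, Any]] = [
    {"category": "title", "content": "A Sample Paper", "marks": [], "attributes": {}},
    {"category": "heading", "content": "Introduction", "marks": [], "attributes": {"level": 1}},
    {"category": "body", "content": "Some text.", "marks": [], "attributes": {}},
    {"category": "heading", "content": "Background", "marks": [], "attributes": {"level": 2}},
    {"category": "heading", "content": "Untitled section", "marks": [], "attributes": {}},
]


def heading_outline(text_nodes: list[dict[str, Any]]) -> list[str]:
    """Return one indented line per heading, using `level` for the indent."""
    outline = []
    for node in text_nodes:
        if node.get("category") != "heading":
            continue
        level: Optional[int] = node.get("attributes", {}).get("level")
        # `level` is optional, so a missing value is treated as top level here.
        depth = (level or 1) - 1
        outline.append("  " * depth + node["content"])
    return outline


print("\n".join(heading_outline(nodes)))
```

Treating a missing `level` as a top-level heading is a choice made for this sketch; a consumer could just as reasonably skip such headings.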

parse_document_model/attributes.py

Lines changed: 3 additions & 1 deletion
@@ -1,6 +1,7 @@
 from abc import ABC
+from typing import Optional
 
-from pydantic import BaseModel
+from pydantic import BaseModel, Field
 
 
 class BoundingBox(BaseModel):
@@ -25,3 +26,4 @@ class PageAttributes(Attributes):
 
 class TextAttributes(Attributes):
     bounding_box: list[BoundingBox] = []
+    level: Optional[int] = Field(None, ge=1, le=4)
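
The constraint added to `TextAttributes` can be exercised in isolation. The following is a minimal sketch, assuming pydantic is installed. `LevelOnly` is a hypothetical stand-in model that carries only the new field (it omits `bounding_box` and the `Attributes` base class), and `is_valid_level` is a helper invented for this example.

```python
from typing import Any, Optional

from pydantic import BaseModel, Field, ValidationError


class LevelOnly(BaseModel):
    # Same declaration as the field added in this commit:
    # optional, and between 1 and 4 inclusive when present.
    level: Optional[int] = Field(None, ge=1, le=4)


def is_valid_level(value: Any) -> bool:
    """Return True if pydantic accepts `value` for the `level` field."""
    try:
        LevelOnly(level=value)
    except ValidationError:
        return False
    return True


for candidate in (1, 4, None, 0, 5, "invalid"):
    print(repr(candidate), is_valid_level(candidate))
```

Because pydantic's `ValidationError` is a subclass of `ValueError`, tests can catch a rejected value with `pytest.raises(ValueError)`.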

test/test_validation.py

Lines changed: 33 additions & 0 deletions
@@ -66,3 +66,36 @@ def test_url_marks():
         else:
             with pytest.raises(ValueError):
                 UrlMark(**mark_json)
+
+
+def test_text_attributes_level():
+    valid_text_attributes = [
+        {"bounding_box": [], "level": 1},
+        {"bounding_box": [], "level": 2},
+        {"bounding_box": [], "level": 3},
+        {"bounding_box": [], "level": 4},
+        {"bounding_box": [], "level": None},
+        {"bounding_box": []},
+        {}
+    ]
+
+    for attributes_json in valid_text_attributes:
+        text_attributes = TextAttributes(**attributes_json)
+        assert isinstance(text_attributes, TextAttributes)
+        assert isinstance(text_attributes.level, (int, type(None)))
+        if text_attributes.level is not None:
+            assert text_attributes.level in range(1, 5)
+            assert attributes_json["level"] == text_attributes.level
+        else:
+            assert "level" not in attributes_json or attributes_json["level"] is None
+
+    invalid_text_attributes = [
+        {"bounding_box": [], "level": -1},
+        {"bounding_box": [], "level": "invalid"},
+        {"bounding_box": [], "level": 2.5},
+        {"bounding_box": [], "level": 5},
+    ]
+
+    for attributes_json in invalid_text_attributes:
+        with pytest.raises(ValueError):
+            TextAttributes(**attributes_json)
