Commit 4c28652

Wauplin, hanouticelina, and gante committed
New @strict decorator for dataclass validation (#2895)
* New @strict_dataclass decorator
* expose main methods
* typog
* Support Literal[...] type
* Update src/huggingface_hub/utils/_strict_dataclass.py Co-authored-by: Célina <hanouticelina@gmail.com>
* nit
* accept kwargs
* Accept kwargs, move to huggingface.dataclasses, fix autocompletion, add tests, add docs
* docs
* @as_validated_field decorator
* code quality
* class validators
* inherit class validators from not strict classes
* Update docs/source/en/package_reference/dataclasses.md Co-authored-by: célina <hanouticelina@gmail.com>
* remove duplicated definition of _setattr
* Update docs/source/en/package_reference/dataclasses.md Co-authored-by: célina <hanouticelina@gmail.com>
* optional is an alias for union[, None]
* dumb tests
* Raise if already defined by user
* docs
* Update docs/source/en/package_reference/dataclasses.md Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>
* Update docs/source/en/package_reference/dataclasses.md Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>
* doc

---------

Co-authored-by: Célina <hanouticelina@gmail.com>
Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>
1 parent f59601f commit 4c28652

File tree

6 files changed

+1338
-1
lines changed

docs/source/en/_toctree.yml

Lines changed: 4 additions & 0 deletions
@@ -86,3 +86,7 @@
   title: Webhooks server
 - local: package_reference/serialization
   title: Serialization
+- local: package_reference/dataclasses
+  title: Strict dataclasses
+- local: package_reference/oauth
+  title: OAuth
Lines changed: 220 additions & 0 deletions
@@ -0,0 +1,220 @@
# Strict Dataclasses

The `huggingface_hub` package provides a utility to create **strict dataclasses**. These are enhanced versions of Python's standard `dataclass` with additional validation features. Strict dataclasses ensure that fields are validated both during initialization and assignment, making them ideal for scenarios where data integrity is critical.

## Overview

Strict dataclasses are created using the `@strict` decorator. They extend the functionality of regular dataclasses by:

- Validating field types based on type hints
- Supporting custom validators for additional checks
- Optionally allowing arbitrary keyword arguments in the constructor
- Validating fields both at initialization and during assignment

## Benefits

- **Data Integrity**: Ensures fields always contain valid data
- **Ease of Use**: Integrates seamlessly with Python's `dataclass` module
- **Flexibility**: Supports custom validators for complex validation logic
- **Lightweight**: Requires no additional dependencies such as Pydantic, attrs, or similar libraries

## Usage

### Basic Example
```python
from dataclasses import dataclass
from huggingface_hub.dataclasses import strict, as_validated_field

# Custom validator to ensure a value is positive
@as_validated_field
def positive_int(value: int):
    if not value > 0:
        raise ValueError(f"Value must be positive, got {value}")

@strict
@dataclass
class Config:
    model_type: str
    hidden_size: int = positive_int(default=16)
    vocab_size: int = 32  # Default value

    # Methods named `validate_xxx` are treated as class-wise validators
    def validate_big_enough_vocab(self):
        if self.vocab_size < self.hidden_size:
            raise ValueError(f"vocab_size ({self.vocab_size}) must be greater than or equal to hidden_size ({self.hidden_size})")
```
Fields are validated during initialization:

```python
config = Config(model_type="bert", hidden_size=24)  # Valid
config = Config(model_type="bert", hidden_size=-1)  # Raises StrictDataclassFieldValidationError
```

Consistency between fields is also validated during initialization (class-wise validation):

```python
# `vocab_size` too small compared to `hidden_size`
config = Config(model_type="bert", hidden_size=32, vocab_size=16)  # Raises StrictDataclassClassValidationError
```

Fields are also validated during assignment:

```python
config.hidden_size = 512  # Valid
config.hidden_size = -1  # Raises StrictDataclassFieldValidationError
```

To re-run class-wide validation after assignment, you must call `.validate` explicitly:

```python
config.validate()  # Runs all class validators
```
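To build intuition for validation on assignment, here is a minimal standard-library sketch. It is not the `huggingface_hub` implementation: the `validate_on_assignment` decorator and the `"validator"` metadata key are hypothetical names chosen for illustration. The idea is that overriding `__setattr__` lets a check run before any value is stored, and because the `__init__` generated for a regular (non-frozen) dataclass assigns attributes the normal way, the same hook also covers initialization.

```python
from dataclasses import dataclass, field, fields


def positive(value):
    if not value > 0:
        raise ValueError(f"Value must be positive, got {value}")


def validate_on_assignment(cls):
    """Hypothetical decorator (not the library's code): validate on every assignment."""
    # Map field name -> validator, read from an assumed "validator" metadata key.
    validators = {f.name: f.metadata["validator"] for f in fields(cls) if "validator" in f.metadata}

    def __setattr__(self, name, value):
        if name in validators:
            validators[name](value)  # raises before the invalid value is stored
        object.__setattr__(self, name, value)

    cls.__setattr__ = __setattr__
    return cls


@validate_on_assignment
@dataclass
class Demo:
    hidden_size: int = field(default=16, metadata={"validator": positive})


demo = Demo(hidden_size=24)  # the generated __init__ goes through __setattr__
demo.hidden_size = 512       # valid
try:
    demo.hidden_size = -1
except ValueError as err:
    print(err)               # Value must be positive, got -1
print(demo.hidden_size)      # 512 (the rejected value was never stored)
```

Note that a hook like this only sees one field at a time, which is consistent with cross-field checks needing an explicit `.validate()` call after assignment.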
### Custom Validators

You can attach multiple custom validators to fields using [`validated_field`]. A validator is a callable that takes a single argument and raises an exception if the value is invalid.

```python
from dataclasses import dataclass
from huggingface_hub.dataclasses import strict, validated_field

# Plain validator functions (not wrapped with `@as_validated_field`)
def positive_int(value: int):
    if not value > 0:
        raise ValueError(f"Value must be positive, got {value}")

def multiple_of_64(value: int):
    if value % 64 != 0:
        raise ValueError(f"Value must be a multiple of 64, got {value}")

@strict
@dataclass
class Config:
    hidden_size: int = validated_field(validator=[positive_int, multiple_of_64])
```

In this example, both validators are applied to the `hidden_size` field.
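Conceptually, applying several validators to one value amounts to calling each of them in turn and letting the first failure propagate. The sketch below shows this with plain functions only. `run_validators` is a hypothetical helper, and running the validators in list order is an assumption for illustration, not a statement about the library's internals.

```python
from typing import Any, Callable, Iterable


def positive_int(value: int) -> None:
    if not value > 0:
        raise ValueError(f"Value must be positive, got {value}")


def multiple_of_64(value: int) -> None:
    if value % 64 != 0:
        raise ValueError(f"Value must be a multiple of 64, got {value}")


def run_validators(value: Any, validators: Iterable[Callable[[Any], None]]) -> None:
    """Hypothetical helper: call each validator in order; the first failure propagates."""
    for validator in validators:
        validator(value)


checks = [positive_int, multiple_of_64]
run_validators(128, checks)  # passes both checks

for bad in (-64, 100):
    try:
        run_validators(bad, checks)
    except ValueError as err:
        print(err)
# Value must be positive, got -64
# Value must be a multiple of 64, got 100
```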
### Additional Keyword Arguments

By default, strict dataclasses only accept fields defined in the class. You can allow additional keyword arguments by setting `accept_kwargs=True` in the `@strict` decorator.

```python
from dataclasses import dataclass
from huggingface_hub.dataclasses import strict

@strict(accept_kwargs=True)
@dataclass
class ConfigWithKwargs:
    model_type: str
    vocab_size: int = 16

config = ConfigWithKwargs(model_type="bert", vocab_size=30000, extra_field="extra_value")
print(config)  # ConfigWithKwargs(model_type='bert', vocab_size=30000, *extra_field='extra_value')
```

Additional keyword arguments appear in the string representation of the dataclass but are prefixed with `*` to highlight that they are not validated.
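One way to picture this behavior is a wrapper around the generated `__init__` that separates declared fields from everything else. The following is a sketch under assumptions: `accept_extra_kwargs` is a hypothetical decorator written with the standard library only, and it is not how `@strict(accept_kwargs=True)` is actually implemented.

```python
import dataclasses


def accept_extra_kwargs(cls):
    """Hypothetical decorator (not the library's code): keep unknown kwargs as attributes."""
    field_names = [f.name for f in dataclasses.fields(cls)]
    original_init = cls.__init__

    def __init__(self, *args, **kwargs):
        # Pull out every keyword that is not a declared field.
        extra = {k: kwargs.pop(k) for k in list(kwargs) if k not in field_names}
        original_init(self, *args, **kwargs)  # declared fields take the normal path
        for name, value in extra.items():
            setattr(self, name, value)  # stored as-is, without validation

    def __repr__(self):
        parts = [f"{name}={getattr(self, name)!r}" for name in field_names]
        # Mark undeclared attributes with `*`, mirroring the output shown above.
        parts += [f"*{k}={v!r}" for k, v in vars(self).items() if k not in field_names]
        return f"{type(self).__name__}({', '.join(parts)})"

    cls.__init__ = __init__
    cls.__repr__ = __repr__
    return cls


@accept_extra_kwargs
@dataclasses.dataclass
class Demo:
    model_type: str
    vocab_size: int = 16


print(Demo(model_type="bert", extra_field="extra_value"))
# Demo(model_type='bert', vocab_size=16, *extra_field='extra_value')
```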
### Integration with Type Hints

Strict dataclasses respect type hints and validate them automatically. For example:

```python
from typing import List
from dataclasses import dataclass
from huggingface_hub.dataclasses import strict

@strict
@dataclass
class Config:
    layers: List[int]

config = Config(layers=[64, 128])  # Valid
config = Config(layers="not_a_list")  # Raises StrictDataclassFieldValidationError
```

Supported types include:
- Any
- Union
- Optional
- Literal
- List
- Dict
- Tuple
- Set

And any combination of these types. If you need more complex type validation, you can do it through a custom validator.
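To illustrate how nested type hints such as `Dict[str, Optional[int]]` can be checked at runtime, here is a small recursive sketch built on `typing.get_origin` and `typing.get_args`. The `check_type` function is hypothetical and covers only a subset of the hints listed above (`Any`, `Union`/`Optional`, `Literal`, `List`, `Dict`); it is meant to show the general technique, not the library's validator.

```python
from typing import Any, Dict, List, Literal, Optional, Union, get_args, get_origin


def check_type(value: Any, expected: Any) -> None:
    """Raise TypeError if `value` does not match the type hint `expected`."""
    if expected is Any:
        return
    origin, args = get_origin(expected), get_args(expected)
    if origin is None:
        # Plain class such as `int`, `str`, or `type(None)`
        if not isinstance(value, expected):
            raise TypeError(f"Expected {expected.__name__}, got {type(value).__name__}")
    elif origin is Union:
        # `Optional[X]` is `Union[X, None]`, so it is covered by this branch
        for arg in args:
            try:
                check_type(value, arg)
                return
            except TypeError:
                pass
        raise TypeError(f"{value!r} does not match {expected}")
    elif origin is Literal:
        if value not in args:
            raise TypeError(f"Expected one of {args}, got {value!r}")
    elif origin is list:
        if not isinstance(value, list):
            raise TypeError(f"Expected list, got {type(value).__name__}")
        item_type = args[0] if args else Any
        for item in value:
            check_type(item, item_type)
    elif origin is dict:
        if not isinstance(value, dict):
            raise TypeError(f"Expected dict, got {type(value).__name__}")
        key_type, item_type = args if args else (Any, Any)
        for key, item in value.items():
            check_type(key, key_type)
            check_type(item, item_type)
    else:
        raise TypeError(f"Unsupported type hint: {expected}")


check_type([64, 128], List[int])                   # passes
check_type({"a": None}, Dict[str, Optional[int]])  # passes
check_type("relu", Literal["relu", "gelu"])        # passes
try:
    check_type("not_a_list", List[int])
except TypeError as err:
    print(err)  # Expected list, got str
```

Recursing on the arguments of each generic is what makes arbitrary combinations work: every container branch delegates its items back to `check_type`.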
### Class validators

Methods named `validate_xxx` are treated as class validators. These methods must only take `self` as an argument. Class validators are run once during initialization, right after `__post_init__`. You can define as many of them as needed; they are executed sequentially in the order they appear.

Note that class validators are not automatically re-run when a field is updated after initialization. To manually re-validate the object, you need to call `obj.validate()`.

```py
from dataclasses import dataclass
from huggingface_hub.dataclasses import strict

@strict
@dataclass
class Config:
    foo: str
    foo_length: int
    upper_case: bool = False

    def validate_foo_length(self):
        if len(self.foo) != self.foo_length:
            raise ValueError(f"foo must be {self.foo_length} characters long, got {len(self.foo)}")

    def validate_foo_casing(self):
        if self.upper_case and self.foo.upper() != self.foo:
            raise ValueError(f"foo must be uppercase, got {self.foo}")

config = Config(foo="bar", foo_length=3)  # ok

config.upper_case = True
config.validate()  # Raises StrictDataclassClassValidationError

Config(foo="abcd", foo_length=3)  # Raises StrictDataclassClassValidationError
Config(foo="Bar", foo_length=3, upper_case=True)  # Raises StrictDataclassClassValidationError
```

<Tip warning={true}>

`validate` is a reserved method name on strict dataclasses.
To prevent unexpected behavior, a [`StrictDataclassDefinitionError`] is raised if your class already defines it.

</Tip>
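The naming convention for class validators can be sketched with the standard library alone. In the example below, `collect_class_validators` and `run_class_validators` are hypothetical helpers, and calling them from `__post_init__` is a simplification chosen for the sketch (the library is described as running class validators right after `__post_init__`). The point is that methods can be discovered by their `validate_` prefix and called in definition order.

```python
from dataclasses import dataclass


def collect_class_validators(cls):
    """Hypothetical helper: gather `validate_*` methods, base classes first, in definition order."""
    validators = {}
    for klass in reversed(cls.__mro__):
        for name, attr in vars(klass).items():
            if name.startswith("validate_") and callable(attr):
                validators[name] = attr  # a subclass override replaces the parent's version
    return list(validators.values())


def run_class_validators(obj):
    for validator in collect_class_validators(type(obj)):
        validator(obj)  # each validator receives only `self`


@dataclass
class Demo:
    foo: str
    foo_length: int

    def __post_init__(self):
        run_class_validators(self)  # run once, at the end of initialization

    def validate_foo_length(self):
        if len(self.foo) != self.foo_length:
            raise ValueError(f"foo must be {self.foo_length} characters long, got {len(self.foo)}")

    def validate_foo_not_empty(self):
        if not self.foo:
            raise ValueError("foo must not be empty")


demo = Demo(foo="bar", foo_length=3)  # both validators pass
demo.foo = "barbaz"                   # nothing is re-checked on assignment
try:
    run_class_validators(demo)        # explicit re-validation catches the inconsistency
except ValueError as err:
    print(err)                        # foo must be 3 characters long, got 6
```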
## API Reference

### `@strict`

The `@strict` decorator enhances a dataclass with strict validation.

[[autodoc]] dataclasses.strict

### `as_validated_field`

Decorator to create a [`validated_field`]. Recommended for fields with a single validator to avoid boilerplate code.

[[autodoc]] dataclasses.as_validated_field

### `validated_field`

Creates a dataclass field with custom validation.

[[autodoc]] dataclasses.validated_field

### Errors

[[autodoc]] errors.StrictDataclassError

[[autodoc]] errors.StrictDataclassDefinitionError

[[autodoc]] errors.StrictDataclassFieldValidationError
## Why Not Use `pydantic`? (or `attrs`? or `marshmallow_dataclass`?)

- See discussion in https://github.com/huggingface/transformers/issues/36329 regarding adding Pydantic as a dependency. It would be a heavy addition and require careful logic to support both v1 and v2.
- We don't need most of Pydantic's features, especially those related to automatic casting, jsonschema, serialization, aliases, etc.
- We don't need the ability to instantiate a class from a dictionary.
- We don't want to mutate data. In `@strict`, "validation" means "checking if a value is valid." In Pydantic, "validation" means "casting a value, possibly mutating it, and then checking if it's valid."
- We don't need blazing-fast validation. `@strict` isn't designed for heavy loads where performance is critical. Common use cases involve validating a model configuration (performed once and negligible compared to running a model). This allows us to keep the code minimal.
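The difference between the two meanings of "validation" mentioned above can be shown with two plain functions. Both `check_int` and `cast_then_check_int` are hypothetical names used only for this illustration; neither reproduces the behavior of any particular library.

```python
def check_int(value):
    """Check-only validation: accept or reject the value, never modify it."""
    if not isinstance(value, int):
        raise TypeError(f"Expected int, got {type(value).__name__}")
    return value


def cast_then_check_int(value):
    """Cast-first validation: the caller may get back a different object."""
    return int(value)  # raises ValueError if the cast is impossible


print(cast_then_check_int("42"))  # 42 (the string was converted to an int)
try:
    check_int("42")
except TypeError as err:
    print(err)  # Expected int, got str
```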