| 1 | +# Strict Dataclasses |
| 2 | + |
| 3 | +The `huggingface_hub` package provides a utility to create **strict dataclasses**. These are enhanced versions of Python's standard `dataclass` with additional validation features. Strict dataclasses ensure that fields are validated both during initialization and assignment, making them ideal for scenarios where data integrity is critical. |
| 4 | + |
| 5 | +## Overview |
| 6 | + |
| 7 | +Strict dataclasses are created using the `@strict` decorator. They extend the functionality of regular dataclasses by: |
| 8 | + |
| 9 | +- Validating field types based on type hints |
| 10 | +- Supporting custom validators for additional checks |
| 11 | +- Optionally allowing arbitrary keyword arguments in the constructor |
| 12 | +- Validating fields both at initialization and during assignment |
| 13 | + |
| 14 | +## Benefits |
| 15 | + |
| 16 | +- **Data Integrity**: Validates fields at initialization and on every assignment |
| 17 | +- **Ease of Use**: Integrates seamlessly with Python's `dataclass` module |
| 18 | +- **Flexibility**: Supports custom validators for complex validation logic |
| 19 | +- **Lightweight**: Requires no additional dependencies such as Pydantic, attrs, or similar libraries |
| 20 | + |
| 21 | +## Usage |
| 22 | + |
| 23 | +### Basic Example |
| 24 | + |
| 25 | +```python |
| 26 | +from dataclasses import dataclass |
| 27 | +from huggingface_hub.dataclasses import strict, as_validated_field |
| 28 | + |
| 29 | +# Custom validator to ensure a value is positive |
| 30 | +@as_validated_field |
| 31 | +def positive_int(value: int): |
| 32 | +    if not value > 0: |
| 33 | +        raise ValueError(f"Value must be positive, got {value}") |
| 34 | + |
| 35 | +@strict |
| 36 | +@dataclass |
| 37 | +class Config: |
| 38 | +    model_type: str |
| 39 | +    hidden_size: int = positive_int(default=16) |
| 40 | +    vocab_size: int = 32  # Default value |
| 41 | + |
| 42 | +    # Methods named `validate_xxx` are treated as class validators |
| 43 | +    def validate_big_enough_vocab(self): |
| 44 | +        if self.vocab_size < self.hidden_size: |
| 45 | +            raise ValueError(f"vocab_size ({self.vocab_size}) must not be smaller than hidden_size ({self.hidden_size})") |
| 46 | +``` |
| 47 | + |
| 48 | +Fields are validated during initialization: |
| 49 | + |
| 50 | +```python |
| 51 | +config = Config(model_type="bert", hidden_size=24) # Valid |
| 52 | +config = Config(model_type="bert", hidden_size=-1) # Raises StrictDataclassFieldValidationError |
| 53 | +``` |
| 54 | + |
| 55 | +Consistency between fields is also validated during initialization (class-wise validation): |
| 56 | + |
| 57 | +```python |
| 58 | +# `vocab_size` too small compared to `hidden_size` |
| 59 | +config = Config(model_type="bert", hidden_size=32, vocab_size=16) # Raises StrictDataclassClassValidationError |
| 60 | +``` |
| 61 | + |
| 62 | +Fields are also validated during assignment: |
| 63 | + |
| 64 | +```python |
| 65 | +config.hidden_size = 512 # Valid |
| 66 | +config.hidden_size = -1 # Raises StrictDataclassFieldValidationError |
| 67 | +``` |
| 68 | + |
| 69 | +Class validators are not re-run on assignment. To re-run them, call `.validate()` explicitly: |
| 70 | + |
| 71 | +```python |
| 72 | +config.validate() # Runs all class validators |
| 73 | +``` |
| 74 | + |
| 75 | +### Custom Validators |
| 76 | + |
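| 77 | +You can attach multiple custom validators to a field using [`validated_field`]. A validator is a callable that takes a single argument and raises an exception if the value is invalid. The example below assumes `positive_int` is a plain validator function (the body from the basic example, without the `@as_validated_field` decorator). |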
| 78 | + |
| 79 | +```python |
| 80 | +from dataclasses import dataclass |
| 81 | +from huggingface_hub.dataclasses import strict, validated_field |
| 82 | + |
| 83 | +def multiple_of_64(value: int): |
| 84 | +    if value % 64 != 0: |
| 85 | +        raise ValueError(f"Value must be a multiple of 64, got {value}") |
| 86 | + |
| 87 | +@strict |
| 88 | +@dataclass |
| 89 | +class Config: |
| 90 | +    hidden_size: int = validated_field(validator=[positive_int, multiple_of_64]) |
| 91 | +``` |
| 92 | + |
| 93 | +In this example, both validators are applied to the `hidden_size` field. |
| 94 | + |
| 95 | +### Additional Keyword Arguments |
| 96 | + |
| 97 | +By default, strict dataclasses only accept fields defined in the class. You can allow additional keyword arguments by setting `accept_kwargs=True` in the `@strict` decorator. |
| 98 | + |
| 99 | +```python |
| 100 | +from dataclasses import dataclass |
| 101 | +from huggingface_hub.dataclasses import strict |
| 102 | + |
| 103 | +@strict(accept_kwargs=True) |
| 104 | +@dataclass |
| 105 | +class ConfigWithKwargs: |
| 106 | +    model_type: str |
| 107 | +    vocab_size: int = 16 |
| 108 | + |
| 109 | +config = ConfigWithKwargs(model_type="bert", vocab_size=30000, extra_field="extra_value") |
| 110 | +print(config) # ConfigWithKwargs(model_type='bert', vocab_size=30000, *extra_field='extra_value') |
| 111 | +``` |
| 112 | + |
| 113 | +Additional keyword arguments appear in the string representation of the dataclass but are prefixed with `*` to highlight that they are not validated. |
| 114 | + |
| 115 | +### Integration with Type Hints |
| 116 | + |
| 117 | +Strict dataclasses read the type hints of each field and automatically validate field values against them. For example: |
| 118 | + |
| 119 | +```python |
| 120 | +from typing import List |
| 121 | +from dataclasses import dataclass |
| 122 | +from huggingface_hub.dataclasses import strict |
| 123 | + |
| 124 | +@strict |
| 125 | +@dataclass |
| 126 | +class Config: |
| 127 | +    layers: List[int] |
| 128 | + |
| 129 | +config = Config(layers=[64, 128]) # Valid |
| 130 | +config = Config(layers="not_a_list") # Raises StrictDataclassFieldValidationError |
| 131 | +``` |
| 132 | + |
| 133 | +Supported types include: |
| 134 | +- Any |
| 135 | +- Union |
| 136 | +- Optional |
| 137 | +- Literal |
| 138 | +- List |
| 139 | +- Dict |
| 140 | +- Tuple |
| 141 | +- Set |
| 142 | + |
| 143 | +And any combination of these types. If you need more complex type validation, you can implement it in a custom validator. |
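
As a rough sketch of such a validator, assume a field should accept any non-string sequence of integers, a case the list above does not cover. The function name `sequence_of_ints` is hypothetical and the block uses only the standard library. A plain function like this is the kind of callable that could be passed to `validated_field(validator=...)`.

```python
from collections.abc import Sequence


def sequence_of_ints(value):
    """Hypothetical validator: accept any non-string sequence of ints."""
    # str and bytes are sequences too, so they are rejected explicitly
    if isinstance(value, (str, bytes)) or not isinstance(value, Sequence):
        raise ValueError(f"Expected a sequence of ints, got {type(value).__name__}")
    for item in value:
        # bool is a subclass of int, so it is rejected explicitly as well
        if isinstance(item, bool) or not isinstance(item, int):
            raise ValueError(f"Expected int items, got {item!r}")


sequence_of_ints([64, 128])  # passes silently
sequence_of_ints((1, 2, 3))  # tuples are sequences too
```

The validator only checks the value and returns nothing, which matches the "raise if invalid" contract described for custom validators.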
| 144 | + |
| 145 | +### Class validators |
| 146 | + |
| 147 | +Methods named `validate_xxx` are treated as class validators. These methods must take only `self` as an argument. Class validators are run once during initialization, right after `__post_init__`. You can define as many of them as needed; they are executed sequentially in the order they appear. |
| 148 | + |
| 149 | +Note that class validators are not automatically re-run when a field is updated after initialization. To manually re-validate the object, you need to call `obj.validate()`. |
| 150 | + |
| 151 | +```python |
| 152 | +from dataclasses import dataclass |
| 153 | +from huggingface_hub.dataclasses import strict |
| 154 | + |
| 155 | +@strict |
| 156 | +@dataclass |
| 157 | +class Config: |
| 158 | +    foo: str |
| 159 | +    foo_length: int |
| 160 | +    upper_case: bool = False |
| 161 | + |
| 162 | +    def validate_foo_length(self): |
| 163 | +        if len(self.foo) != self.foo_length: |
| 164 | +            raise ValueError(f"foo must be {self.foo_length} characters long, got {len(self.foo)}") |
| 165 | + |
| 166 | +    def validate_foo_casing(self): |
| 167 | +        if self.upper_case and self.foo.upper() != self.foo: |
| 168 | +            raise ValueError(f"foo must be uppercase, got {self.foo}") |
| 169 | + |
| 170 | +config = Config(foo="bar", foo_length=3) # ok |
| 171 | + |
| 172 | +config.upper_case = True |
| 173 | +config.validate() # Raises StrictDataclassClassValidationError |
| 174 | + |
| 175 | +Config(foo="abcd", foo_length=3) # Raises StrictDataclassClassValidationError |
| 176 | +Config(foo="Bar", foo_length=3, upper_case=True) # Raises StrictDataclassClassValidationError |
| 177 | +``` |
| 178 | + |
| 179 | +<Tip warning={true}> |
| 180 | + |
| 181 | +`validate` is a reserved method name on strict dataclasses. |
| 182 | +To prevent unexpected behavior, a [`StrictDataclassDefinitionError`] is raised if your class already defines a method with that name. |
| 183 | + |
| 184 | +</Tip> |
| 185 | + |
| 186 | +## API Reference |
| 187 | + |
| 188 | +### `@strict` |
| 189 | + |
| 190 | +The `@strict` decorator enhances a dataclass with strict validation. |
| 191 | + |
| 192 | +[[autodoc]] dataclasses.strict |
| 193 | + |
| 194 | +### `as_validated_field` |
| 195 | + |
| 196 | +Decorator to create a [`validated_field`]. Recommended for fields with a single validator to avoid boilerplate code. |
| 197 | + |
| 198 | +[[autodoc]] dataclasses.as_validated_field |
| 199 | + |
| 200 | +### `validated_field` |
| 201 | + |
| 202 | +Creates a dataclass field with custom validation. |
| 203 | + |
| 204 | +[[autodoc]] dataclasses.validated_field |
| 205 | + |
| 206 | +### Errors |
| 207 | + |
| 208 | +[[autodoc]] errors.StrictDataclassError |
| 209 | + |
| 210 | +[[autodoc]] errors.StrictDataclassDefinitionError |
| 211 | + |
| 212 | +[[autodoc]] errors.StrictDataclassFieldValidationError |
| 213 | + |
| 214 | +## Why Not Use `pydantic`? (or `attrs`? or `marshmallow_dataclass`?) |
| 215 | + |
| 216 | +- See discussion in https://github.com/huggingface/transformers/issues/36329 regarding adding Pydantic as a dependency. It would be a heavy addition and require careful logic to support both v1 and v2. |
| 217 | +- We don't need most of Pydantic's features, especially those related to automatic casting, jsonschema, serialization, aliases, etc. |
| 218 | +- We don't need the ability to instantiate a class from a dictionary. |
| 219 | +- We don't want to mutate data. In `@strict`, "validation" means "checking if a value is valid." In Pydantic, "validation" means "casting a value, possibly mutating it, and then checking if it's valid." |
| 220 | +- We don't need blazing-fast validation. `@strict` isn't designed for heavy loads where performance is critical. Common use cases involve validating a model configuration (performed once and negligible compared to running a model). This allows us to keep the code minimal. |
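
To make the "check, don't cast" distinction above concrete, here is a minimal standard-library sketch. It is not the implementation used by `@strict`. The helper `check` is hypothetical and handles only `Any`, `Union`/`Optional`, `List`, and plain classes.

```python
from typing import Any, List, Optional, Union, get_args, get_origin


def check(value: Any, hint: Any) -> None:
    """Raise TypeError if `value` does not match `hint`. Never modifies `value`."""
    if hint is Any:
        return
    origin = get_origin(hint)
    if origin is Union:  # Optional[X] is Union[X, None]
        for option in get_args(hint):
            try:
                check(value, option)
                return  # one matching option is enough
            except TypeError:
                continue
        raise TypeError(f"{value!r} does not match any option of {hint}")
    if origin is list:
        if not isinstance(value, list):
            raise TypeError(f"expected a list, got {type(value).__name__}")
        args = get_args(hint)
        item_hint = args[0] if args else Any
        for item in value:
            check(item, item_hint)
        return
    if not isinstance(value, hint):
        raise TypeError(f"expected {hint.__name__}, got {type(value).__name__}")


check(3, int)               # passes, value untouched
check([1, 2], List[int])    # passes
check(None, Optional[int])  # passes

try:
    check("3", int)         # a casting validator would turn this into 3
except TypeError as err:
    print(err)
```

A casting validator would accept `"3"` and store `3`. The sketch rejects it and leaves the caller's data as it was.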