Commit 3542afb

New fuzzing platform (#1457)
1 parent 1042fb7 commit 3542afb

61 files changed, +4107 -299 lines changed

docs/Fuzzing Platform.md

Lines changed: 237 additions & 0 deletions
@@ -0,0 +1,237 @@
# Fuzzing Platform (FP) Design

**Problem:** fuzzing is a versatile technique for generating values to be used as method arguments. Normally, to generate values, one needs information about the method signature, or rather about the parameter types (if the fuzzer is able to "understand" them). The _white-box_ approach also requires the AST, and the _grey-box_ approach needs coverage information. To generate values that may serve as method arguments, the fuzzer uses generators, mutators, and predefined values.

* _Generators_ yield concrete objects created from descriptions. The basic description for creating objects is a _type_. Constants, regular expressions, and other structured object specifications (e.g. in HTML) may also be used as descriptions.

* _Mutators_ modify an object according to some logic, which usually means random changes. To get better results, mutators obtain feedback (information on coverage and the inner state of the program) during the method call.

* _Predefined values_ work well for known problems, e.g. incorrect symbol sequences. To discover potential problems, one can analyze parameter names as well as specific constructs or method calls inside the method body.

The general API for using the fuzzer looks like this:

```
22+
fuzz(
23+
params = "number", "string", "object<object, number>: number, string",
24+
seedGenerator = (type: Type) -> seeds
25+
details: (constants, providers, etc)
26+
).forEveryGeneratedValues { values: List ->
27+
feedback = exec(values);
28+
return feedback
29+
}
30+
```
31+
32+
Fuzzer accepts list of types which can be provided in different formats: string, object or Class<*> in Java. Then seed
33+
generator accepts these types and produces seeds which are used as base objects for value generation and mutations.
34+
Fuzzing logic about how to choose, combine and mutate values from seed set is only fuzzing responsibility. API should not provide such abilities except general fuzzing configuring.
35+
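
The call shape above can be sketched in Python as follows. Everything here is a hypothetical illustration, not the platform's actual interface: the names `fuzz`, `seed_generator`, `exec_target`, the string type names, the `max_runs` limit, and the dictionary-shaped feedback are all assumptions. The sketch only combines seeds and collects feedback; it does not mutate values.

```python
import itertools
from typing import Any, Callable, Dict, List


def seed_generator(type_name: str) -> List[Any]:
    """Hypothetical seed generator: maps a type description to base values."""
    seeds = {"number": [0, 1, -1], "string": ["", "abc"]}
    return seeds.get(type_name, [None])


def fuzz(params: List[str],
         seed_generator: Callable[[str], List[Any]],
         on_values: Callable[[List[Any]], Dict[str, Any]],
         max_runs: int = 100) -> List[Dict[str, Any]]:
    """Call `on_values` once per generated argument list and collect its feedback.

    How seeds are chosen and combined stays inside the fuzzer; the caller
    only supplies types, a seed generator, and the execution callback.
    """
    seed_sets = [seed_generator(p) for p in params]
    feedbacks = []
    for values in itertools.islice(itertools.product(*seed_sets), max_runs):
        feedbacks.append(on_values(list(values)))
    return feedbacks


def exec_target(values: List[Any]) -> Dict[str, Any]:
    """Stand-in for running the method under test and reporting the branch taken."""
    number, text = values
    return {"values": values, "branch": "long" if len(text) > number else "short"}


feedbacks = fuzz(["number", "string"], seed_generator, exec_target)
print(len(feedbacks))  # 6
```

The callback plays the role of the `forEveryGeneratedValues` block: it receives the values, executes the target, and returns feedback to the fuzzer.
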
## Parameters

The general fuzzing process takes a list of parameter descriptions as input and returns the corresponding list of values. The simplest description is a specific object type, for example:

```kotlin
[Int, Bool]
```

In this particular case, the fuzzing process can generate the set of all pairs having an integer as the first value and `true` or `false` as the second one. If the values `-3, 0, 10` are generated as the `Int` values, the set of all possible combinations has six items: `(-3, false), (0, false), (10, false), (-3, true), (0, true), (10, true)`. Depending on the programming language, one may use interface descriptions or annotations (type hints) instead of defining a specific type. The fuzzing platform (FP) is not able to create concrete objects, as it does not deal with specific languages. It can still convert the descriptions to known constructs it can work with.

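
The six combinations mentioned above are simply the Cartesian product of the per-parameter value sets. A minimal Python check, where the variable names are chosen for illustration only:

```python
from itertools import product

int_values = [-3, 0, 10]       # values generated for the Int parameter
bool_values = [False, True]    # all values of the Bool parameter

# Every pair (int, bool): 3 * 2 = 6 combinations.
combinations = list(product(int_values, bool_values))
print(len(combinations))  # 6
```
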
Say, in most programming languages, any integer may be represented as a bit array, and the fuzzer can construct and modify bit arrays. So, in the general case, the boundary values for a signed integer are these bit arrays (most significant bit first):

* [0, 0, 0, ..., 0] - zero
* [1, 0, 0, ..., 0] - minimum value
* [0, 1, 1, ..., 1] - maximum value
* [0, 0, ..., 0, 1] - plus 1
* [1, 1, 1, ..., 1] - minus 1

The same representation works for unsigned integers as well:

* [0, 0, 0, ..., 0] - zero (minimum value)
* [1, 0, 0, ..., 0] - maximum value / 2 + 1 (integer division)
* [0, 1, 1, ..., 1] - maximum value / 2 (integer division)
* [0, 0, ..., 0, 1] - plus 1
* [1, 1, 1, ..., 1] - maximum value

Thus, FP interprets the _Byte_ and _Unsigned Byte_ descriptions in different ways: in the former case, the maximum value is [0, 1, 1, 1, 1, 1, 1, 1], while in the latter case it is [1, 1, 1, 1, 1, 1, 1, 1]. FP types are described in detail further below.

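
The two interpretations can be checked with a short Python sketch. It assumes two's complement for signed values and most-significant-bit-first arrays; the helper names `to_int` and `boundary_bit_arrays` are hypothetical.

```python
from typing import List


def to_int(bits: List[int], signed: bool) -> int:
    """Interpret a most-significant-bit-first bit array as an integer.

    Signed values are assumed to use two's complement.
    """
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    if signed and bits[0] == 1:
        value -= 1 << len(bits)
    return value


def boundary_bit_arrays(size: int) -> List[List[int]]:
    """The five boundary patterns listed above, for a given bit size."""
    return [
        [0] * size,                # all zeros
        [1] + [0] * (size - 1),    # only the top bit set
        [0] + [1] * (size - 1),    # everything but the top bit set
        [0] * (size - 1) + [1],    # only the lowest bit set
        [1] * size,                # all ones
    ]


signed_byte = [to_int(b, signed=True) for b in boundary_bit_arrays(8)]
unsigned_byte = [to_int(b, signed=False) for b in boundary_bit_arrays(8)]
print(signed_byte)    # [0, -128, 127, 1, -1]
print(unsigned_byte)  # [0, 128, 127, 1, 255]
```

The same five bit patterns cover the boundaries of both interpretations, which is why the platform only needs to know the bit size and signedness of a type.
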
## Refined parameter description

During the fuzzing process, some parameters get a refined description, for example:

```java
public boolean isNaN(Number n) {
    if (!(n instanceof Double)) {
        return false;
    }
    return Double.isNaN((Double) n);
}
```

In the above example, suppose the fuzzer first passes an `Integer`. Considering the feedback, the fuzzer infers that nothing but a `Double` might increase coverage, so the parameter type may be narrowed down to `Double`. This allows for filtering out values that are unfitting a priori.

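
One way such a refinement could work is sketched below in Python, with `float` standing in for `Double` and `int` for `Integer`. The heuristic (keep the types whose values reached the most distinct branches) and the names `is_nan` and `refine` are assumptions made for this illustration; the document does not specify how the platform refines descriptions.

```python
import math
from typing import Any, Dict, List, Set


def is_nan(n: Any) -> str:
    """Python stand-in for the method above; returns the branch taken as feedback."""
    if not isinstance(n, float):
        return "not-double"
    return "nan" if math.isnan(n) else "number"


def refine(candidates: List[Any]) -> Set[type]:
    """Keep the types whose values reached the largest number of distinct branches."""
    branches_by_type: Dict[type, Set[str]] = {}
    for value in candidates:
        branches_by_type.setdefault(type(value), set()).add(is_nan(value))
    best = max(len(branches) for branches in branches_by_type.values())
    return {t for t, branches in branches_by_type.items() if len(branches) == best}


candidates = [1, 42, 0.5, float("nan"), -7]
useful_types = refine(candidates)
# The refined description acts as a filter for later candidates.
filtered = [v for v in candidates if type(v) in useful_types]
print(useful_types)  # {<class 'float'>}
```
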
## Statically and dynamically generated values

Predefined, or _statically_ generated, values help to define the initial range of values that could be used as method arguments. These values allow us to:

* check if it is possible to call the given method with at least some set of values as arguments,
* gather statistics on executing the program,
* refine the parameter description.

_Dynamic_ values are generated in two ways:

* internally, via mutating existing values that were successfully used as method arguments (i.e. seeds);
* externally, via obtaining feedback that can return not only statistics on the execution (the paths explored, the time spent, etc.) but also a set of new values to be blended with the values already in use.

Dynamic values should have a higher sampling priority: they should be chosen either first or at least with a higher probability than the statically generated ones. In general, the algorithm that guides the fuzzing process looks like this:

```
# dynamic values are stored with respect to their return priority
dynamic_values = empty_priority_queue()
# static values are generated beforehand
static_values = generate()
# "good" values, grouped by the feedback they produced
seeded_values = {}
# filters that reject a priori unfitting values (e.g. after a description is refined)
filters = []

# the loop runs while the application wants to continue (e.g. until coverage reaches 100%)
while app.should_fuzz(seeded_values.feedbacks):
    # first we choose all dynamic values
    # if there are no dynamic values, choose the static ones
    value = dynamic_values.take() ?: static_values.take_random()
    # if there is no value or it was filtered out (static values are generated in advance,
    # so they can be random and unfitting), try to generate a new value via mutating the seeds
    if value is null or filters.is_filtered(value):
        value = mutate(seeded_values.random_value())
    # if there is still no value at this point, there are no appropriate values at all,
    # and the process stops
    if value is null: break

    # run with the given values and obtain feedback
    feedback = yield_run(value)
    # feedback says if it is reasonable to add the current value to the set of seeds
    if feedback is good:
        seeded_values[feedback] += value
    # feedback may also provide the fuzzer with new values
    if feedback has suggested_values:
        dynamic_values += feedback.suggested_values() with high_priority

    # after a static value was used, enqueue a mutated seed so that the fuzzer
    # alternates static and dynamic values
    if value.is_static_generated:
        dynamic_values += mutate(seeded_values.random_value()) with low_priority
```

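
A runnable Python approximation of this loop is given below as a sketch under stated assumptions: values are plain integers, "good" feedback means "a branch not seen before", the loop is bounded by a hypothetical `max_runs` instead of `app.should_fuzz`, and `target`, `mutate`, and `is_filtered` are caller-supplied stand-ins.

```python
import heapq
import random
from typing import Callable, Dict, List, Optional, Tuple

HIGH, LOW = 0, 1  # a smaller number is taken from the queue first


def run_fuzzing(static_values: List[int],
                target: Callable[[int], Tuple[str, List[int]]],
                mutate: Callable[[int], int],
                is_filtered: Callable[[int], bool] = lambda v: False,
                max_runs: int = 200,
                seed: int = 0) -> Dict[str, List[int]]:
    rng = random.Random(seed)
    dynamic: List[Tuple[int, int, int]] = []  # heap of (priority, insertion order, value)
    static = list(static_values)
    seeds: Dict[str, List[int]] = {}          # feedback -> values that produced it
    order = 0

    def random_seed() -> int:
        return rng.choice(rng.choice(list(seeds.values())))

    for _ in range(max_runs):
        value: Optional[int] = None
        from_static = False
        if dynamic:                           # dynamic values first
            value = heapq.heappop(dynamic)[2]
        elif static:                          # then a random static value
            value = static.pop(rng.randrange(len(static)))
            from_static = True
        if value is None or is_filtered(value):
            value = mutate(random_seed()) if seeds else None
        if value is None:                     # nothing left to try
            break

        branch, suggested = target(value)
        if branch not in seeds:               # "good" feedback: a new branch
            seeds[branch] = [value]
        for s in suggested:                   # values proposed by the feedback
            heapq.heappush(dynamic, (HIGH, order, s))
            order += 1
        if from_static and seeds:             # alternate static and dynamic values
            heapq.heappush(dynamic, (LOW, order, mutate(random_seed())))
            order += 1
    return seeds


def target(x: int) -> Tuple[str, List[int]]:
    """Toy method under test: returns the branch taken and the values it suggests."""
    if x == 1000:
        return "magic", []
    if x > 500:
        return "large", [1000]  # e.g. a constant observed in a comparison
    return "small", []


found = run_fuzzing([0, 10, 600], target, mutate=lambda v: v + 1)
print(sorted(found))  # ['large', 'magic', 'small']
```

The value `1000` is never among the static values; it is reached only because the feedback for the "large" branch suggests it and suggested values are taken with high priority.
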
## Helping fuzzer via code modification

Sometimes it is reasonable to modify the source code to make it easier to fuzz. One possible approach is to split a complex _if_-statement into a sequence of simpler _if_-statements. See [Circumventing Fuzzing Roadblocks with Compiler Transformations](https://lafintel.wordpress.com/2016/08/15/circumventing-fuzzing-roadblocks-with-compiler-transformations/) for details.

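
As an illustration of the idea, the two hypothetical Python functions below are equivalent, but the second one exposes each part of the condition as a separate branch. The functions and their conditions are invented for this example; the linked article describes transformations applied by a compiler rather than by hand.

```python
def check_original(a: int, b: int, s: str) -> bool:
    # One compound condition: coverage changes only when all three parts hold at once.
    if a > 100 and b == 42 and s.startswith("ok"):
        return True
    return False


def check_split(a: int, b: int, s: str) -> bool:
    # The same logic as nested simple conditions: each satisfied part is a new
    # branch, so coverage feedback can reward partial progress.
    if a > 100:
        if b == 42:
            if s.startswith("ok"):
                return True
    return False
```

With the split version, a grey-box fuzzer that finds `a > 100` gets new coverage and can keep that value as a seed while it searches for `b` and `s`.
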
## Generators

There are two types of generators:

* yielding values of primitive data types: integers, strings, booleans
* yielding values of recursive data types: objects, lists

Sometimes it is necessary not only to create an object but to modify it as well. We can apply fuzzing to the fuzzer-generated values that need to be modified. For example, you have the `HashMap.java` class, and you need to generate three modifications for it using `put(key, value)`. For this purpose, you may ask the fuzzer for six parameters `(key, value, key, value, key, value)` and get the values needed for the modifications.

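
The `HashMap` example can be sketched in Python with a `dict` playing the role of the map. The function `fuzz_values` is a hypothetical stand-in for the fuzzer that returns one random value per requested type; the type names and value ranges are assumptions.

```python
import random
from typing import Any, Dict, List


def fuzz_values(types: List[str], rng: random.Random) -> List[Any]:
    """Hypothetical stand-in for the fuzzer: one random value per requested type."""
    makers = {
        "int": lambda: rng.randint(-10, 10),
        "string": lambda: "".join(rng.choice("ab") for _ in range(rng.randint(0, 3))),
    }
    return [makers[t]() for t in types]


def build_map(modifications: int, rng: random.Random) -> Dict[str, int]:
    # Request (key, value) * modifications parameters in a single fuzzing call.
    flat = fuzz_values(["string", "int"] * modifications, rng)
    result: Dict[str, int] = {}
    for i in range(0, len(flat), 2):
        key, value = flat[i], flat[i + 1]
        result[key] = value  # plays the role of HashMap.put(key, value)
    return result


generated = build_map(3, random.Random(1))
print(generated)
```

The resulting map may hold fewer than three entries, because the fuzzer is free to generate the same key twice; repeated keys are themselves an interesting case for `put`.
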
Primitive type generators allow for yielding:

1. Signed integers of a given size (8, 16, 32, and 64 bits, usually)
2. Unsigned integers of a given size
3. Floating-point numbers with a given size of significand and exponent according to IEEE 754
4. Booleans: _True_ and _False_
5. Characters (in UTF-16 format)
6. Strings (consisting of UTF-16 characters)

The fuzzer should provide out-of-the-box support for these types: it should be able to create, modify, and process them. To work with multiple languages, it is enough to specify the possible type sizes and to describe and create the concrete objects based on the FP-generated values.

The recursive types include two categories:

* Collections (arrays and lists)
* Objects

Collections may be nested and have _n_ dimensions (one, two, three, or more).

Collections may be:

* of a fixed size (e.g., arrays)
* of a variable size (e.g., lists and dictionaries)

Objects may have:

1. Constructors with parameters

2. Modifiable inner fields

3. Modifiable global values (the static ones)

4. Calls for modifying methods

FP should be able to create and describe such objects in the form of a tree. The semantics of the actual modifications is the responsibility of the programming language.

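
A tree-shaped description of this kind might look like the following Python sketch. The `Node` class, its fields, and the `ArrayList`/`Point` example are hypothetical; they only illustrate how constructor arguments and modifying calls can nest.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class Node:
    """One node of a hypothetical creation tree: a constructor or method call."""
    call: str
    args: List[Any] = field(default_factory=list)            # primitives or nested Nodes
    modifications: List["Node"] = field(default_factory=list)


def render(node: Any, indent: int = 0) -> List[str]:
    """Flatten the tree into indented lines, one per call or primitive value."""
    pad = "  " * indent
    if not isinstance(node, Node):
        return [f"{pad}{node!r}"]
    lines = [f"{pad}{node.call}"]
    for arg in node.args:
        lines += render(arg, indent + 1)
    for mod in node.modifications:
        lines += render(mod, indent + 1)
    return lines


# A list created by a no-argument constructor and modified by two add(...) calls.
tree = Node("ArrayList()", modifications=[
    Node("add", [Node("Point(x, y)", [1, -1])]),
    Node("add", [Node("Point(x, y)", [0, 0])]),
])
print("\n".join(render(tree)))
```

The platform would only store and mutate such a tree; turning `Point(x, y)` or `add` into real calls is left to the language-specific side.
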
## Typing

FP does not use the concept of _type_ for creating objects. Instead, FP introduces the concept of a _task_, which encapsulates the description of a type that should be used to create an object. Generally, a task consists of two blocks: the task for initializing a value and the list of tasks for modifying the initialized value.

```
Task = [
    Initialization: [T1, T2, T3, ..., TN]
    Modification(Initialization): [
        M1: [T1, T2, ..., TK],
        M2: [T1, T2, ..., TJ],
        ...
        MH: [T1, T2, ..., TI],
    ]
]
```

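
The task structure above maps naturally onto a small recursive data type. The Python sketch below is one possible encoding; the `Task` class, its field names, and `parameter_count` are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Union

Description = Union[str, "Task"]  # a primitive type name or a nested task


@dataclass
class Task:
    initialization: List[Description]
    modifications: List[List[Description]] = field(default_factory=list)

    def parameter_count(self) -> int:
        """Total number of primitive values the fuzzer must supply for this task."""
        def count(items: List[Description]) -> int:
            return sum(i.parameter_count() if isinstance(i, Task) else 1 for i in items)
        return count(self.initialization) + sum(count(m) for m in self.modifications)


# A map created by a no-argument constructor and modified by three put(key, value) calls.
hash_map = Task(initialization=[], modifications=[["STRING", "INT"]] * 3)
print(hash_map.parameter_count())  # 6
```

Because a description may itself be a `Task`, nested objects and collections are expressed without the platform knowing anything about the target language's types.
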
Thus, we can group the tasks as follows:

```
1. Trivial task = [
    Initialization: [INT|UNSIGNED.INT|FLOAT.POINT.NUMBER|BOOLEAN|CHAR|STRING]
    Modification(Initialization): []
]


2. Task for creating an array = [
    Initialization: [UNSIGNED.INT]
    Modification(UNSIGNED.INT) = [T] * UNSIGNED.INT
]

or

2. Task for creating an array = [
    Initialization: [UNSIGNED.INT]
    Modification(UNSIGNED.INT) = [[T * UNSIGNED.INT]]
]

where "*" means repeating the type the specified number of times

3. Task for creating an object = [
    Initialization: [T1, T2, ... TN],
    Modification(Initialization) = [
        ...
    ]
]

```

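
The array task can be resolved in two steps: the initialization yields a length, and the modification expands to that many element tasks. A minimal Python sketch follows; the helper names `trivial` and `array` and the value ranges are assumptions, and only three of the trivial kinds are covered.

```python
import random
from typing import Any, List


def trivial(kind: str, rng: random.Random) -> Any:
    """Resolve a trivial task: a single primitive value, no modifications."""
    if kind == "UNSIGNED.INT":
        return rng.randint(0, 4)  # kept small so that arrays stay short
    if kind == "INT":
        return rng.randint(-100, 100)
    if kind == "BOOLEAN":
        return rng.random() < 0.5
    raise ValueError(f"unsupported trivial task: {kind}")


def array(element_kind: str, rng: random.Random) -> List[Any]:
    """Resolve an array task: Initialization yields the length n, Modification is [T] * n."""
    length = trivial("UNSIGNED.INT", rng)
    element_tasks = [element_kind] * length
    return [trivial(kind, rng) for kind in element_tasks]


rng = random.Random(7)
values = array("INT", rng)
print(values)
```
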
Therefore, each programming language defines how to interpret a certain type and how to infer it. This allows the fuzzer to store and mutate complex objects without any additional support from the language.

settings.gradle.kts

Lines changed: 1 addition & 0 deletions
@@ -22,6 +22,7 @@ include("utbot-framework-api")
 include("utbot-intellij")
 include("utbot-sample")
 include("utbot-fuzzers")
+include("utbot-fuzzing")
 include("utbot-junit-contest")
 include("utbot-analytics")
 include("utbot-analytics-torch")
