Clean up OverallArchitecture, Fuzzing and Logging docs #1873

Merged
merged 2 commits into from
Mar 6, 2023
91 changes: 50 additions & 41 deletions docs/Fuzzing Platform.md
@@ -2,21 +2,22 @@

**Problem:** fuzzing is a versatile technique for generating values to be used as method arguments. Normally,
to generate values, one needs information on a method signature, or rather on the parameter types (if a fuzzer is
able to "understand" them). _White-box_ approach also requires AST, and _grey-box_ approach needs coverage
able to "understand" them).
The _white-box_ approach also requires an AST, and the _grey-box_ approach needs coverage
information. To generate values that may serve as method arguments, the fuzzer uses generators, mutators, and
predefined values.

* _Generators_ yield concrete objects created by descriptions. The basic description for creating objects is _type_.
Constants, regular expressions, and other structured object specifications (e.g. in HTML) may be also used as
Constants, regular expressions, and other structured object specifications (e.g. in HTML) may also be used as
descriptions.

* _Mutators_ modify the object in accordance with some logic that usually means random changes. To get better
results, mutators obtain feedback (information on coverage and the inner state of the
program) during method call.

* _Predefined values_ work well for known problems, e.g. incorrect symbol sequences. To discover potential problems one can analyze parameter names as well as the specific constructs or method calls inside the method body.
* _Predefined values_ work well for known problems, e.g. incorrect symbol sequences. To discover potential problems, one can analyze parameter names as well as the specific constructs or method calls inside the method body.

General API for using fuzzer looks like this:
The general API for using the fuzzer looks like this:

```
fuzz(
@@ -29,9 +30,14 @@ fuzz(
}
```

Fuzzer accepts list of types which can be provided in different formats: string, object or Class<*> in Java. Then seed
generator accepts these types and produces seeds which are used as base objects for value generation and mutations.
Fuzzing logic about how to choose, combine and mutate values from seed set is only fuzzing responsibility. API should not provide such abilities except general fuzzing configuring.
The fuzzer gets the list of types,
which can be provided in different formats: as a string, an object, or a Class<*> in Java.
The seed generator accepts these types and produces seeds.
The seeds are base objects for value generation and mutations.

It is the fuzzer that is responsible for choosing, combining, and mutating values from the seed set.
The fuzzer API should not provide access to the inner fuzzing logic.
Only general configuration is available.
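
A minimal sketch of this pipeline, written in Python for brevity. All names here (`SEEDS`, `seed_generator`, `fuzz`, `limit`) are hypothetical and are not part of the UnitTestBot API. The sketch only illustrates the division of labour: the caller passes types and general configuration, while choosing and combining seeds stays inside the fuzzer.

```python
import itertools
import random

# Hypothetical seed table: type description -> base values (seeds).
SEEDS = {
    "int": [-1, 0, 1],
    "bool": [False, True],
    "string": ["", "abc"],
}


def seed_generator(type_name):
    """Return base objects (seeds) for a type description."""
    return list(SEEDS.get(type_name, [None]))


def fuzz(types, limit=10, rng=None):
    """Yield up to `limit` argument tuples for the given parameter types.

    How seeds are chosen and combined is decided here, inside the fuzzer;
    the caller only supplies types and general configuration.
    """
    rng = rng or random.Random(0)
    seeds = [seed_generator(t) for t in types]
    combos = list(itertools.product(*seeds))  # every combination of seeds
    rng.shuffle(combos)
    yield from combos[:limit]


values = list(fuzz(["int", "bool"], limit=4))
```

Here `values` holds four distinct `(int, bool)` pairs drawn from the six possible combinations of the seeds.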

## Parameters

@@ -42,26 +48,34 @@ The general fuzzing process gets the list of parameter descriptions as input and
```

In this particular case, the fuzzing process can generate the set of all the pairs having integer as the first value
and `true` or `false` as the second one. If values `-3, 0, 10` are generated to be the `Int` values, the set of all the possible combinations has six items: `(-3, false), (0, false), (10, false), (-3, true), (0, true), (10, true)`. Depending on the programming language, one may use interface descriptions or annotations (type hints) instead of defining the specific type. Fuzzing platform (FP) is not able to create the concrete objects as it does not deal with the specific languages. It still can convert the descriptions to the known constructs it can work with.

Say, in most of the programming languages, any integer may be represented as a bit array, and fuzzer can construct and
modify bit arrays. So, in general case, the boundary values for the integer are these bit arrays:

* [0, 0, 0, ..., 0] - null
* [1, 0, 0, ..., 0] - minimum value
* [0, 1, 1, ..., 1] - maximum value
* [0, 0, ..., 0, 1] - plus 1
* [1, 1, 1, ..., 1] - minus 1
and `true` or `false` as the second one.
If values `-3, 0, 10` are generated to be the `Int` values, the set of all the possible combinations has six items:
`(-3, false), (0, false), (10, false), (-3, true), (0, true), (10, true)`.
Depending on the programming language,
one may use interface descriptions or annotations (type hints) instead of defining the specific type.
The fuzzing platform (FP) cannot create concrete objects, as it does not deal with specific languages.
It can still convert the descriptions to the known constructs it can work with.

For example, in most programming languages, any integer may be represented as a bit array, and the fuzzer can construct and
modify bit arrays. So, in the general case, the boundary values for the integer are these bit arrays:

* [0, 0, 0, ..., 0] — zero
* [1, 0, 0, ..., 0] — minimum value
* [0, 1, 1, ..., 1] — maximum value
* [0, 0, ..., 0, 1] — plus 1
* [1, 1, 1, ..., 1] — minus 1

One can correctly use this representation for unsigned integers as well:

* [0, 0, 0, ..., 0] - null (minimum value)
* [1, 0, 0, ..., 0] - maximum value / 2
* [0, 1, 1, ..., 1] - maximum value / 2 + 1
* [0, 0, ..., 0, 1] - plus 1
* [1, 1, 1, ..., 1] - maximum value
* [0, 0, 0, ..., 0] — zero (minimum value)
* [1, 0, 0, ..., 0] — maximum value / 2 + 1
* [0, 1, 1, ..., 1] — maximum value / 2
* [0, 0, ..., 0, 1] — plus 1
* [1, 1, 1, ..., 1] — maximum value

Thus, FP interprets the _Byte_ and _Unsigned Byte_ descriptions in different ways: in the former case, the maximum value is [0, 1, 1, 1, 1, 1, 1, 1], while in the latter case it is [1, 1, 1, 1, 1, 1, 1, 1]. FP types are described in details further.
Thus, FP interprets the _Byte_ and _Unsigned Byte_ descriptions in different ways: in the former case,
the maximum value is [0, 1, 1, 1, 1, 1, 1, 1], while in the latter case it is [1, 1, 1, 1, 1, 1, 1, 1].
FP types are described in detail below.

## Refined parameter description

@@ -79,19 +93,21 @@ public boolean isNaN(Number n) {
In the above example, let the parameter be `Integer`. Considering the feedback, the fuzzer suggests that nothing but `Double` might increase coverage, so the type may be downcast to `Double`. This allows for filtering out a priori unfitting values.

## Statically and dynamically generated values
Predefined, or _statically_ generated, values help to define the initial range of values, which could be used as method arguments. These values allow us to:
Predefined, or _statically_ generated, values help to define the initial range of values, which could be used as method arguments.

* check if it is possible to call the given method with at least some set of values as arguments,
* gather statistics on executing the program,
These values allow us to:
* check if it is possible to call the given method with at least some set of values as arguments;
* gather statistics on executing the program;
* refine the parameter description.

_Dynamic_ values are generated in two ways:

* internally — via mutating the existing values, successfully performed as method arguments (i.e. seeds);
* externally — via obtaining feedback that can return not only the statistics on the execution (the paths explored,
* internally, by mutating existing values that have already been used successfully as method arguments (i.e. seeds);
* externally, via obtaining feedback that can return not only the statistics on the execution (the paths explored,
the time spent, etc.) but also the set of new values to be blended with the values already in use.

Dynamic values should have the higher priority for a sample, that's why they should be chosen either first or at least more likely than the statically generated ones. In general, the algorithm that guides the fuzzing process looks like this:
Dynamic values should have a higher priority for a sample;
that is why they should be chosen either first or at least more likely than the statically generated ones.
In general, the algorithm that guides the fuzzing process looks like this:

```
# dynamic values are stored with respect to their return priority
@@ -135,7 +151,6 @@ Sometimes it is reasonable to modify the source code so that it makes applying f
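
The preference for dynamic values over statically generated ones can be illustrated with a small sketch. It is not the algorithm from this document, which is only partly visible here. The `pick_value` helper and the 0.8 probability are assumptions made purely for illustration.

```python
import random


def pick_value(dynamic, static, rng, dynamic_probability=0.8):
    """Pick the next value, preferring dynamically generated ones.

    `dynamic` is assumed to be ordered by priority (best first);
    `static` holds the predefined values.
    """
    if dynamic and (not static or rng.random() < dynamic_probability):
        return dynamic.pop(0)  # highest-priority dynamic value first
    if static:
        return rng.choice(static)
    return None  # nothing left to try


rng = random.Random(42)
dynamic = ["mutated-1", "mutated-2"]
static = ["", " ", "null"]
picked = [pick_value(dynamic, static, rng) for _ in range(5)]
```

Dynamic values are consumed in priority order and are chosen more often than static ones, yet static values still get a chance, so the initial range of values is never completely ignored.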
## Generators

There are two types of generators:

* yielding values of primitive data types: integers, strings, booleans
* yielding values of recursive data types: objects, lists

@@ -146,39 +161,33 @@ three
modifications for it using `put(key, value)`. For this purpose, you may ask the fuzzer to generate values for six
parameters `(key, value, key, value, key, value)` and get the necessary modified values.
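
As an illustration, the six fuzzed values can be folded pairwise into three `put` calls. The sketch assumes the object being modified is a map and uses a Python `dict` as a stand-in; `apply_puts` is a hypothetical helper, not a platform function.

```python
def apply_puts(target, fuzzed):
    """Apply put(key, value) for each consecutive (key, value) pair."""
    if len(fuzzed) % 2 != 0:
        raise ValueError("expected an even number of values: key, value, ...")
    for key, value in zip(fuzzed[0::2], fuzzed[1::2]):
        target[key] = value  # the dict counterpart of map.put(key, value)
    return target


# Six values as they might come back from the fuzzer for
# the parameters (key, value, key, value, key, value).
fuzzed = ["a", 1, "b", -3, "a", 0]
result = apply_puts({}, fuzzed)
```

Because the third pair reuses the key `"a"`, the resulting map has two entries: the last modification overwrites the first one.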

Primitive type generators allow for yielding

Primitive type generators allow for yielding:
1. Signed integers of a given size (8, 16, 32, and 64 bits, usually)
2. Unsigned integers of a given size
3. Floating-point numbers with a given size of significand and exponent according to IEEE 754
4. Booleans: _True_ and _False_
5. Characters (in UTF-16 format)
6. Strings (consisting of UTF-16 characters)

Fuzzer should be able to provide out-of-the-box support for these types — be able to create, modify, and process
them. To work with multiple languages it is enough to specify the possible type size and to describe and create the
The fuzzer should be able to provide out-of-the-box support for these types — be able to create, modify, and process
them.
To work with multiple languages, it is enough to specify the possible type size and to describe and create
concrete objects based on the FP-generated values.
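
For integers of a given size, the boundary values can be derived directly from the bit width. The sketch below builds the boundary bit patterns discussed earlier (most significant bit first) and decodes them as signed (two's complement) or unsigned values. The function names are illustrative only.

```python
def boundary_bits(size):
    """Boundary bit patterns for an integer of `size` bits, MSB first."""
    return [
        [0] * size,               # all zeros
        [1] + [0] * (size - 1),   # only the top bit set
        [0] + [1] * (size - 1),   # all bits but the top one set
        [0] * (size - 1) + [1],   # only the lowest bit set
        [1] * size,               # all ones
    ]


def decode(bits, signed):
    """Interpret a bit list as an unsigned or two's-complement integer."""
    value = int("".join(map(str, bits)), 2)
    if signed and bits[0] == 1:
        value -= 1 << len(bits)
    return value


signed_byte = [decode(b, signed=True) for b in boundary_bits(8)]
unsigned_byte = [decode(b, signed=False) for b in boundary_bits(8)]
# signed_byte   == [0, -128, 127, 1, -1]
# unsigned_byte == [0, 128, 127, 1, 255]
```

The same five patterns serve both interpretations. Only the decoding step depends on the type description.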

The recursive types include two categories:

* Collections (arrays and lists)
* Objects

Collections may be nested and have _n_ dimensions (one, two, three, or more).

Collections may be:

* of a fixed size (e.g., arrays)
* of a variable size (e.g., lists and dictionaries)

Objects may have:

1. Constructors with parameters

2. Modifiable inner fields

3. Modifiable global values (the static ones)

4. Calls for modifying methods

FP should be able to create and describe such objects in the form of a tree. The semantics of the actual modifications is the responsibility of the programming language.
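
One possible shape for such a tree, as a sketch: every node records how a value is obtained, and the language-specific side walks the tree to build the real object. The `Node` class, its fields, and the `Point` example are hypothetical, chosen only to mirror the four items listed above.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class Node:
    """Language-neutral description of a value to be created."""
    type_name: str
    value: Any = None  # set for primitive leaves
    constructor_args: List["Node"] = field(default_factory=list)
    fields: Dict[str, "Node"] = field(default_factory=dict)
    static_fields: Dict[str, "Node"] = field(default_factory=dict)
    calls: List[Tuple[str, List["Node"]]] = field(default_factory=list)


def count_nodes(node):
    """Count all nodes in the description tree."""
    children = list(node.constructor_args)
    children += list(node.fields.values())
    children += list(node.static_fields.values())
    for _, args in node.calls:
        children += args
    return 1 + sum(count_nodes(child) for child in children)


point = Node(
    "Point",
    constructor_args=[Node("int", 0), Node("int", 0)],
    fields={"label": Node("string", "origin")},
    calls=[("moveBy", [Node("int", 5), Node("int", -5)])],
)
```

The tree stays purely descriptive: it says which constructor arguments, field values, and method calls to use, and leaves the actual instantiation to the language-specific layer.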
29 changes: 22 additions & 7 deletions docs/OverallArchitecture.md
@@ -112,7 +112,7 @@ sequenceDiagram
The plugin provides
* a UI for the IntelliJ-based IDEs to use UnitTestBot directly from source code,
* the linkage between IntelliJ Platform API and UnitTestBot API,
* support for the most popular programming languages and frameworks for end users (the plugin and its optional dependencies are described in [plugin.xml](https://github.com/UnitTestBot/UTBotJava/blob/main/utbot-intellij/src/main/resources/META-INF/plugin.xml) and nearby, in the [`META-INF`](https://github.com/UnitTestBot/UTBotJava/tree/main/utbot-intellij/src/main/resources/META-INF) folder.
* support for the most popular programming languages and frameworks for end users (the plugin and its optional dependencies are described in [plugin.xml](https://github.com/UnitTestBot/UTBotJava/blob/main/utbot-intellij/src/main/resources/META-INF/plugin.xml) and nearby, in the [`META-INF`](https://github.com/UnitTestBot/UTBotJava/tree/main/utbot-intellij/src/main/resources/META-INF) folder).

The main plugin module is [utbot-intellij](https://github.com/UnitTestBot/UTBotJava/tree/main/utbot-intellij), providing support for Java and Kotlin.
Also, there is an auxiliary [utbot-ui-commons](https://github.com/UnitTestBot/UTBotJava/tree/main/utbot-ui-commons) module to support providers for other languages.
@@ -124,7 +124,7 @@ As for the UI, there are two entry points:
The main plugin-specific features are:
* A common action for generating tests right from the editor or a project tree — with a generation scope from a single method up to the whole source root. See [GenerateTestsAction](https://github.com/UnitTestBot/UTBotJava/blob/main/utbot-intellij/src/main/kotlin/org/utbot/intellij/plugin/ui/actions/GenerateTestsAction.kt) — the same for all supported languages.
* Auto-installation of the user-chosen testing framework as a project library dependency (JUnit 4, JUnit 5, and TestNG are supported). See [UtIdeaProjectModelModifier](https://github.com/UnitTestBot/UTBotJava/blob/main/utbot-intellij/src/main/kotlin/org/utbot/intellij/plugin/util/UtIdeaProjectModelModifier.kt) and the Maven-specific version: [UtMavenProjectModelModifier](https://github.com/UnitTestBot/UTBotJava/blob/main/utbot-intellij/src/main/kotlin/org/utbot/intellij/plugin/util/UtMavenProjectModelModifier.kt).
* Suggesting the location for a test source root and auto-generating the `utbot_tests` folder there, providing users with a sandbox in their codespace.
* Suggesting the location for a test source root and auto-generating the `utbot_tests` folder there, providing users with a sandbox in their code space.
* Optimizing generated code with IDE-provided intentions (experimental). See [IntentionHelper](https://github.com/UnitTestBot/UTBotJava/blob/main/utbot-intellij/src/main/kotlin/org/utbot/intellij/plugin/generator/IntentionHelper.kt) for details.
* An option for distributing generation time between symbolic execution and fuzzing explicitly.
* Running generated tests while showing coverage with the IDE-provided measurement tools. See [RunConfigurationHelper](https://github.com/UnitTestBot/UTBotJava/blob/main/utbot-intellij/src/main/kotlin/org/utbot/intellij/plugin/util/RunConfigurationHelper.kt) for implementation.
@@ -241,7 +241,10 @@ The main instrumentation of UnitTestBot is [UtExecutionInstrumentation](https://
### Code generator

Code generation and rendering are a part of the test generation process in UnitTestBot.
UnitTestBot gets the synthetic representation of generated test cases from the fuzzer or the symbolic engine. This representation, or model, is implemented in the `UtExecution` class. The `codegen` module generates the real test code based on this `UtExecution` model and renders it in a human-readable form.
UnitTestBot gets the synthetic representation of generated test cases from the fuzzer or the symbolic engine.
This representation (or model) is implemented in the `UtExecution` class.
The `codegen` module generates the real test code based on this `UtExecution` model
and renders it in a human-readable form.

The `codegen` module
- converts `UtExecution` test information into an Abstract Syntax Tree (AST) representation using `CodeGenerator`,
@@ -287,7 +290,7 @@ To minimize the number of executions in a group, we use a simple greedy algorith
2. Add this execution to the final suite and mark new lines as covered.
3. Repeat the first step and continue while there are executions containing uncovered lines.
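
The greedy loop can be sketched as follows. The first step is not shown in this excerpt, so the sketch assumes that it picks the execution covering the most not-yet-covered lines. The `minimize` function and its input format (an execution id mapped to a set of covered line numbers) are hypothetical and are not taken from the `utbot-framework` code.

```python
def minimize(executions):
    """Greedy selection of executions that together cover all lines.

    `executions` maps an execution id to the set of line numbers it covers.
    """
    covered = set()
    suite = []
    remaining = dict(executions)
    while remaining:
        # Assumed step 1: take the execution adding the most uncovered lines.
        best = max(remaining, key=lambda e: len(remaining[e] - covered))
        gain = remaining.pop(best) - covered
        if not gain:
            break  # no execution contains uncovered lines any more
        # Step 2: add it to the final suite and mark its lines as covered.
        suite.append(best)
        covered |= gain
    return suite


executions = {
    "e1": {1, 2, 3},
    "e2": {3, 4},
    "e3": {1, 2},
    "e4": {4, 5, 6},
}
suite = minimize(executions)
# suite == ["e1", "e4"]: these two executions already cover lines 1..6
```

A greedy choice like this does not guarantee the smallest possible suite, but it is simple and usually removes most redundant executions.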

The whole minimization procedure is located in the [org.utbopt.framework.minimization](utbot-framework/src/main/kotlin/org/utbot/framework/minimization) package inside the [utbot-framework](../utbot-framework) module.
The whole minimization procedure is located in the [org.utbot.framework.minimization](../utbot-framework/src/main/kotlin/org/utbot/framework/minimization) package inside the [utbot-framework](../utbot-framework) module.

### Summarization module

@@ -309,7 +312,7 @@ For detailed information, please refer to the Summarization architecture design

### SARIF report generator

SARIF (Static Analysis Results Interchange Format) is a JSONbased format for displaying static analysis results.
SARIF (Static Analysis Results Interchange Format) is a JSON-based format for displaying static analysis results.

All the necessary information about the format and its usage can be found
in the [official documentation](https://github.com/microsoft/sarif-tutorials/blob/main/README.md)
@@ -346,7 +349,8 @@ UnitTestBot consists of three processes (according to the execution order):

These processes are built on top of the [Reactive distributed communication framework (Rd)](https://github.com/JetBrains/rd) developed by JetBrains.

One of the main Rd concepts is _Lifetime_ — it helps to release shared resources upon the object's termination. You can find the Rd basic ideas and UnitTestBot implementation details in the [Multiprocess architecture](https://github.com/UnitTestBot/UTBotJava/blob/main/docs/RD%20for%20UnitTestBot.md) design doc.
One of the main Rd concepts is a _Lifetime_ — it helps to release shared resources upon the object's termination.
You can find the Rd basic ideas and UnitTestBot implementation details in the [Multiprocess architecture](https://github.com/UnitTestBot/UTBotJava/blob/main/docs/RD%20for%20UnitTestBot.md) design doc.

### Settings

@@ -362,4 +366,15 @@ The end user has three places to change UnitTestBot behavior:
3. Controls in the **Generate Tests with UnitTestBot window** dialog — for per-generation settings.

### Logging
TODO

The UnitTestBot Java logging system is implemented across the IDE process, the Engine process, and the Instrumented process.

UnitTestBot Java logging relies on the `log4j2` library.
The custom Rd logging system is recommended as the default one for the Instrumented process.

In the [Logging](../docs/contributing/InterProcessLogging.md) document,
you can find how to configure the logging system when UnitTestBot Java is used
* as an IntelliJ IDEA plugin,
* as the Contest estimator or the Gradle/Maven plugins, via CLI or during the CI test runs.

Implementation details, log level and performance questions are also addressed [here](../docs/contributing/InterProcessLogging.md).
4 changes: 2 additions & 2 deletions docs/RD for UnitTestBot.md
@@ -77,7 +77,7 @@ executing all the callbacks because some other thread executes them.
Rd is a lightweight reactive one-to-one RPC protocol, which is cross-language as well as cross-platform. It can
work on the same or different machines via the Internet.

These are some of Rd entities:
These are some Rd entities:
- `Protocol` encapsulates the logic of all Rd communications. All the entities should be bound to `Protocol` before
being used. `Protocol` contains `IScheduler`, which executes a _runnable_ instance on a different thread.
- `RdSignal` is an entity allowing one to **fire and forget**. You can add a callback for every received message
@@ -228,7 +228,7 @@ Sometimes the _Instrumented process_ may unexpectedly die due to concrete execut
- **Important**: do not add [`Rdgen`](https://mvnrepository.com/artifact/com.jetbrains.rd/rd-gen) as
an implementation dependency — it breaks some JAR files as it contains `kotlin-compiler-embeddable`.
5. Logging & debugging:
- [Interprocess logging](./InterProcessLogging.md)
- [Interprocess logging](contributing/InterProcessLogging.md)
- [Interprocess debugging](./contributing/InterProcessDebugging.md)
6. Custom protocol marshaling types: do not spend time on it until `UtModels` get simpler, e.g. compatible with
`kotlinx.serialization`.
2 changes: 1 addition & 1 deletion docs/contributing/InterProcessDebugging.md
@@ -62,7 +62,7 @@ To debug the _Engine process_ and the _Instrumented process_, you need to enable
"-agentlib:jdwp=transport=dt_socket,server=n,suspend=n,quiet=y,address=12345"
```
See `org.utbot.intellij.plugin.process.EngineProcess.Companion.debugArgument` for switch implementation.
4. For information about logs, refer to the [Interprocess logging](../InterProcessLogging.md) guide.
4. For information about logs, refer to the [Interprocess logging](InterProcessLogging.md) guide.

### Run configurations for debugging the Engine process
