From 7cf3b5d65a73c1e040bf1bacba2737529015c971 Mon Sep 17 00:00:00 2001 From: Nikita Stroganov Date: Fri, 9 Jun 2023 18:13:54 +0300 Subject: [PATCH 1/3] Add the first version of taint analysis docs --- docs/TaintAnalysis.md | 464 ++++++++++++++++++++++++++++++++---------- 1 file changed, 351 insertions(+), 113 deletions(-) diff --git a/docs/TaintAnalysis.md b/docs/TaintAnalysis.md index 2e098c21f2..92f4fa422b 100644 --- a/docs/TaintAnalysis.md +++ b/docs/TaintAnalysis.md @@ -1,156 +1,228 @@ # Taint analysis -[//]: # (TODO) - -## Configuration - -There are four _rule types_: sources, passes, cleaners and sinks. - -The rule of each type contains -* a method name, -* a method description - -### Method name - -Fully qualified _method names_ can be written in one line. +## Introduction to the technique + +Taint analysis is a method that allows you to track the spread of unverified external data on the program. Getting such +data into critical sections of the code can lead to various vulnerabilities, including SQL injection, cross-site +scripting +(XSS) and many others. Attackers can use these vulnerabilities to disrupt the correct operation of the system, obtain +confidential data or conduct other unauthorized operations. Taint analysis helps to find the described mistakes at the +compilation stage. + +The key idea of the approach is that any variable that can be changed by an external user stores a potential security +threat. +If this variable is used in some expression, then the value of this expression also becomes suspicious. +Further, the algorithm tracks and notifies of situations when any of these marked variables are used in dangerous +command execution, +for example, direct queries to the database or the computer's operating system. + +For its work, taint analysis requires a configuration in which program methods can be assigned one of the following +roles. + +- Taint source — a source of unverified data. + For example, it can be a method for reading user input or a method for obtaining a parameter of an incoming HTTP + request. + The Taint source execution result is called marked. The semantics of the method determines which mark will be applied + to the variable. + Moreover, the name of the mark can be completely arbitrary, since it is chosen by the one who writes the + configuration. + For example, it can be set that the `getPassword()` method marks its return value with a "sensitive-data" mark. +- Taint pass — a function that marks the return value taking into account the marks in its arguments. + Depending on the implementation, marks can be applied not only to the result of the method but also to the object + `this`, and on the input parameters. For example, it can be set that the concatenation + method `concat strings(String a, String b)` + marks its result with all marks from `a` and from `b`. +- Taint cleaner — a function that removes a given set of marks from the passed arguments. + Most often, this is some kind of validation method that verifies that the user has entered data in the expected + format. + For example, the `validate Email(String email)` method removes the XSS mark upon successful completion of the checks, + because now there is no unverified data in the `email` that can lead to vulnerability of cross-site + scripting. +- Taint sink — a receiver, some critical section of the application. + It can be a method of direct access to the database or file system, as well as any other potentially dangerous + operation. + For the Taint sink, you can set a list of marks. Variables with specified marks should not leak into this taint sink. + For example, if a value marked "sensitive-data" is passed to the logging function, which prints its arguments directly + to the console, then this is a developer mistake, since data leaks. + +The taint analysis algorithm scans the data flow graph, trying to detect a route between a method from a set of Taint +sources +and a method from Taint sinks. The UnitTesBot taint analysis is implemented inside the symbolic engine +to avoid a large number of false positives. + +### Example + +Consider an example of a simple function in which there is an SQL injection vulnerability: +if an attacker enters the string `"name'); DROP TABLE Employees; --"` into the variable name, then he will be able +to delete the Employees table along with all the data that was stored in it. + +```java +class Example { + void example(Connection connection) { + Scanner sc = new Scanner(System.in); + String name = sc.nextLine(); + Statement stmt = connection.createStatement(); + String query = "INSERT INTO Employees(name) VALUES('" + .concat(name) + .concat("')"); + stmt.executeUpdate(query); + } +} +``` -One can also define a full _method name_ using nested structure: the package name is specified first, -then the class name appears, and finally there is the method name itself. +For taint analysis, you must set the configuration. -Note that regular expressions in names are not supported. +- Taint source is a `java.util.Scanner.nextLine` method that adds a "user—input" mark to the returned value. +- Taint pass is a `java.lang.String.concat` method that passes the "user—input" marks through itself received + either from the first argument or from the object on which this method is called (`this`). +- Taint sink is a `java.sql.Statement.executeUpdate` method that checks variables marked "user-input". -### Method description +Any correct implementation of the taint analysis algorithm should detect a mistake in this code: the variable with +the mark "user-input" is passed to the `executeUpdate` (sink). -Method description constitutes the essence of a rule. -Here are method descriptions for the rules of each type: sources, passes, cleaners and sinks. +It is important to note that the duties of the algorithm do not include detecting specific data that an attacker could +enter to harm the program. It is only necessary to discover the route connecting the source and the sink. -#### Sources +## UnitTestBot implementation -The `add-to` field specifies the objects to be marked. -You can specify only one value here or a whole list. +There is no unified configuration format for taint analysis in the world and all static analyzers describe their +own way of configuration. Thus, we are going to do the same thing: to come up with a configuration scheme +where it would be possible to describe the rules: sources, passes, cleaners and sinks. -Possible values are: -* `this` -* `arg1` -* `arg2` -* ... -* `return` +### Configuration -The `marks` field specifies the `marks` that should be added to the objects from the `add-to` list. You can also specify only one mark here or a whole list. - -Example: +The general structure of the configuration format (based on YAML) is presented below. ```yaml sources: - - com: - - abc: - - method1: - signature: [ _, _, ] - add-to: return - marks: xss - - method1: - signature: [ , _, ] - add-to: [ this, return ] - marks: sql-injection - - bca.method2: - add-to: return - marks: [ sensitive-data, xss ] - - org.example.method3: - add-to: [ this, arg1, arg3 ] - marks: sensitive-data + - + - + - <...> +passes: + - + - <...> +cleaners: + - + - <...> +sinks: + - + - <...> ``` -#### Passes +That is, these are just lists of rules related to a specific type. -The `get-from` field specifies the sources of `marks`. +Each rule has a certain set of characteristics. -The `add-to` field specifies the objects that will be marked if any element from `get-from` has any mark. +- The unique identifier of the method that this rule describes. + It consists of the full name of the method, including the package name and class name, + as well as the signature of the method — a set of types of its arguments (the `signature` key in YAML). +- Some `conditions` that must be met during execution for the rule to work. +- A set of `marks` that the rule uses. +- A set of specific mark management actions that occur when a rule is triggered (`add-to`, `get-from`, `remove-from`, or + `check`, depending on the semantics of the rule). -The `marks` field specifies the actual marks that can be passed from one object to another. - -For all the keys listed above, there can be only one value or a whole list of values. -This is also true for similar keys in the sections below. - -Example: +Thus, one rule, for example, taint source, may look like ```yaml -passes: - - com.abc.method2: - get-from: [ this, arg1, arg3 ] - add-to: [ arg2, return ] - marks: sensitive-data +com.abc.ClassName.methodName: + signature: [ , _, ] + conditions: + arg1: + not: [ -1, 1 ] + add-to: [ this, arg2, return ] + marks: user-input ``` -#### Cleaners +This rule is defined for a method named `methodName` from the `ClassName` class, which is in the `com.abc` package. +The method takes exactly 3 arguments, the first of which has the `int` type, the second can be anything, +and the last has the `java.lang.Object` type. +The `signature` key may not be specified, then any `methodName` overload is considered appropriate. -The `remove-from` field specifies the objects the `marks` should be removed from. +Triggering occurs only when the first argument (`arg1`) is not equal to either -1 or 1, +which is specified by the `conditions` key (the list is interpreted as a logical OR). +This parameter is optional, if it is not present, no conditions will be checked. -The `marks` field specifies the marks to be removed. +The described source adds a "user-input" mark on the variables corresponding to `this`, `arg2` and `return`. +In other words, to the class object `ClassName` on which methodName is called, the second argument of the function +and the return value. Moreover, the `add-to` and `marks` keys can contain both a list and a single value — this is done +for more convenient use. -Example: +The other types of rules have the same syntax as the source, except for the `add-to` key. -```yaml -cleaners: - - com.example.pack.method7: - remove-from: this - marks: [ sensitive-data, sql-injection ] -``` +- Taint pass transfers marks from one set of objects to another, so two keys are defined for it: + `get-from` and `add-to`, respectively. The mark specified in `marks` are added on `add-to` if there is one in `get-from`. +- Taint cleaner removes marks from objects, so its key is called `remove-from`. +- Taint sink checks for the presence of some marks in variables, which locates under the `check` key. -#### Sinks +### Configuration details -The `check` field specifies the objects that will be checked for `marks`. +Fully qualified method names can be written in one line or using nested structure: the package name is specified first, +then the class name appears, and finally, there is the method name itself. -When one of the `marks` is found in one of the objects from the `check`, the analysis will report the problem found. +```yaml +- com.abc.def.Example.example: ... +``` -Example: +or ```yaml -sinks: - - org.example: - - log: - check: arg1 - marks: sensitive-data - - method17: - check: [ arg1, arg3 ] - marks: [ sql-injection, xss ] +- com: + - abc.def: + - Example.example: ... ``` -*** -For all the rule types, method descriptions can optionally contain -* a method signature, -* runtime conditions. +Note that regular expressions in names are not supported yet. -**Signature** +--- -Method `signature` (optional) is a list of _argument types_ (at compile time): +The `add-to`, `get-from`, `remove-from`, and `check` fields specify the objects (or entities) to be marked. +You can specify only one value here or a whole list. -`signature: [ , _, ]` +Possible values are: -Please note that -- the type name is written in `<>`, -- `_` means any type. +- `this` +- `arg1` +- `arg2` +- ... +- `return` -**Conditions** +--- -Runtime `conditions` (optional) are conditions that must be met to trigger this rule. +The user can define arbitrary mark names or specify an empty list (`[]`) if he means all possible marks. -To check the conformity to conditions, you can set: -* the specific values of method arguments -* or their runtime types. +```yaml +passes: + - java.lang.String.concat: + get-from: this + add-to: return + marks: [ user-input, sensitive-data, my-super-mark ] +``` -Values can be set for the following types: `boolean`, `int`, `float` or `string`. +or -The full type name must be specified in the angle brackets `<>`. +```yaml +passes: + - java.lang.String.concat: + get-from: this + add-to: return + marks: [ ] # all possible marks +``` -If you do not specify `conditions`, the rule will be triggered by any call of the method. +--- -If several rules are suitable for one method call, they will all be applied in some kind of order. +To check the conformity to `conditions`, you can set: -The `conditions` field specifies runtime conditions for arguments (`arg1`, `arg2`, ...). Conditions can also be specified for `this` and `return` (if it makes sense). +- the specific values of method arguments +- their runtime types. -The rule is triggered if all the specified conditions are met. +Values can be set for the following types: `boolean`, `int`, `float` or `string` (and also `null` value for all nullable +types). + +The full type name must be specified in the angle brackets `<>`. -Example: +The `conditions` field specifies runtime conditions for arguments (`arg1`, `arg2`, ...). +Conditions can also be specified for `this` and `return` if it makes sense. +For sinks there is no sense to check some conditions for return value, so such functionality is not supported. ```yaml conditions: @@ -166,8 +238,6 @@ Values and types can be negated using the `not` key, as well as combined using l Nesting is allowed. -Example: - ```yaml conditions: this: [ "in", "out" ] # this should be equal to either "in" or "out" @@ -178,6 +248,12 @@ conditions: arg4: [ "", { not: } ] # should be an empty string or not a string at all ``` +--- + +If several rules are suitable for one method call, they will all be applied in some kind of order. + +--- + ### Overall example ```yaml @@ -228,9 +304,10 @@ sinks: marks: [ sql-injection, xss ] ``` -## Usage examples +### Usage examples -`java.lang.System.getenv` is a source of the "environment" mark. There are two overloads of this method: with one string parameter and no parameters at all. We want to describe only the first overload: +`java.lang.System.getenv` is a source of the "environment" mark. There are two overloads of this method: with one string +parameter and no parameters at all. We want to describe only the first overload: ```yaml sources: @@ -240,7 +317,8 @@ sinks: marks: environment ``` -`java.lang.String.concat` is a pass-through only if `this` is marked and not equal to `""`, or if `arg1` is marked and not equal to `""`: +`java.lang.String.concat` is a pass-through only if `this` is marked and not equal to `""`, or if `arg1` is marked and +not equal to `""`: ```yaml passes: @@ -259,6 +337,20 @@ sinks: marks: sensitive-data ``` +If you want to define `+` operator for strings as taint pass, you should write the following rules: + +```yaml +passes: + - java.lang.StringBuilder.append: + get-from: arg1 + add-to: this + marks: [ ] + - java.lang.StringBuilder.toString: + get-from: this + add-to: return + marks: [ ] +``` + `java.lang.String.isEmpty` is a cleaner only if it returns `true`: ```yaml @@ -270,7 +362,8 @@ sinks: marks: [ sql-injection, xss ] ``` -Suppose that the `org.example.util.unsafe` method is a sink for the "environment" mark if the second argument is an `Integer` and equal to zero: +Suppose that the `org.example.util.unsafe` method is a sink for the "environment" mark if the second argument is +an `Integer` and equal to zero: ```yaml sinks: @@ -281,7 +374,7 @@ Suppose that the `org.example.util.unsafe` method is a sink for the "environment check: arg2 marks: environment ``` - + The configuration above checks the type at compile-time, but sometimes we want to check the type at runtime: ```yaml @@ -293,5 +386,150 @@ The configuration above checks the type at compile-time, but sometimes we want t check: arg2 marks: environment ``` - + Perhaps explicit `and` for `conditions` will be added in the future. + +### Algorithm implementation details + +The main idea of the implemented approach is that each symbolic variable is associated with a taint vector — a 64-bit +value, each bit `i` of which is responsible for the presence of a mark with the number `i` in this object. +After that, during the symbolic execution, these mappings are maintained and updated in accordance with +the classical taint analysis algorithm. + +The implementation mostly affect the `Traverser` and `Memory` classes. Also, the new classes `TaintMarkRegistry` +and `TaintMarkManager`. The diagram below shows a high-level architecture of the taint module (actually, in the code +it was written a little differently, but to understand the idea, the diagram is greatly simplified). + +[//]: # (TODO: picture) + +The `TaintMarkRegistry` class stores a mapping between the mark name and its ordinal number from 0 to 63. +The number of marks is limited to 64. However, firstly, this is enough for almost any reasonable example, +and secondly, the decision was made due to performance issues — operations with the `Long` data type are +performed much faster than if a bit array was used. + +The TaintModel component (data classes at `org.utbot.taint.model`) is responsible for providing access +to the configuration. In particular, it defines a way to convert conditions (the value of the `conditions` key +in a YAML document) into logical expressions over symbolic variables. + +`Memory` stores the values of the taint bit-vectors for symbolic variables. Only simple methods were implemented there +(functions to update vectors and get them at the address of a symbolic object). +All the complex logic of adding and removing marks, based on taint analysis theory, +was written in a separate class `TaintMarkManager`. In other words, this class wraps low-level memory work into +domain-friendly operations. + +The information about the marked variables is updated during the `Traverser` work. Before each of the `invoke()` +instruction, which corresponds to the launch of some method in the user code, a special `Traverser.processTaintSink` +handler is called, and after the `invoke` instruction, the `Traverser.processTaintSource`, `Traverser.processTaintPass` +and `Traverser.processTaintCleaner` handlers are called. This order is because all rules, except sinks, +need the result of the function. At the same time, the transfer fact of tainted data occurs at the time of launching +the sink function, therefore, you can report the vulnerability found even before it is executed. + +The listed rule handlers get the configuration and perform the taint analysis semantics. The `processTaintSink` method +requests information from the `TaintMarkManager` about the marks already set and adds constraints to the SMT solver, +the satisfiability of which corresponds to the detection of a defect. The other handlers modify the symbolic `Memory` +through the `TaintMarkManager`, adding and removing marks from the selected symbolic variables. + +### Code generator modification + +The result of the UnitTestBot (in addition to the SARIF report) are unit tests. `CodeGenerator` is launched on each +found test case, and generates the test (as Java code). Moreover, the test which leads to throwing an unhandled exception +should not pass. The taint analysis errors are not real from the point of view of the language, since they are not +real exceptions. However, it is still needed to highlight such tests as failed, so the code generator was modified +in such a way that an artificial error was added at the end of each test, which would ensure a fall (the same strategy +was used in the integer overflow detection). + +```java +fail("'java.lang.String' marked 'user-input' was passed into 'Example.example' method"); +``` + +The solution allows to automatically get integration with SARIF reports and visualization of results +in the Problems view tab in IntelliJ IDEA. The found test case is treated as a real exception, +and all the necessary logic has already been written for them. + +### Example + +Consider the code below + +```java +public class User { + + String getLogin() { /* some logic */ } + + String getPassword() { /* some logic */ } + + String userInfo(String login, String password) { + return login + "#" + password; + } + + void printUserInfo() { + var login = getLogin(); + var password = getPassword(); + var info = userInfo(login, password); + System.out.println(info); + } +} +``` + +The `getPassword` method returns sensitive data that should never leak out of the application, but the programmer prints +them to the `stdout`, which is a serious mistake. First, we will write a configuration that expresses the thought said, +and save it to the file `./.idea/utbot-taint-config.yaml`, from where the analyzer can read it. + +```yaml +sources: + - User.getPassword: + add-to: return + marks: sensitive-data + +passes: + - User.userInfo: + get-from: [ arg1, arg2 ] + add-to: return + marks: [ ] # all + +sinks: + - java.io.PrintStream.println: + check: arg1 + marks: sensitive-data +``` + +Then we enable taint analysis in settings and run the UnitTestBot using the IntelliJ IDEA plugin. + +[//]: # (TODO: picture) + +Generated code: + +```java +public final class UserTest { + // some omitted code + + ///region SYMBOLIC EXECUTION: TAINT ANALYSIS for method printUserInfo() + + @Test + @DisplayName("printUserInfo: System.out.println(info) : True -> DetectTaintAnalysisError") + public void testPrintUserInfo_PrintStreamPrintln_1() { + User user = new User(); + + user.printUserInfo(); + fail("'java.lang.String' marked 'sensitive-data' was passed into 'PrintStream.println' method"); + } + + ///endregion +} +``` + +Also, we can see the detected problem on the Problems tab: + +[//]: # (TODO: picture) + +**A Brief explanation**: After executing the `getPassword` method, the symbol corresponding to the password variable +is marked as "sensitive-data" (a zero bit is set to 1 in its taint vector). After calling `userInfo`, the `info` +variable is also marked, since `userInfo` is a taint pass that adds to the return value all the marks collected from +both of its arguments. Before printing `info` to the console, the `processTaintSink` handler function adds a constraint +to the SMT solver, the satisfiability of which corresponds to throwing our artificial error. The logical formula for +this path is satisfiable, so the analyzer reports an error detected, which we eventually observe. + +### Unit tests + +Taint analysis unit tests are at the `./utbot-framework-test/src/test/kotlin/org/utbot/examples/taint/` +directory. Tests use examples at the `utbot-sample/src/main/java/org/utbot/examples/taint`. Each example has its own +configuration file stored at the `utbot-sample/src/main/resources/taint`. From de4e40350661b735d3c38929d0bf905d48252ca5 Mon Sep 17 00:00:00 2001 From: Nikita Stroganov <54814796+IdeaSeeker@users.noreply.github.com> Date: Fri, 9 Jun 2023 18:19:45 +0300 Subject: [PATCH 2/3] Add pictures to taint analysis docs --- docs/TaintAnalysis.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/TaintAnalysis.md b/docs/TaintAnalysis.md index 92f4fa422b..0e19d10d83 100644 --- a/docs/TaintAnalysis.md +++ b/docs/TaintAnalysis.md @@ -400,7 +400,7 @@ The implementation mostly affect the `Traverser` and `Memory` classes. Also, the and `TaintMarkManager`. The diagram below shows a high-level architecture of the taint module (actually, in the code it was written a little differently, but to understand the idea, the diagram is greatly simplified). -[//]: # (TODO: picture) + The `TaintMarkRegistry` class stores a mapping between the mark name and its ordinal number from 0 to 63. The number of marks is limited to 64. However, firstly, this is enough for almost any reasonable example, @@ -494,7 +494,7 @@ sinks: Then we enable taint analysis in settings and run the UnitTestBot using the IntelliJ IDEA plugin. -[//]: # (TODO: picture) + Generated code: @@ -519,9 +519,9 @@ public final class UserTest { Also, we can see the detected problem on the Problems tab: -[//]: # (TODO: picture) + -**A Brief explanation**: After executing the `getPassword` method, the symbol corresponding to the password variable +**A brief explanation**: After executing the `getPassword` method, the symbol corresponding to the password variable is marked as "sensitive-data" (a zero bit is set to 1 in its taint vector). After calling `userInfo`, the `info` variable is also marked, since `userInfo` is a taint pass that adds to the return value all the marks collected from both of its arguments. Before printing `info` to the console, the `processTaintSink` handler function adds a constraint From 91b8dbd4386f252c6ed4d4794dd3e3ff63cb8adb Mon Sep 17 00:00:00 2001 From: Nikita Stroganov Date: Fri, 9 Jun 2023 18:42:36 +0300 Subject: [PATCH 3/3] fix typos --- docs/TaintAnalysis.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/docs/TaintAnalysis.md b/docs/TaintAnalysis.md index 0e19d10d83..dea7d876e6 100644 --- a/docs/TaintAnalysis.md +++ b/docs/TaintAnalysis.md @@ -30,12 +30,12 @@ roles. - Taint pass — a function that marks the return value taking into account the marks in its arguments. Depending on the implementation, marks can be applied not only to the result of the method but also to the object `this`, and on the input parameters. For example, it can be set that the concatenation - method `concat strings(String a, String b)` + method `concat(String a, String b)` marks its result with all marks from `a` and from `b`. - Taint cleaner — a function that removes a given set of marks from the passed arguments. Most often, this is some kind of validation method that verifies that the user has entered data in the expected format. - For example, the `validate Email(String email)` method removes the XSS mark upon successful completion of the checks, + For example, the `validateEmail(String email)` method removes the XSS mark upon successful completion of the checks, because now there is no unverified data in the `email` that can lead to vulnerability of cross-site scripting. - Taint sink — a receiver, some critical section of the application. @@ -72,8 +72,8 @@ class Example { For taint analysis, you must set the configuration. -- Taint source is a `java.util.Scanner.nextLine` method that adds a "user—input" mark to the returned value. -- Taint pass is a `java.lang.String.concat` method that passes the "user—input" marks through itself received +- Taint source is a `java.util.Scanner.nextLine` method that adds a "user-input" mark to the returned value. +- Taint pass is a `java.lang.String.concat` method that passes the "user-input" marks through itself received either from the first argument or from the object on which this method is called (`this`). - Taint sink is a `java.sql.Statement.executeUpdate` method that checks variables marked "user-input". @@ -121,7 +121,7 @@ Each rule has a certain set of characteristics. - A set of specific mark management actions that occur when a rule is triggered (`add-to`, `get-from`, `remove-from`, or `check`, depending on the semantics of the rule). -Thus, one rule, for example, taint source, may look like +Thus, one rule, for example, taint source, can look like ```yaml com.abc.ClassName.methodName: @@ -213,7 +213,7 @@ passes: To check the conformity to `conditions`, you can set: - the specific values of method arguments -- their runtime types. +- their runtime types Values can be set for the following types: `boolean`, `int`, `float` or `string` (and also `null` value for all nullable types).