diff --git a/README.md b/README.md
index 871842de..d2fbd910 100644
--- a/README.md
+++ b/README.md
@@ -1,9 +1,10 @@
-# CodeFuse-Query A Data-Centric Static Code Analysis System
-
+# CodeFuse-Query: A Data-Centric Static Code Analysis System
+

-
+
-
+
+
@@ -22,9 +23,11 @@
-
-
-[中文文档](./README_zh.md)
+
+
+ [[中文]](README_cn.md) | [**English**]
+
+
## What is CodeFuse-Query?
In the domain of large-scale software development, the demands for dynamic and multifaceted static code analysis exceed the capabilities of traditional tools. To bridge this gap, we present CodeFuse-Query, a system that redefines static code analysis through the fusion of Domain Optimized System Design and Logic Oriented Computation Design.
@@ -66,12 +69,12 @@ Note: The maturity level of the language status is determined based on the types
[Installation, Configuration, and Running](./doc/3_install_and_run.md)
## Documentation
-- [Abstract](./doc/1_abstract.md)
-- [Introduction](./doc/2_introduction.md)
+- [Abstract](./doc/1_abstract.en.md)
+- [Introduction](./doc/2_introduction.en.md)
- [User Case](./doc/user_case.en.md)
-- [Installation, Configuration, and Running](./doc/3_install_and_run.md)
-- [GödelScript Query Language](./doc/4_godelscript_language.md)
-- [Developing Plugins (VSCode)](./doc/5_toolchain.md)
+- [Installation, Configuration, and Running](./doc/3_install_and_run.en.md)
+- [GödelScript Query Language](./doc/4_godelscript_language.en.md)
+- [Developing Plugins (VSCode)](./doc/5_toolchain.en.md)
- [COREF API](https://codefuse-ai.github.io/CodeFuse-Query/godel-api/coref_library_reference.html)
## Tutorial
diff --git a/README_zh.md b/README_cn.md
similarity index 98%
rename from README_zh.md
rename to README_cn.md
index f1a4da8f..30bcee63 100644
--- a/README_zh.md
+++ b/README_cn.md
@@ -3,7 +3,8 @@
-
+
+
@@ -22,7 +23,11 @@
-
+
+
+ [**中文**] | [English](README.md)
+
+
[English Documentation](./README.md)
diff --git a/doc/1_abstract.en.md b/doc/1_abstract.en.md
new file mode 100644
index 00000000..2a873008
--- /dev/null
+++ b/doc/1_abstract.en.md
@@ -0,0 +1,17 @@
+# Abstract
+With the increasing popularity of large-scale software development, the demand for scalable and adaptable static code analysis techniques is growing. Traditional static analysis tools such as Clang Static Analyzer (CSA) or PMD have shown good results in checking programming rules or style issues. However, these tools are often designed for specific objectives and are unable to meet the diverse and changing needs of modern software development environments. These needs may relate to Quality of Service (QoS), various programming languages, different algorithmic requirements, and various performance needs. For example, a security team might need sophisticated algorithms like context-sensitive taint analysis to review smaller codebases, while project managers might need a lighter algorithm, such as one that calculates cyclomatic complexity, to measure developer productivity on larger codebases.
+
+These diversified needs, coupled with the common computational resource constraints in large organizations, pose a significant challenge. Traditional tools, with their problem-specific computation methods, often fail to scale in such environments. This is why we introduced CodeQuery, a centralized data platform specifically designed for large-scale static analysis.
+In implementing CodeQuery, we treat source code and analysis results as data, and the execution process as big data processing, a significant departure from traditional tool-centric approaches. We leverage common systems in large organizations, such as data warehouses, data computation facilities like MaxCompute and Hive, OSS object storage, and flexible computing resources like Kubernetes, allowing CodeQuery to integrate seamlessly into these systems. This approach makes CodeQuery highly maintainable and scalable, capable of supporting diverse needs and effectively addressing changing demands. Furthermore, CodeQuery's open architecture encourages interoperability between various internal systems, facilitating seamless interaction and data exchange. This level of integration and interaction not only increases the degree of automation within the organization but also improves efficiency and reduces the likelihood of manual errors. By breaking down information silos and fostering a more interconnected, automated environment, CodeQuery significantly enhances the overall productivity and efficiency of the software development process.
+Moreover, CodeQuery's data-centric approach offers unique advantages when addressing domain-specific challenges in static source code analysis. For instance, source code is typically a highly structured and interconnected dataset, with strong informational and relational ties to other code and configuration files. By treating code as data, CodeQuery can adeptly handle these issues, making it especially suitable for large organizations where codebases evolve continuously but incrementally: most code sees only minor changes each day and otherwise remains stable. CodeQuery also supports use cases like code-data-based Business Intelligence (BI), generating reports and dashboards to aid in monitoring and decision-making processes. Additionally, CodeQuery plays an important role in analyzing training data for large language models (LLMs), providing deep insights to enhance the overall effectiveness of these models.
+
+In the current field of static analysis, CodeQuery introduces a new paradigm. It not only meets the needs of analyzing large, complex codebases but is also adaptable to the ever-changing and diversified scenarios of static analysis. CodeQuery's data-centric approach gives it a unique advantage in dealing with code analysis issues in big data environments. Designed to address static analysis problems in large-scale software development settings, it views both source code and analysis results as data, allowing it to integrate flexibly into various systems within large organizations. This approach not only enables efficient handling of large codebases but can also accommodate various complex analysis needs, thereby making static analysis work more effective and accurate.
+
+The characteristics and advantages of CodeQuery can be summarized as follows:
+
+- **Highly Scalable**: CodeQuery can handle large codebases and adapt to different analysis needs. This high level of scalability makes CodeQuery particularly valuable in large organizations.
+- **Data-Centric**: By treating source code and analysis results as data, CodeQuery's data-centric approach gives it a distinct edge in addressing code analysis problems in big data environments.
+- **Highly Integrated**: CodeQuery can integrate seamlessly into various systems within large organizations, including data warehouses, data computation facilities, object storage, and flexible computing resources. This high level of integration makes the use of CodeQuery in large organizations more convenient and efficient.
+- **Supports Diverse Needs**: CodeQuery can process large codebases and accommodate various complex analysis needs, including QoS analysis, cross-language analysis, algorithmic needs, and performance requirements.
+
+CodeQuery is a powerful static code analysis platform, suitable for large-scale, complex codebase analysis scenarios. Its data-centric approach and high scalability give it a unique advantage in the modern software development environment. As static code analysis technology continues to evolve, CodeQuery is expected to play an increasingly important role in this field.
\ No newline at end of file
diff --git a/doc/1_abstract.md b/doc/1_abstract.md
index d21dc424..8a825676 100644
--- a/doc/1_abstract.md
+++ b/doc/1_abstract.md
@@ -1,3 +1,4 @@
+# 引言
随着大规模软件开发的普及,对可扩展且易于适应的静态代码分析技术的需求正在加大。传统的静态分析工具,如 Clang Static Analyzer (CSA) 或 PMD,在检查编程规则或样式问题方面已经展现出了良好的效果。然而,这些工具通常是为了满足特定的目标而设计的,往往无法满足现代软件开发环境中多变和多元化的需求。这些需求可以涉及服务质量 (QoS)、各种编程语言、不同的算法需求,以及各种性能需求。例如,安全团队可能需要复杂的算法,如上下文敏感的污点分析,来审查较小的代码库,而项目经理可能需要一种相对较轻的算法,例如计算圈复杂度的算法,以在较大的代码库上测量开发人员的生产力。
这些多元化的需求,加上大型组织中常见的计算资源限制,构成了一项重大的挑战。由于传统工具采用的是问题特定的计算方式,往往无法在这种环境中实现扩展。因此,我们推出了 CodeQuery,这是一个专为大规模静态分析设计的集中式数据平台。
diff --git a/doc/2_introduction.en.md b/doc/2_introduction.en.md
new file mode 100644
index 00000000..768f4891
--- /dev/null
+++ b/doc/2_introduction.en.md
@@ -0,0 +1,109 @@
+# Introduction
+CodeFuse-Query is a code data platform that supports structured analysis of various programming languages. The core idea is to transform all code into data using various language parsers and to store this data in a structured format within a code database. Data analysis is then performed according to business needs using a custom query language, as shown in the diagram below:
+
+
+## 2.1 Architecture of CodeFuse-Query
+Overall, the CodeFuse-Query code data platform is divided into three main parts: the code data model, the code query DSL (Domain-Specific Language), and platform productization services. The main workflow is illustrated in the following diagram:
+
+
+### Code Datafication and Standardization: COREF
+We have defined a model for code datafication and standardization called COREF, which requires all code to be converted to this model through various language extractors.
+COREF mainly includes the following information:
+**COREF** = AST (Abstract Syntax Tree) + ASG (Abstract Semantic Graph) + CFG (Control Flow Graph) + PDG (Program Dependence Graph) + Call Graph + Class Hierarchy + Documentation (documentation and comment information)
+Note: As the computational complexity of each type of information varies, not all languages' COREF information includes all of the above. The basic information mainly includes AST, ASG, Call Graph, Class Hierarchy, and Documentation, while other information (CFG and PDG) is still under development and will be gradually supported.
+### Code Query DSL
+Based on the generated COREF code data, CodeFuse-Query uses a custom DSL called **Gödel** for querying, thereby fulfilling code analysis requirements.
+Gödel is a logic-based reasoning language, whose underlying implementation is based on the logical reasoning language Datalog. By describing "facts" and "rules," the program can continuously derive new facts. Gödel is also a declarative language, focusing more on describing "what is needed" and leaving the implementation to the computational engine.
+Since code has already been converted to relational data (COREF data stored in the form of relational tables), one might wonder why not use SQL directly, or use an SDK, instead of learning a new DSL. The reason is that Datalog's computation is monotonic and guaranteed to terminate. Simply put, Datalog sacrifices expressiveness to achieve higher performance, and Gödel inherits this feature.
+
+- Compared to SDKs, Gödel's main advantage is its ease of learning and use. As a declarative language, users do not need to focus on intermediate computations and can simply describe their needs as they would with SQL.
+- Compared to SQL, Gödel's advantages are stronger descriptive capabilities and faster computation: for example, it can describe recursive algorithms and multi-table join queries, which are difficult to express in SQL.
+### Platformization and Productization
+CodeFuse-Query includes the **Sparrow CLI** and the online service **Query Centre**. Sparrow CLI contains all components and dependencies, such as extractors, data models, compilers, etc., and users can completely generate and query code data locally using Sparrow CLI (for how to use Sparrow CLI, please see Section 3: Installation, Configuration, Running). If users have online query needs, they can use the Query Centre to experiment.
+## 2.2 Languages Supported by CodeFuse-Query for Analysis
+As of October 31, 2023, CodeFuse-Query supports data analysis for 11 programming languages. Among these, support for 5 languages (Java, JavaScript, TypeScript, XML, Go) is very mature, while support for the remaining 6 languages (Objective-C, C++, Python3, Swift, SQL, Properties) is in beta and has room for further improvement. The specific support status is shown in the table below:
+
+| Language | Status | Number of Nodes in the COREF Model |
+| ------------- | ------ | ---------------------------------- |
+| Java | Mature | 162 |
+| XML | Mature | 12 |
+| TS/JS | Mature | 392 |
+| Go | Mature | 40 |
+| OC/C++ | Beta | 53/397 |
+| Python3 | Beta | 93 |
+| Swift | Beta | 248 |
+| SQL | Beta | 750 |
+| Properties | Beta | 9 |
+
+Note: The maturity level of the language status above is determined based on the types of information included in COREF and the actual implementation. Except for OC/C++, all languages support complete AST information and Documentation. Some languages go further; for example, COREF for Java also supports ASG, Call Graph, Class Hierarchy, and some CFG information.
+## 2.3 Use Cases of CodeFuse-Query
+### Querying Code Features
+A developer wants to know which String type variables are used in Repo A, so they write a Gödel script as follows and submit it to the CodeFuse-Query system for results.
+```rust
+// script
+use coref::java::*
+
+fn out(var: string) -> bool {
+    for(v in Variable(JavaDB::load("coref_java_src.db"))) {
+        if (v.getType().getName() = "String" && var = v.getName()) {
+            return true
+        }
+    }
+}
+
+fn main() {
+    output(out())
+}
+```
+Similar needs: Queries for classes, functions, variables, return values, call graphs, class inheritance, etc.
+
+### Outputting Static Analysis Capabilities
+A security team member sets up **a system** to cross-verify that log data and code data are consistent. To complete an analysis task, they plan to derive static data D1 through Gödel queries, merge it with dynamic data D2, and analyze the combined data to reach conclusion C. After verifying the technical feasibility on CodeFuse-Query, they integrate the system using the standard API provided by CodeFuse-Query.
+Similar needs: Using static analysis as a system checkpoint, improving testing efficiency, merging the analyzed data into documentation.
+### Code Rule Checker
+A team lead finds that the team repeatedly introduces the same kind of bug (Bug A) **and decides to establish a code rule and its checker** to be applied during code review. After writing an analysis query on the CodeFuse-Query platform and confirming that it meets requirements, they codify the query as a code rule and roll it out to the code review/CI phase. Since then, this bug has not recurred.
+Similar needs: Writing static defect scanning rules to intercept code risks.
+### Analyzing Code Characteristics
+A developer from the R&D department wants to know the current proportion of Spring and Spring Boot projects in the code repository to quantify the rollout of the new framework. By writing a Gödel query that describes the features of each project type, they **query 110,000 code repositories at once** and obtain all the code data within a few dozen minutes, happily moving on to their KPIs.
+Similar needs: Application profiling, code profiling, architectural analysis.
+### Getting Statistical Data
+A researcher finds that traditional code complexity metrics struggle to accurately measure the complexity of code. Drawing on advanced international practice and a moment of insight, they design a set of complexity metrics and algorithms. After implementing it with Gödel and finding it already highly performant with little optimization, they quickly apply it to over 10 languages and more than 110,000 repositories. They now have an in-depth understanding of the overall complexity of the code repositories and, unlike before, no longer have to parse the code and analyze the syntax tree themselves, **which is far more convenient**.
+Similar needs: Code statistics, code metrics, algorithm design, academic research.
+### Architectural Analysis
+An architect has recently been promoting a new message middleware based on txt files, but existing analysis platforms cannot analyze dependencies in such systems. By quickly modeling the message format with Gödel, they soon obtain the dependency relationships between different components in the system.
+Similar needs: System overview, architecture governance, lineage analysis.
+### Model Validation
+A developer designs a system that requires users to play games before claiming coupons. They describe **the model's validation logic** with Gödel, then use the CodeFuse-Query system to **ensure that both current and future system implementations** fully comply with the model. They no longer need to worry about potential financial losses from the game!
+Similar needs: System verification, network validation, permission verification.
+## 2.4 Application Areas of CodeFuse-Query
+At Ant Group, CodeFuse-Query is already deployed in scenarios including **CodeFuse large language model data cleaning**, **code metrics evaluation**, **R&D risk control**, **privacy security analysis**, **code intelligence**, and **terminal package size management**, serving over a million calls per month.
+
+
+### High-Quality Code Data Cleaning - CodeFuse Large Code Model
+The CodeFuse Large Code Model is a model by Ant Group for handling code-related issues and has been open-sourced. For the CodeFuse large language model, the quality of the training data directly affects the model's inference results. Low-quality code data can directly contaminate the language model's output, for example: the model might learn incorrect code patterns, generating erroneous code; if the data only contains code in a single programming language, the model might not adapt well to code in other languages.
+To control the quality of code data entering the model and thereby improve the model's inference capabilities, we have drawn upon the Ant Group program analysis team's years of practical experience, coupled with industry consensus, to clarify the definition of high-quality code. We have also implemented automated, large-scale code data cleaning using existing program analysis technologies.
+CodeFuse-Query provides the following data cleaning capabilities for the CodeFuse Large Code Model:
+
+- High-quality code data cleaning: Cleans code data, including vulnerability scanning for 7 languages (Python, Java, JavaScript, TypeScript, Go, C, C++), filtering by language and star count, filtering out data with 0 valid lines of code, etc. We have currently accumulated about **2TB** of cleaned code data from GitHub and from internal sources at Ant Group.
+- Code Profiling: Implements high-performance, multi-dimensional automatic tagging for large-scale code, supporting **10** languages (Java, Scala, Kotlin, JavaScript, JSX, TypeScript, TSX, Vue, Python, Go), **77** common tags, **40** Ant-specific tags, totaling **117** tags. The current auto-tagging performance can reach **40MB/s**.
+- Other Atomic Abilities
+  - Advanced code feature extraction, including extraction of AST (Abstract Syntax Tree), DFG (Data Flow Graph), etc. The AST information has been used for SFT training with about 97% accuracy.
+  - Code snippet identification, used for extracting code from text data, convenient for formatting or adding Markdown:
+    - Text extraction of code: Extracting code block information from text and parsing the main languages, function definitions, and class definitions. Verification covers only the binary question of whether the text contains code blocks, with about 83% accuracy.
+    - Identifying the programming language of a code snippet: Identifying the programming language of any code snippet, supporting 30+ languages, with about 80% accuracy.
+  - Code comment pair extraction: Supports extracting method-level comment-code pair information, covering the **15** most popular languages on GitHub, used for Text To Code/Code To Text SFT training.
+### Code Data Metrics - Guangmu
+Guangmu is an internal product at Ant Group aimed at different R&D personnel and team managers, providing objective data and analysis results to assess code capabilities.
+Guangmu offers individual code capability assessment reports, daily code capability metric data analysis, team code capability management, and code excellence award displays, all aimed at helping Ant Group's R&D engineers continuously improve code quality, reduce code debt, and enhance R&D efficiency in the long run.
+CodeFuse-Query provides Guangmu with two types of capabilities:
+
+- Code Evaluation Metrics: Code complexity, code comment rate, standard development volume, etc.
+- Code Excellence Metrics: Code reuse degree.
+### Change Analysis - Youku Server-Side R&D Efficiency
+The Youku Quality Assurance team started exploring server-side precision testing in 2023. After six months of technical groundwork and system building, they established a precision testing system capable of **change content identification, change impact analysis, testing capability recommendation, and test coverage assessment**.
+In this process, CodeFuse-Query can provide capabilities including:
+
+- Analyzing the impacted objects based on code change content (file + line number): methods, entry points (HTTP entry, HSF entry), call routes (all call routes from the entry to the changed method), database operations (tables, types of operations).
+- Enhancing the effectiveness and readiness of change impact analysis by combining online dynamic call routes (method routes) with the static call routes analyzed by CodeFuse-Query.
+
+To date, Youku has integrated all of its core applications with CodeFuse-Query and, based on the collected static analysis data, has built a complete server-side code and traffic knowledge base.
\ No newline at end of file
diff --git a/doc/2_introduction.md b/doc/2_introduction.md
index bf4141c2..31267695 100644
--- a/doc/2_introduction.md
+++ b/doc/2_introduction.md
@@ -1,3 +1,4 @@
+# 概述
CodeFuse-Query 是一个支持对 **各种编程语言** 进行 **结构化分析** 的 **代码数据平台**。核心思想是利用各种语言解析器将所有代码转化为数据,并将其结构化存储到代码数据库中。通过使用自定义查询语言,按照业务需求进行数据分析。如下图所示:

diff --git a/doc/3_install_and_run.en.md b/doc/3_install_and_run.en.md
new file mode 100644
index 00000000..0a29681d
--- /dev/null
+++ b/doc/3_install_and_run.en.md
@@ -0,0 +1,166 @@
+# Installation, Configuration, and Running
+
+## Hardware and Software Requirements
+
+- Hardware: 4 CPU cores and 8 GB of memory (4C8G)
+
+- Environment Requirements: Java 1.8 and Python 3.8 or above runtime environments. Please ensure Java and Python executables are available.
+
+## Sparrow Installation Steps and Guidance
+
+- The CodeFuse-Query download package is a zip archive that contains tools, scripts, and various files specific to CodeFuse-Query. If you do not have a CodeFuse-Query license, downloading this archive indicates your agreement with the [CodeFuse-Query Terms and Conditions](../LICENSE).
+- CodeFuse-Query is currently only supported on Mac and Linux systems. The download links are as follows (currently only a sample is given; the official download link will be provided after the open-source release):
+ - Mac: [CodeFuse-Query 2.0.0](https://github.com/codefuse-ai/CodeFuse-Query/releases/tag/2.0.0)
+ - Linux: [CodeFuse-Query 2.0.0](https://github.com/codefuse-ai/CodeFuse-Query/releases/tag/2.0.0)
+- You should always use the CodeFuse-Query bundle to ensure version compatibility.
+
+### Tips:
+
+- On Mac systems, directly downloading the package may prompt a verification for the developer.
+
+
+
+- You can modify the verification in the security settings.
+
+
+
+- Click "Allow Anyway."
+
+- For detailed steps, please refer to the [Mac Official Documentation: How to safely open an app on your Mac](https://support.apple.com/zh-cn/HT202491)
+
+- Or use the `xattr -d com.apple.quarantine` command to remove the quarantine extended attribute that macOS assigns to CodeFuse-Query.
+
+- `xattr -d com.apple.quarantine` is a command-line instruction used to delete a file's `com.apple.quarantine` extended attribute. This attribute is used by the macOS system to mark files or applications downloaded from external sources to ensure security.
+
+```bash
+xattr -d com.apple.quarantine path/to/file
+```
+
+## Configuring and Initializing the CodeFuse-Query Development Environment
+
+- Unzip the archive from the command line or by double-clicking it.
+
+- You need to have Java 8 and Python 3.8 or higher runtime environments.
+
+- After unzipping CodeFuse-Query, you can run the Sparrow process by running the executable in the following ways:
+
+- By executing `/sparrow-cli/sparrow`, where `` is the folder where you extracted the CodeFuse-Query package.
+
+- By adding `/sparrow-cli` to your PATH, so you can directly run the executable `sparrow`.
+
+At this point, you can execute the `sparrow` command.
+
+## Running
+
+### Execution Steps
+
+- Confirm the source code directory you need to query.
+
+- Extract code data from the source code.
+
+- Write a Gödel script based on the code data to obtain the desired code data.
+
+- For how to write Gödel scripts, refer to [GödelScript Query Language](./4_godelscript_language.en.md)
+
+### Execution Example
+
+#### Data Extraction
+```bash
+/sparrow-cli/sparrow database create -s -lang -o