Description
- Gitea version (or commit ref): release/v1.9
- Git version: 2.22.0
- Operating system: Linux - CentOS 7
- Database (use [x]):
- PostgreSQL
- MySQL
- MSSQL
- SQLite
- Can you reproduce the bug at https://try.gitea.io:
- Yes (provide example URL)
- No
- Not relevant
- Log gist:
Description
When using the repository indexer, files that contain non-ASCII characters in an encoding other than UTF-8 (for example Latin1) are not indexed correctly. This happens when a byte sequence looks like the start of a valid UTF-8 code point but is not one. Once a bad sequence is encountered, the rest of the file is indexed as a single token; e.g. if the file is 100KB and the bad sequence is in the middle of it, the indexer gets the first half of the file OK and the rest as one "word" that is 50KB long (and certainly not searchable).
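To illustrate what a "bad sequence" means here, the following is a minimal sketch using only Go's standard `unicode/utf8` package. It is not Gitea's indexer code; the helper name `firstInvalidUTF8` is made up for this example. It shows that the Latin1 byte for `á` (0xE1) looks like the lead byte of a three-byte UTF-8 sequence, but the byte that follows is not a continuation byte, so decoding fails at that offset.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// firstInvalidUTF8 (hypothetical helper) returns the byte offset of the first
// byte that does not start a valid UTF-8 sequence, or -1 if data is valid UTF-8.
func firstInvalidUTF8(data []byte) int {
	for i := 0; i < len(data); {
		r, size := utf8.DecodeRune(data[i:])
		// DecodeRune reports an invalid encoding as (RuneError, 1).
		if r == utf8.RuneError && size == 1 {
			return i
		}
		i += size
	}
	return -1
}

func main() {
	// Same text twice: once as UTF-8, once with the vowels as raw Latin1 bytes.
	utf8Text := []byte("sailorvenus\náéíóú\nsailormoon\n")
	latin1Text := []byte("sailorvenus\n\xe1\xe9\xed\xf3\xfa\nsailormoon\n")

	fmt.Println(utf8.Valid(utf8Text), firstInvalidUTF8(utf8Text))     // true -1
	fmt.Println(utf8.Valid(latin1Text), firstInvalidUTF8(latin1Text)) // false 12
}
```

Offset 12 is the first byte after `sailorvenus` and its newline, which matches the point where the indexed content starts to go wrong.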
To reproduce this issue, create two files with the following content, one encoded as UTF-8 and one as Latin1:
sailorvenus
áéíóú
sailormoon
Note: to test properly, the files must be committed through git, not through Gitea's web interface.
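If it helps, the two test files can be generated with a small program like the one below. This is only a sketch: the file names `sample-utf8.txt` and `sample-latin1.txt` and the helper `toLatin1` are placeholders chosen for this example. The conversion relies on the fact that Latin1 (ISO-8859-1) maps every code point up to U+00FF to the byte with the same value.

```go
package main

import (
	"fmt"
	"os"
)

// toLatin1 (hypothetical helper) encodes s as ISO-8859-1. Every code point
// up to U+00FF becomes the byte of the same value; anything else is an error.
func toLatin1(s string) ([]byte, error) {
	out := make([]byte, 0, len(s))
	for _, r := range s {
		if r > 0xFF {
			return nil, fmt.Errorf("rune %q has no Latin1 encoding", r)
		}
		out = append(out, byte(r))
	}
	return out, nil
}

func main() {
	const content = "sailorvenus\náéíóú\nsailormoon\n"

	// Go source files are UTF-8, so the string bytes are already UTF-8.
	if err := os.WriteFile("sample-utf8.txt", []byte(content), 0o644); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	latin1, err := toLatin1(content)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if err := os.WriteFile("sample-latin1.txt", latin1, 0o644); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

After running it inside a clone of the test repository, add both files with `git add`, then `git commit` and `git push` them so they reach Gitea through git rather than the web editor.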
Searching for sailorvenus returns results from both files, as it is the first word. In the Latin1-encoded file, however, the rest of the context shown is garbled.
Searching for sailormoon doesn't return results from the Latin1-encoded file, as the indexing for the rest of the file is garbled: