Repository indexer clogs with file with multi-byte character sets #7809

Closed
@guillep2k

  • Gitea version (or commit ref): release/v1.9
  • Git version: 2.22.0
  • Operating system: Linux - CentOS 7
  • Database (use [x]):
    • PostgreSQL
    • MySQL
    • MSSQL
    • SQLite
  • Can you reproduce the bug at https://try.gitea.io:
    • Yes (provide example URL)
    • No
    • Not relevant
  • Log gist:

Description

When using the repository indexer, files with multi-byte character sets are not indexed correctly. This happens when a byte sequence looks like the start of a valid UTF-8 code point but is not one. Once a bad sequence is encountered, the rest of the file is indexed as a single token. For example, if the file is 100KB and the bad sequence is in the middle of it, the indexer handles the first half of the file correctly and treats the rest as one "word" that is 50KB long (and certainly not searchable).
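To make the "looks like UTF-8 but is not" case concrete, here is a minimal Go sketch that locates the first invalid UTF-8 sequence in a byte slice. It uses only the standard `unicode/utf8` package. The helper name `firstInvalidUTF8` is made up for illustration and is not part of Gitea or its indexer.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// firstInvalidUTF8 (hypothetical helper) returns the byte offset of the
// first invalid UTF-8 sequence in b, or -1 if b is entirely valid UTF-8.
func firstInvalidUTF8(b []byte) int {
	for i := 0; i < len(b); {
		r, size := utf8.DecodeRune(b[i:])
		// DecodeRune reports an invalid sequence as (RuneError, 1).
		if r == utf8.RuneError && size == 1 {
			return i
		}
		i += size
	}
	return -1
}

func main() {
	utf8File := []byte("sailorvenus\náéíóú\nsailormoon\n")
	// The same text in Latin1: each accented letter is a single byte.
	latin1File := []byte("sailorvenus\n\xe1\xe9\xed\xf3\xfa\nsailormoon\n")

	fmt.Println(firstInvalidUTF8(utf8File))   // -1
	fmt.Println(firstInvalidUTF8(latin1File)) // 12
}
```

In the Latin1 bytes, `0xE1` is a valid UTF-8 lead byte for a three-byte sequence, but the byte that follows it is not a continuation byte, so decoding fails at offset 12, right after the first word.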

To reproduce this issue, files with the following content can be tested using the UTF-8 and Latin1 character sets:

sailorvenus
áéíóú
sailormoon

Note: to test properly, the files must be committed through git, not through Gitea's web interface.
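One way to produce the two test files with exact byte contents is a small Go program like the sketch below. The file names `sample-utf8.txt` and `sample-latin1.txt` and the helper `encodeLatin1` are assumptions chosen for illustration. The resulting files would then be added and committed with git as usual.

```go
package main

import (
	"fmt"
	"os"
)

// sample is the test content from this report. Go source is UTF-8,
// so this literal holds the UTF-8 encoding of the text.
const sample = "sailorvenus\náéíóú\nsailormoon\n"

// encodeLatin1 (hypothetical helper) converts s to ISO-8859-1 bytes.
// It fails for any rune above U+00FF, which Latin1 cannot represent.
func encodeLatin1(s string) ([]byte, error) {
	out := make([]byte, 0, len(s))
	for _, r := range s {
		if r > 0xFF {
			return nil, fmt.Errorf("rune %q has no Latin1 encoding", r)
		}
		out = append(out, byte(r))
	}
	return out, nil
}

func main() {
	latin1, err := encodeLatin1(sample)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// File names are placeholders; any names work.
	files := map[string][]byte{
		"sample-utf8.txt":   []byte(sample),
		"sample-latin1.txt": latin1,
	}
	for name, data := range files {
		if err := os.WriteFile(name, data, 0o644); err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
	}
}
```

Writing the bytes programmatically avoids depending on an editor's encoding settings, which can silently re-encode the file.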

Searching for sailorvenus returns results, since it is the first word in the file. In the Latin1-encoded file, the rest of the context is garbled.

Searching for sailormoon returns no results from the Latin1-encoded file, because the indexed content for the rest of the file is garbled.
