Repository indexer clogs with file with multi-byte character sets #7809

- Gitea version (or commit ref): release/v1.9
- Git version: 2.22.0
- Operating system: Linux - CentOS 7
- Database (use `[x]`):
  - [x] PostgreSQL
  - [ ] MySQL
  - [ ] MSSQL
  - [ ] SQLite
- Can you reproduce the bug at https://try.gitea.io:
  - [ ] Yes (provide example URL)
  - [x] No
  - [ ] Not relevant
- Log gist:

## Description

When using the repository indexer, files with multi-byte character sets are not indexed correctly. This happens when a file contains bytes that look like valid UTF-8 sequences but are not. Once a bad sequence is encountered, the rest of the file is indexed as **a single token**; e.g. if the file is 100KB and the bad sequence is in the middle of it, the indexer handles the first half of the file correctly and treats the rest as one "word" that is 50KB long (and certainly not searchable).
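To illustrate why Latin1 text can look like broken UTF-8, here is a minimal Python sketch. It is not Gitea's code, and the helper name `is_valid_utf8` is hypothetical. It encodes the accented letters from the sample below both ways and checks which byte string decodes cleanly as UTF-8. In Latin1 the first letter is the single byte `0xE1`, which UTF-8 reads as the start of a three-byte sequence, but the bytes that follow are not continuation bytes.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The accented line from the sample file, written with escapes so this
# script stays ASCII-only: a, e, i, o, u, each with an acute accent.
sample = "\u00e1\u00e9\u00ed\u00f3\u00fa"

utf8_bytes = sample.encode("utf-8")      # two bytes per accented letter
latin1_bytes = sample.encode("latin-1")  # one byte per accented letter

print("UTF-8  bytes:", utf8_bytes.hex(), "valid UTF-8:", is_valid_utf8(utf8_bytes))
print("Latin1 bytes:", latin1_bytes.hex(), "valid UTF-8:", is_valid_utf8(latin1_bytes))
```

Whether the indexer hits exactly this decoding failure is an assumption; the sketch only shows that the Latin1 bytes are not well-formed UTF-8.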

To reproduce this issue, files with the following content can be tested using the UTF-8 and Latin1 character sets:

```
sailorvenus
áéíóú
sailormoon
```

**Note:** to test properly, the files must be committed through git, not Gitea's web interface.
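One way to produce the two test files is the short Python sketch below. The file names `sample-utf8.txt` and `sample-latin1.txt` and the function `write_samples` are hypothetical, chosen only for this example. It writes the same three lines once as UTF-8 and once as Latin1, using raw bytes so that no newline or encoding translation happens on the way to disk.

```python
from pathlib import Path

# Sample content from this report; escapes keep the script itself ASCII-only.
CONTENT = "sailorvenus\n\u00e1\u00e9\u00ed\u00f3\u00fa\nsailormoon\n"


def write_samples(directory):
    """Write the sample as UTF-8 and as Latin1; return both paths.

    The file names are hypothetical, chosen only for this sketch.
    """
    target = Path(directory)
    target.mkdir(parents=True, exist_ok=True)
    utf8_path = target / "sample-utf8.txt"
    latin1_path = target / "sample-latin1.txt"
    # write_bytes avoids any platform newline or encoding translation.
    utf8_path.write_bytes(CONTENT.encode("utf-8"))
    latin1_path.write_bytes(CONTENT.encode("latin-1"))
    return utf8_path, latin1_path


if __name__ == "__main__":
    for path in write_samples("."):
        print(path, path.stat().st_size, "bytes")
```

After running it inside a local clone, the two files can be added, committed, and pushed with git so that the repository indexer picks them up.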

Searching for `sailorvenus` returns results, as it is the first word. In the Latin1-encoded file, the rest of the context is garbled.
![image](https://user-images.githubusercontent.com/18600385/62816051-f1ec6280-baf7-11e9-87e9-a6d0c9576d36.png)

Searching for `sailormoon` returns no results from the Latin1-encoded file, as the indexing for the rest of the file is garbled:
![image](https://user-images.githubusercontent.com/18600385/62816059-121c2180-baf8-11e9-8751-18c5628ad930.png)


