Updated tokenizer to better matching when search for code snippets (#32261)
author Bruno Sofiato <bruno.sofiato@gmail.com>
Wed, 6 Nov 2024 20:51:20 +0000 (17:51 -0300)
committer GitHub <noreply@github.com>
Wed, 6 Nov 2024 20:51:20 +0000 (20:51 +0000)
commit f64fbd9b74998f3ac8353d2a8344e2e6f0ce1936
tree dbfb9630d889f9a1dc193990a6b33f53fe773902
parent b573512312d82e894db7aac89f4938a6b61e1e70

This PR improves the accuracy of Gitea's code search.

Currently, Gitea does not consider statements such as
`console.log("hello")` as hits when the user searches for `log`. The
culprit is how both ES and Bleve tokenize the file contents (in
both cases, `console.log` is a single token).

In ES' case, we changed the tokenizer to
[simple_pattern_split](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-simplepatternsplit-tokenizer.html#:~:text=The%20simple_pattern_split%20tokenizer%20uses%20a,the%20tokenization%20is%20generally%20faster.).
With it, tokens are words formed by digits and letters, and every other
character acts as a separator. In Bleve's case, the indexer now employs a
[letter](https://blevesearch.com/docs/Tokenizers/) tokenizer.
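
To illustrate the effect on the example above, here is a minimal sketch in
plain Go that only imitates the two splitting rules with the standard
library. It does not call Elasticsearch or Bleve, and `splitAlnum`,
`splitLetters` and `hasToken` are hypothetical helper names that are not part
of this PR. It also assumes "letters" and "digits" mean Unicode letters and
digits, which may differ from the pattern actually configured.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// splitAlnum is a hypothetical stand-in for the Elasticsearch side:
// every rune that is not a letter or a digit acts as a separator.
func splitAlnum(s string) []string {
	return strings.FieldsFunc(s, func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
}

// splitLetters is a hypothetical stand-in for a letter tokenizer:
// only letters form tokens, so digits are separators as well.
func splitLetters(s string) []string {
	return strings.FieldsFunc(s, func(r rune) bool {
		return !unicode.IsLetter(r)
	})
}

// hasToken reports whether term is exactly equal to one of the tokens.
func hasToken(tokens []string, term string) bool {
	for _, t := range tokens {
		if t == term {
			return true
		}
	}
	return false
}

func main() {
	line := `console.log("hello")`

	fmt.Println(splitAlnum(line))                  // [console log hello]
	fmt.Println(splitLetters(line))                // [console log hello]
	fmt.Println(hasToken(splitAlnum(line), "log")) // true
}
```

Because `log` is now a token of its own, an exact-term lookup can match it.
In this sketch the two rules only diverge on digits: `utf8` stays whole under
`splitAlnum` but is cut to `utf` under `splitLetters`.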

Resolves #32220

---------

Signed-off-by: Bruno Sofiato <bruno.sofiato@gmail.com>
18 files changed:
modules/indexer/code/bleve/bleve.go
modules/indexer/code/elasticsearch/elasticsearch.go
modules/indexer/code/indexer_test.go
modules/indexer/internal/bleve/util.go
modules/indexer/internal/bleve/util_test.go
tests/gitea-repositories-meta/org42/search-by-path.git/description
tests/gitea-repositories-meta/org42/search-by-path.git/info/refs
tests/gitea-repositories-meta/org42/search-by-path.git/objects/info/commit-graph [deleted file]
tests/gitea-repositories-meta/org42/search-by-path.git/objects/info/packs
tests/gitea-repositories-meta/org42/search-by-path.git/objects/pack/pack-393dc29256bc27cb2ec73898507df710be7a3cf5.bitmap [deleted file]
tests/gitea-repositories-meta/org42/search-by-path.git/objects/pack/pack-393dc29256bc27cb2ec73898507df710be7a3cf5.idx [deleted file]
tests/gitea-repositories-meta/org42/search-by-path.git/objects/pack/pack-393dc29256bc27cb2ec73898507df710be7a3cf5.pack [deleted file]
tests/gitea-repositories-meta/org42/search-by-path.git/objects/pack/pack-393dc29256bc27cb2ec73898507df710be7a3cf5.rev [deleted file]
tests/gitea-repositories-meta/org42/search-by-path.git/objects/pack/pack-a7bef76cf6e2b46bc816936ab69306fb10aea571.bitmap [new file with mode: 0644]
tests/gitea-repositories-meta/org42/search-by-path.git/objects/pack/pack-a7bef76cf6e2b46bc816936ab69306fb10aea571.idx [new file with mode: 0644]
tests/gitea-repositories-meta/org42/search-by-path.git/objects/pack/pack-a7bef76cf6e2b46bc816936ab69306fb10aea571.pack [new file with mode: 0644]
tests/gitea-repositories-meta/org42/search-by-path.git/objects/pack/pack-a7bef76cf6e2b46bc816936ab69306fb10aea571.rev [new file with mode: 0644]
tests/gitea-repositories-meta/org42/search-by-path.git/packed-refs