
Commit 1c6434b

aws-rafams (Rafael Mosca) and mergify[bot] authored
chore(bedrock): add nori identifiers (#1042)
Co-authored-by: Rafael Mosca <rafams@amazon.es>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
1 parent 275dc92 commit 1c6434b

File tree

3 files changed: +102 −0 lines changed

apidocs/namespaces/opensearchserverless/enumerations/TokenFilterType.md

Lines changed: 42 additions & 0 deletions

@@ -6,44 +6,86 @@

# Enumeration: TokenFilterType

TokenFilterType defines the available token filters for text analysis.
Token filters process tokens after they have been created by the tokenizer.
They can modify, add, or remove tokens based on specific rules.

## Enumeration Members

### CJK\_WIDTH

> **CJK\_WIDTH**: `"cjk_width"`

Normalizes CJK width differences by converting all characters to their fullwidth or halfwidth variants

***

### ICU\_FOLDING

> **ICU\_FOLDING**: `"icu_folding"`

Applies Unicode folding rules for better text matching

***

### JA\_STOP

> **JA\_STOP**: `"ja_stop"`

Removes Japanese stop words from text

***

### KUROMOJI\_BASEFORM

> **KUROMOJI\_BASEFORM**: `"kuromoji_baseform"`

Converts inflected Japanese words to their base form

***

### KUROMOJI\_PART\_OF\_SPEECH

> **KUROMOJI\_PART\_OF\_SPEECH**: `"kuromoji_part_of_speech"`

Tags words with their parts of speech in Japanese text analysis

***

### KUROMOJI\_STEMMER

> **KUROMOJI\_STEMMER**: `"kuromoji_stemmer"`

Reduces Japanese words to their stem form

***

### LOWERCASE

> **LOWERCASE**: `"lowercase"`

Converts all characters to lowercase

***

### NORI\_NUMBER

> **NORI\_NUMBER**: `"nori_number"`

Normalizes Korean numbers to regular Arabic numbers

***

### NORI\_PART\_OF\_SPEECH

> **NORI\_PART\_OF\_SPEECH**: `"nori_part_of_speech"`

Tags words with their parts of speech in Korean text analysis

***

### NORI\_READINGFORM

> **NORI\_READINGFORM**: `"nori_readingform"`

Converts Korean text to its reading form
apidocs/namespaces/opensearchserverless/enumerations/TokenizerType.md

Lines changed: 12 additions & 0 deletions

@@ -12,8 +12,20 @@

> **ICU\_TOKENIZER**: `"icu_tokenizer"`

ICU tokenizer is used for Unicode text segmentation based on UAX #29 rules

***

### KUROMOJI\_TOKENIZER

> **KUROMOJI\_TOKENIZER**: `"kuromoji_tokenizer"`

Kuromoji tokenizer is used for Japanese text analysis and segmentation

***

### NORI\_TOKENIZER

> **NORI\_TOKENIZER**: `"nori_tokenizer"`

Nori tokenizer is used for Korean text analysis and segmentation
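
Each tokenizer targets a different script, so a caller typically selects one per input language. A hypothetical helper (not part of the library) mapping a language tag to the matching member, assuming the enums are exported under the `opensearchserverless` namespace of `@cdklabs/generative-ai-cdk-constructs` as the doc paths suggest:

```ts
import { opensearchserverless } from '@cdklabs/generative-ai-cdk-constructs';

// Hypothetical helper: choose a tokenizer for a BCP 47 language tag.
function tokenizerFor(lang: string): opensearchserverless.TokenizerType {
  switch (lang) {
    case 'ja':
      return opensearchserverless.TokenizerType.KUROMOJI_TOKENIZER; // Japanese
    case 'ko':
      return opensearchserverless.TokenizerType.NORI_TOKENIZER;     // Korean, new in this commit
    default:
      return opensearchserverless.TokenizerType.ICU_TOKENIZER;      // generic Unicode (UAX #29)
  }
}
```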

src/cdk-lib/opensearchserverless/analysis-plugins.ts

Lines changed: 48 additions & 0 deletions

@@ -19,16 +19,64 @@ export enum CharacterFilterType {

```ts
// Also see the following link for more information regarding supported plugins:
// https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless-genref.html#serverless-plugins
export enum TokenizerType {
  /**
   * Kuromoji tokenizer is used for Japanese text analysis and segmentation
   */
  KUROMOJI_TOKENIZER = 'kuromoji_tokenizer',
  /**
   * ICU tokenizer is used for Unicode text segmentation based on UAX #29 rules
   */
  ICU_TOKENIZER = 'icu_tokenizer',
  /**
   * Nori tokenizer is used for Korean text analysis and segmentation
   */
  NORI_TOKENIZER = 'nori_tokenizer',
}

/**
 * TokenFilterType defines the available token filters for text analysis.
 * Token filters process tokens after they have been created by the tokenizer.
 * They can modify, add, or remove tokens based on specific rules.
 */
export enum TokenFilterType {
  /**
   * Converts inflected Japanese words to their base form
   */
  KUROMOJI_BASEFORM = 'kuromoji_baseform',
  /**
   * Tags words with their parts of speech in Japanese text analysis
   */
  KUROMOJI_PART_OF_SPEECH = 'kuromoji_part_of_speech',
  /**
   * Reduces Japanese words to their stem form
   */
  KUROMOJI_STEMMER = 'kuromoji_stemmer',
  /**
   * Normalizes CJK width differences by converting all characters to their fullwidth or halfwidth variants
   */
  CJK_WIDTH = 'cjk_width',
  /**
   * Removes Japanese stop words from text
   */
  JA_STOP = 'ja_stop',
  /**
   * Converts all characters to lowercase
   */
  LOWERCASE = 'lowercase',
  /**
   * Applies Unicode folding rules for better text matching
   */
  ICU_FOLDING = 'icu_folding',
  /**
   * Tags words with their parts of speech in Korean text analysis
   */
  NORI_PART_OF_SPEECH = 'nori_part_of_speech',
  /**
   * Converts Korean text to its reading form
   */
  NORI_READINGFORM = 'nori_readingform',
  /**
   * Normalizes Korean numbers to regular Arabic numbers
   */
  NORI_NUMBER = 'nori_number',
}
```
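
For consumers, the effect of the commit is that a Korean analysis setup can now be described entirely with typed identifiers instead of raw strings. A sketch of an analyzer description built from the new members, assuming a shape of one tokenizer plus an ordered token-filter list (the surrounding construct props may name these fields differently):

```ts
import { opensearchserverless } from '@cdklabs/generative-ai-cdk-constructs';

const { TokenizerType, TokenFilterType } = opensearchserverless;

// Assumed shape: one tokenizer plus an ordered list of token filters.
const koreanAnalysis = {
  tokenizer: TokenizerType.NORI_TOKENIZER,
  tokenFilters: [
    TokenFilterType.NORI_PART_OF_SPEECH,
    TokenFilterType.NORI_READINGFORM,
    TokenFilterType.NORI_NUMBER,
    TokenFilterType.LOWERCASE,
  ],
};
```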
