A shell command to find every n-gram in a text

Question 1

A (mostly) sed solution:

cat "$@" |
    tr -cs -- '._[:alpha:]' '[\n*]' |
    sed -n  -e 'h; :ms' \
            -e 'p; :ss' \
                -e 's/\([[:lower:]]\)[[:upper:]][[:lower:]]*$/\1/p; t ss' \
                -e 's/\([[:lower:]]\)[[:upper:]][[:upper:]]*$/\1/p; t ss' \
                -e 's/\([[:upper:]]\)[[:upper:]][[:lower:]]\+$/\1/p; t ss' \
                -e 's/[._][[:alpha:]][[:lower:]]*$//p; t ss' \
                -e 's/[._][[:upper:]]\+$//p; t ss' \
            -e 'g' \
            -e 's/^[[:upper:]]\?[[:lower:]]\+\([[:upper:]]\)/\1/; t mw' \
            -e 's/^[[:upper:]]\+\([[:upper:]][[:lower:]]\)/\1/; t mw' \
            -e 's/^[[:alpha:]][[:lower:]]*[._]//; t mw' \
            -e 's/^[[:upper:]]\+[._]//; t mw' \
            -e 'b' \
            -e ':mw; h; b ms'

The algorithm is

for each compound word (e.g., “FOO_BAR_test”) in the input
do
    repeat
        print what you’ve got
        repeat
            remove a small word from the end (e.g., “FOO_BAR_test” → “FOO_BAR”) and print what’s left
        until you’re down to the last one (e.g., “FOO_BAR_test” → “FOO”)
        go back to what you had at the beginning of the above loop
          and remove a small word from the beginning
          (e.g., “FOO_BAR_test” → “BAR_test”) ... but don’t print anything
    until you’re down to the last one (e.g., “FOO_BAR_test” → “test”)
end for loop
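
To make the loop order concrete, here is a minimal sketch of the same double loop in plain POSIX shell. It is not the sed script: it assumes the compound word has already been split into small words passed as arguments, and it joins them with spaces instead of keeping the original separators. The function name `ngrams` is made up for this illustration.

```shell
# ngrams WORD...  (hypothetical helper, for illustration only)
# Prints every contiguous run of the given words, in the order
# described by the pseudocode above.
ngrams() {
    # Outer loop: one pass per starting word.
    while [ "$#" -gt 0 ]; do
        n=$#
        # Inner loop: print the first n words, then drop one from the end.
        while [ "$n" -gt 0 ]; do
            i=0
            line=
            for w in "$@"; do
                i=$((i + 1))
                [ "$i" -gt "$n" ] && break
                line="${line:+$line }$w"
            done
            printf '%s\n' "$line"
            n=$((n - 1))
        done
        # Remove a small word from the beginning without printing.
        shift
    done
}

ngrams FOO BAR test
```

This prints `FOO BAR test`, `FOO BAR`, `FOO`, `BAR test`, `BAR`, `test`, one per line, which is the same order the sed script uses for “FOO_BAR_test”.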

Details:

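`cat "$@"` is a UUOC (useless use of cat). I usually avoid these; you can do `tr args < file`, but you can't pass multiple files directly to `tr`.
`tr -cs -- '._[:alpha:]' '[\n*]'` breaks a line of many compound words into separate lines; e.g.,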
```
I_amAManTest you_haveAHouse FOO_BAR_test
```
becomes
```
I_amAManTest
you_haveAHouse
FOO_BAR_test
```
so sed can process one compound word at a time.
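
If you want to try the `tr` stage on its own, something like the following should do. The wrapper name `split_words` is only for this illustration; the `tr` arguments are the ones from the pipeline.

```shell
# split_words: hypothetical wrapper around the tr stage only.
# -c complements the set (everything that is NOT '.', '_' or a letter),
# and -s squeezes each run of resulting newlines into a single one.
split_words() {
    tr -cs -- '._[:alpha:]' '[\n*]'
}

printf '%s\n' 'I_amAManTest you_haveAHouse, FOO_BAR_test!' | split_words
```

Punctuation and digits are treated as separators just like spaces, so the comma and the exclamation mark disappear and the three compound words come out one per line.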
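`sed -n`: don't print anything automatically; print only when commanded.
`-e` specifies that the following expression is part of the sed script.
`h`: copy the pattern space to the hold space.
`:ms`: a label (Main loop Start)
`p`: print
`:ss`: a label (Secondary loop Start)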
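The following commands remove one small word from the end of the compound word and, if successful, print the result and jump back to the beginning of the secondary loop.
- `s/\([[:lower:]]\)[[:upper:]][[:lower:]]*$/\1/p; t ss`: change “nTest” to “n”.
- `s/\([[:lower:]]\)[[:upper:]][[:upper:]]*$/\1/p; t ss`: change “mOK” to “m”.
- `s/\([[:upper:]]\)[[:upper:]][[:lower:]]\+$/\1/p; t ss`: change “AMan” to “A”.
- `s/[._][[:alpha:]][[:lower:]]*$//p; t ss`: delete “_am” (replace it with nothing).
- `s/[._][[:upper:]]\+$//p; t ss`: delete “_BAR” (replace it with nothing).
This is the end of the secondary loop.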
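
To watch the secondary loop in isolation, you can run just those five commands. This is a cut-down sketch, not the full script: the function name `strip_from_end` is made up, and since nothing here needs the hold space, `h` and `g` are left out. Note that `\+` is a GNU sed extension.

```shell
# strip_from_end: hypothetical name; runs only the secondary loop.
# Prints the word, then keeps removing one small word from the end
# and printing what is left until no rule matches.
strip_from_end() {
    sed -n -e 'p; :ss' \
        -e 's/\([[:lower:]]\)[[:upper:]][[:lower:]]*$/\1/p; t ss' \
        -e 's/\([[:lower:]]\)[[:upper:]][[:upper:]]*$/\1/p; t ss' \
        -e 's/\([[:upper:]]\)[[:upper:]][[:lower:]]\+$/\1/p; t ss' \
        -e 's/[._][[:alpha:]][[:lower:]]*$//p; t ss' \
        -e 's/[._][[:upper:]]\+$//p; t ss'
}

printf '%s\n' 'I_amAManTest' | strip_from_end
```

For “I_amAManTest” this prints `I_amAManTest`, `I_amAMan`, `I_amA`, `I_am` and `I`, which is the first group of lines in the full output.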
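`g`: copy the hold space to the pattern space (go back to what we had at the beginning of the above loop).
The following commands remove one small word from the beginning of the compound word and, if successful, jump to the main loop wrap-up (mw = Main loop Wrap-up).
- `s/^[[:upper:]]\?[[:lower:]]\+\([[:upper:]]\)/\1/; t mw`: change “amA” to “A” and “ManT” to “T”.
- `s/^[[:upper:]]\+\([[:upper:]][[:lower:]]\)/\1/; t mw`: change “AMa” to “Ma”.
- `s/^[[:alpha:]][[:lower:]]*[._]//; t mw`: delete “I_” and “you_” (replace them with nothing).
- `s/^[[:upper:]]\+[._]//; t mw`: delete “FOO_” (replace it with nothing).
Each of the above substitute commands jumps to the main loop wrap-up (below) if it succeeds (i.e., if it found/matched something). If we get here, the pattern space contains only a single small word, so we're done with this compound word.
`b`: branch (jump) to the end of the sed script; i.e., end this cycle, so sed moves on to the next compound word.
`:mw`: the label for the main loop wrap-up.
`h`: copy the pattern space to the hold space, ready for the next iteration of the main loop.
`b ms`: jump to the beginning of the main loop.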
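
The main loop can be tried on its own in the same way. In this sketch (again with a made-up function name, `strip_from_start`) the secondary loop is dropped, so the pattern space is never altered between iterations and the `h`/`g` pair and the `mw` label are not needed; each successful substitution simply branches back to `:ms`. It relies on the GNU sed extensions `\?` and `\+`, like the full script.

```shell
# strip_from_start: hypothetical name; runs only the main loop.
# Prints the word, removes one small word from the beginning,
# and repeats until no rule matches.
strip_from_start() {
    sed -n -e ':ms' -e 'p' \
        -e 's/^[[:upper:]]\?[[:lower:]]\+\([[:upper:]]\)/\1/; t ms' \
        -e 's/^[[:upper:]]\+\([[:upper:]][[:lower:]]\)/\1/; t ms' \
        -e 's/^[[:alpha:]][[:lower:]]*[._]//; t ms' \
        -e 's/^[[:upper:]]\+[._]//; t ms'
}

printf '%s\n' 'I_amAManTest' | strip_from_start
```

This prints `I_amAManTest`, `amAManTest`, `AManTest`, `ManTest` and `Test`: the starting point of each pass of the main loop in the full script.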

It produces the requested output. Unfortunately, it's in a different order. I could probably fix that if it matters.

$ echo "I_amAManTest you_haveAHouse FOO_BAR_test" | ./myscript
I_amAManTest
I_amAMan
I_amA
I_am
I
amAManTest
amAMan
amA
am
AManTest
AMan
A
ManTest
Man
Test
you_haveAHouse
you_haveA
you_have
you
haveAHouse
haveA
have
AHouse
A
House
FOO_BAR_test
FOO_BAR
FOO
BAR_test
BAR
test

Question 2

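Your best bet is probably to find a tokenizer module for perl. Grep can't do this without multiple runs, and would probably need `-P` (PCRE).

Here's a partial solution without any perl modules: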

while (<>) {
  my $n = 1;
  while (/(\S+)/g) {
    printf "// outputting whitespace-separated word %d\n", $n++;
    my $whole = $1;
    while ($whole =~ /([a-zA-Z0-9][a-z]*+)/g) {
      print "$1\n";
    }
    print "$whole\n";    # whole space-delimited tokens
  }
}

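This reads input from standard input or from files, one line at a time. `$n` is the word counter for the printed comment. We then iterate over words (delimited by whitespace, so the regex `/(\S+)/g` globally matches runs of non-space characters). Within each word, we iterate over the token parts using `([a-zA-Z0-9][a-z]*+)`, which matches anything that starts with a digit or a letter followed by zero or more lowercase letters (`*+` is like `*` but with backtracking disabled, to guard against ReDoS). After printing all the matched tokens in the word, we print the whole word.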

You can run it as `perl solution.pl input.txt` or inline, like so:

$ echo "I_amAManTest you_haveAHouse FOO_BAR_test.model" |perl solution.pl
// outputting whitespace-separated word 1
I
am
A
Man
Test
I_amAManTest
// outputting whitespace-separated word 2
you
have
A
House
you_haveAHouse
// outputting whitespace-separated word 3
F
O
O
B
A
R
test
model
FOO_BAR_test.model

Note that this lacks the multi-part sub-tokens of a word.

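Also note that your request to parse `I_AmAMan` as `I`, `Am`, `A`, `Man` conflicts with your request to parse `FOO_BAR` as `FOO`, `BAR` rather than as `F`, `O`, `O`, `B`, ... like the above code does. (Perhaps a better example: what should `I_AmOK` become? Three unigrams or four?)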
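
The ambiguity is easy to see from the shell. As a rough sketch, the same token pattern can be fed to `grep -o`; this assumes a grep that supports `-o` (GNU and BSD grep do, but it is not in POSIX), and the function name `tokens` is invented for this example. ERE has no possessive `*+`, but a plain greedy `*` finds the same matches here because nothing follows it in the pattern.

```shell
# tokens: hypothetical helper; one token per line, using the same
# character classes as the Perl regex. LC_ALL=C keeps [a-z] and [A-Z]
# limited to ASCII letters regardless of the user's locale.
tokens() {
    LC_ALL=C grep -oE '[A-Za-z0-9][a-z]*'
}

printf '%s\n' 'I_AmOK' | tokens
```

With this rule `I_AmOK` comes out as four unigrams (`I`, `Am`, `O`, `K`), because every uppercase letter starts a new token.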

Answer