Extracting text from a VTT file

2024-6-6

text-processing sed grep regular-expression json

The VTT file looks like this:

WEBVTT

1
00:00:00.096 --> 00:00:05.047
you're the four functions if you would of 
management first of all you have the planning

2
00:00:06.002 --> 00:00:10.079
the planning stages basically you were choosing appropriate 
 organizational goals and courses

3
00:00:11.018 --> 00:00:13.003
action to best achieve those goals

I only need the text, like this:

you're the four functions if you would of management first of all you have the planning the planning stages basically you were choosing appropriate organizational goals and courses action to best achieve those goals

On Ubuntu I tried:

cat file.vtt | grep -v '[0-9][0-9]:[0-9][0-9]:[0-9][0-9].[0-9][0-9][0-9][[:space:]][[:punct:]][[:punct:]][[:punct:]][[:space:]][0-9][0-9]:[0-9][0-9]:[0-9][0-9].[0-9][0-9][0-9]'

This gave me:

WEBVTT

1
you're the four functions if you would of 
management first of all you have the planning

2
the planning stages basically you were choosing appropriate 
 organizational goals and courses

3
action to best achieve those goals

But I don't know how to do the rest. I'm thinking of replacing

\n[0-9]+\n\n

with a space, but I don't know how to get sed or grep to do that.

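How can I, using basic/portable tools (i.e. ones normally pre-installed on Ubuntu, CentOS, etc., such as grep, sed, or tr), get the raw text with the subtitle timings removed, all on one line (no newlines)?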

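Note: This must work with characters from other languages, such as Chinese, Hindi, and Arabic, so preferably no [a-z]-style matching; instead, remove the timing lines, which have a very consistent format. Also, don't blindly remove every digit, because the text itself can contain numbers.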

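Note 2: The end goal is to make the text safe for use as a JSON value, so all special characters are removed and double quotes are escaped, but that is outside the scope of this question.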

Answer 1

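Since your file appears to consist of a series of records separated by one or more blank lines, I'd suggest trying something based on the paragraph mode of either awk or perl.

For example, if you always need to remove the first two lines, e.g.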

1
00:00:00.096 --> 00:00:05.047

you can split each blank-line-separated paragraph into newline-separated fields and skip the first two fields using either

awk -vRS= -vORS= -F'\n' '{for(j=3;j<=NF;j++) print $j; print " "}' file.vtt

or

perl -F'\n' -00ane 'print join("", @F[2..$#F]), " "' file.vtt
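
To see what paragraph mode actually hands to those commands, it can help to dump the records first. This is only an inspection sketch; the function name show_records is made up for illustration and is not part of the answer.

```shell
# show_records is a hypothetical helper name used only for this illustration.
# Usage: show_records file.vtt
show_records() {
  # An empty RS turns on paragraph mode: records are separated by blank lines.
  # -F'\n' makes every line of a record its own field.
  awk -v RS= -F'\n' '{ printf "record %d: %d lines, first: %s\n", NR, NF, $1 }' "$1"
}
```

On a file shaped like the sample, the header should show up as a one-line record and each cue as a record whose first two fields are the cue number and the timing line, which is why the commands above start at the third field.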

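If you can't rely on a fixed number of fields (lines) to remove, then it's fairly easy to add a regular-expression test. This is somewhat easier in perl, because it lets us grep directly over the array rather than write an explicit loop. For example, to split into blank-line-separated records and then print only those fields (lines) that contain a run of at least 3 alphabetic characters, you could use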

perl -F'\n' -00ane '
  print join("", grep { /[[:alpha:]]{3}/ } @F), " "
' file.vtt

If you want to exclude the WEBVTT string, you can simply skip the first record, i.e.

perl -F'\n' -00ane '
  print join("", grep { /[[:alpha:]]{3}/ } @F), " " if $. > 1
  ' file.vtt

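You can choose a suitable regular expression to capture the lines you want and exclude the ones you don't. If you want a final newline appended to the joined output, you can add an END block in either awk or perl.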
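
Since the question asks for something that also works for Chinese, Hindi, or Arabic text, one option is to match the lines to drop rather than the letters to keep. The following is a minimal sketch under a few assumptions: the first record is the WEBVTT header, a cue number (if present) is the first line of its record, the timing line is one of the first two lines, and line endings are already plain LF. The function name vtt_to_line is hypothetical. The END block supplies the final newline.

```shell
# vtt_to_line is a hypothetical helper name, not something from the answer.
# Usage: vtt_to_line file.vtt
vtt_to_line() {
  awk -v RS= -F'\n' '
    NR == 1 && $1 ~ /^WEBVTT/ { next }             # skip the header record
    {
      for (j = 1; j <= NF; j++) {
        # A purely numeric first line is treated as the cue number.
        if (j == 1 && $j ~ /^[0-9]+$/) continue
        # A timing line is only expected in the first two lines of a cue.
        if (j <= 2 && $j ~ /^[0-9:.]+[ \t]+-->[ \t]+[0-9:.]+/) continue
        line = $j
        gsub(/^[ \t]+|[ \t]+$/, "", line)          # trim leading and trailing blanks
        if (line != "") out = out (out == "" ? "" : " ") line
      }
    }
    END { print out }                              # one line, newline-terminated
  ' "$1"
}
```

Because nothing here tests for alphabetic characters, lines that consist only of non-Latin text or of numbers inside a cue are kept.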

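Note: since (based on the discussion in the comments) your file appears to have DOS-style CRLF line endings, you will need to deal with those, either by modifying the field and record separators in the commands above accordingly, or by removing the CRs first, e.g.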

sed 's/\r$//' file.vtt | 
  perl -F'\n' -00ane '
    print join("", grep { /[[:alpha:]]{3}/ } @F), " " if $. > 1
  '
you're the four functions if you would of management first of all you have the planning the planning stages basically you were choosing appropriate  organizational goals and courses action to best achieve those goals steeldriver@xenial-vm:~/test/$

Answer 2

OK, here is what I ended up with

#!/bin/bash
fname=$1
sed 's/\r$//' "$fname"    |\
grep -v -- "-->"          |\
grep -v "^$"              |\
grep -E -v "^[0-9]+$"     |\
sed 's/WEBVTT//'          |\
tr '\n' ' '               |\
tr -s ' '                 |\
tr -d '\t'                |\
sed 's/\\/\\\\/g'         |\
sed 's/"/\\"/g'

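Fix Windows line endings
Keep all lines that do not contain -->
Keep all lines that are not empty (I think this is faster, maybe not)
Keep all lines that are not just digits
Remove the WEBVTT header
Replace newlines with spaces
Squeeze multiple spaces into one
Remove tabs
Escape any backslashes (for JSON)
Escape any double quotes (for JSON)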

Thanks to @steeldriver for the Windows line-ending fix.

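I wouldn't use this in production because it's a bit weak; for example, it would skip text lines such as "You are --> my friend", and probably some other cases too, but it should be good enough for my purposes (posting to Solr for search).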

I realize this is fairly inefficient. I'd welcome suggestions on that.
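
One possible way to cut the process count, offered as a sketch rather than a tested replacement: let a single awk do the filtering and joining line by line, then reuse the same two sed substitutions for the JSON escaping. It assumes each cue is an optional number line, then a timing line, then text up to the next blank line. The function name vtt_to_json_text is hypothetical. Unlike the pipeline above, it turns tabs into spaces instead of deleting them, and it ends the output with a newline.

```shell
# vtt_to_json_text is a hypothetical helper name used only for this sketch.
# Usage: vtt_to_json_text file.vtt
vtt_to_json_text() {
  awk '
    { sub(/\r$/, "") }                     # strip a trailing CR (Windows line endings)
    NR == 1 && /WEBVTT/ { next }           # drop the header line
    /^[ \t]*$/ { in_text = 0; next }       # a blank line ends the current cue
    !in_text {                             # still in the cue header
      if ($0 ~ /-->/) in_text = 1          # the timing line is the last header line
      next                                 # skip cue number and timing line
    }
    {
      gsub(/\t/, " ")                      # tabs become spaces
      out = out " " $0                     # join the text lines with spaces
    }
    END {
      gsub(/  +/, " ", out)                # squeeze runs of spaces
      sub(/^ /, "", out); sub(/ $/, "", out)
      print out
    }
  ' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'    # escape backslashes, then double quotes
}
```

Because a line is only treated as a timing line while awk is still in the cue header, a text line such as "You are --> my friend" inside a cue is kept rather than skipped.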
