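I was wondering whether there is an easy way, maybe a one-liner, using Unix CLI tools to split a CSV file with ISO-8601 UTC timestamps with millisecond precision (+%FT%T.%3NZ, e.g. 2021-05-27T13:59:33.641Z) along a defined time offset/break/difference (e.g. two hours).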
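As always, there are several ways to achieve this, and other options may also be relevant for a comprehensive answer and for other users with a similar problem. That said, I...
- ...use git 2.31.1's GNU Bash 4.4.23, GNU sed 4.8, GNU Awk 5.0.0 (and all the other tools it is bundled with), xsv 0.13.0 and jq 1.6 on Windows 7
- ...would prefer to use it in a script rather than in an interactive shell
- ...use a semicolon (;) as the delimiter, not a comma
- ...do not quote my values (e.g. enclosed in single (') or double (") quotes)
- ...do not have a header
- ...already have the whole CSV stored in a variable and would also like to store the results in a variable (array?) so that I can analyze them further
- My columns do not actually have a fixed length and may contain spaces and hyphens in addition to alphanumeric characters
- The timestamp is the fifth of eight columns in my real-world data
- The file can be assumed to be at most 250k lines and 20 MiB
- While it would be nice if the script/command took less than half a second on my i5-4300U, up to 5 to 10 seconds would still not be a deal breaker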
Example
If I had 2 hours as the offset for splitting (and I have not mixed anything up), then this file:
abc;square;2021-05-27T14:15:39.315Z
def;circle;2021-05-27T14:17:03.416Z
ghi;triang;2021-05-27T14:45:13.520Z
abc;circle;2021-05-27T15:25:47.624Z
ghi;square;2021-05-27T17:59:33.641Z
def;triang;2021-05-27T18:15:33.315Z
abc;circle;2021-05-27T21:12:13.350Z
ghi;triang;2021-05-27T21:15:31.135Z
would be split into the following three parts:
abc;square;2021-05-27T14:15:39.315Z
def;circle;2021-05-27T14:17:03.416Z
ghi;triang;2021-05-27T14:45:13.520Z
abc;circle;2021-05-27T15:25:47.624Z

ghi;square;2021-05-27T17:59:33.641Z
def;triang;2021-05-27T18:15:33.315Z

abc;circle;2021-05-27T21:12:13.350Z
ghi;triang;2021-05-27T21:15:31.135Z
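Disclaimer: I am not a native speaker, so if rephrasing makes this question easier to understand, please go ahead. The verbosity, e.g. also specifying options that do not apply to my use case (comma, quotes), or using both the word semicolon and the symbol ; in the text of this question, is there for SEO purposes.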
Answer 1
Given the sample CSV data in the variable $csv:
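gawk '
    # Convert an ISO-8601 timestamp to epoch seconds (the fraction is dropped).
    # The utc-flag (second argument of mktime) makes gawk read the fields as UTC.
    function timestamp2epoch(ts,    m) {
        if (match(ts, /([0-9]{4})-([0-9]{2})-([0-9]{2})T([0-9]{2}):([0-9]{2}):([0-9]{2})\..*/, m))
            return mktime(m[1] " " m[2] " " m[3] " " m[4] " " m[5] " " m[6], 1)
        else
            return -1
    }
    BEGIN {
        FS = ";"
        interval = 2 * 3600     # 2 hours
    }
    { t = timestamp2epoch($3) }
    # Start a new batch once a record is more than `interval` past the batch start.
    t > start + interval { start = t; n++ }
    # Append the record to the current batch, separated by "/".
    { batch[n] = batch[n] (batch[n] == "" ? "" : "/") $0 }
    END {
        PROCINFO["sorted_in"] = "@ind_num_asc"
        for (i in batch)
            print batch[i]
    }
' <<<"$csv"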
Output:
abc;square;2021-05-27T14:15:39.315Z/def;circle;2021-05-27T14:17:03.416Z/ghi;triang;2021-05-27T14:45:13.520Z/abc;circle;2021-05-27T15:25:47.624Z
ghi;square;2021-05-27T17:59:33.641Z/def;triang;2021-05-27T18:15:33.315Z
abc;circle;2021-05-27T21:12:13.350Z/ghi;triang;2021-05-27T21:15:31.135Z
This can be read into a shell array, e.g.:
mapfile -t batches < <(gawk '...' <<<"$csv")
declare -p batches
declare -a batches=([0]="abc;square;2021-05-27T14:15:39.315Z/def;circle;2021-05-27T14:17:03.416Z/ghi;triang;2021-05-27T14:45:13.520Z/abc;circle;2021-05-27T15:25:47.624Z" [1]="ghi;square;2021-05-27T17:59:33.641Z/def;triang;2021-05-27T18:15:33.315Z" [2]="abc;circle;2021-05-27T21:12:13.350Z/ghi;triang;2021-05-27T21:15:31.135Z")
You can then iterate over them like this:
for ((i = 0; i < "${#batches[@]}"; i++)); do
    IFS="/" read -ra records <<<"${batches[i]}"
    echo "batch $i"
    for record in "${records[@]}"; do echo " $record"; done
    echo
done
batch 0
 abc;square;2021-05-27T14:15:39.315Z
 def;circle;2021-05-27T14:17:03.416Z
 ghi;triang;2021-05-27T14:45:13.520Z
 abc;circle;2021-05-27T15:25:47.624Z

batch 1
 ghi;square;2021-05-27T17:59:33.641Z
 def;triang;2021-05-27T18:15:33.315Z

batch 2
 abc;circle;2021-05-27T21:12:13.350Z
 ghi;triang;2021-05-27T21:15:31.135Z
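The question asks for millisecond precision and mentions that the timestamp is the fifth of eight columns in the real data. As a minimal sketch of the same batching rule (it is not part of the answer above), the following uses only POSIX awk features and computes epoch milliseconds arithmetically, so it needs neither mktime() nor a particular time zone setting. The function name split_batches, its two arguments (timestamp column and interval in hours) and the helper names are illustrative assumptions, and it assumes every timestamp has exactly three fractional digits and a trailing Z.

```shell
# split_batches COL HOURS
# Reads ';'-separated records on stdin and prints one line per batch,
# with the records of a batch joined by "/".
split_batches() {
    awk -F';' -v col="$1" -v hours="$2" '
    # Days since 1970-01-01 for a Gregorian calendar date (years >= 1).
    function days_from_civil(y, m, d,    era, yoe, doy, doe) {
        if (m <= 2) y--
        era = int(y / 400)
        yoe = y - era * 400
        doy = int((153 * (m > 2 ? m - 3 : m + 9) + 2) / 5) + d - 1
        doe = yoe * 365 + int(yoe / 4) - int(yoe / 100) + doy
        return era * 146097 + doe - 719468
    }
    # Epoch milliseconds for a timestamp such as 2021-05-27T14:15:39.315Z.
    function epoch_ms(ts,    p) {
        split(ts, p, /[-T:.Z]/)
        return ((days_from_civil(p[1] + 0, p[2] + 0, p[3] + 0) * 24 + p[4]) * 60 + p[5]) * 60000 + p[6] * 1000 + p[7]
    }
    BEGIN { interval = hours * 3600 * 1000 }
    {
        t = epoch_ms($col)
        if (NR == 1 || t > start + interval) {
            # first record, or more than the interval past the batch start
            start = t
            batch[++n] = $0
        } else {
            batch[n] = batch[n] "/" $0
        }
    }
    END { for (i = 1; i <= n; i++) print batch[i] }
    '
}
```

For the three-column sample this would be called as printf '%s\n' "$csv" | split_batches 3 2, and for the real-world layout as split_batches 5 2. In Bash the result can be captured as before with mapfile -t batches < <(split_batches 5 2 <<<"$csv").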
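Answer 2
The following perl script prints the input file, adding a blank line every time it sees a line that is not within 2 hours of the start of the current time period. This splits the input into batches with a maximum duration of 2 hours.
The start of the period is set when the first line is read, and is only updated when an extra blank line is printed. This ensures that a new batch is started at least every 2 hours. Otherwise, your sample input would be split into only two batches (the 6 lines from 14:15 to 18:15, and the 2 lines at 21:12 and 21:15), and an extra log entry at 16:45 plus another at 20:00 would prevent the sample input from being split at all.
It takes the date and time from the third field of the input. Note that perl arrays start from zero, not one, so $F[2] is the third field of the array @F.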
#!/usr/bin/perl
use strict;
use Date::Parse;
my $start;
while(<>) {
    chomp;
    my $approx;
    my @F = split /;/;
    # approximate date/time to start of hour
    ($approx = $F[2]) =~ s/:\d\d:\d\d\.\d+Z$/:00:00/;
    my $now = str2time($approx);
    $start = $now if ($. == 1);
    if (($now - $start) > 7200) {
        $start = $now;
        print "\n";
    };
    print "$_\n";
}
Sample output:
$ ./split.pl input.csv
abc;square;2021-05-27T14:15:39.315Z
def;circle;2021-05-27T14:17:03.416Z
ghi;triang;2021-05-27T14:45:13.520Z
abc;circle;2021-05-27T15:25:47.624Z

ghi;square;2021-05-27T17:59:33.641Z
def;triang;2021-05-27T18:15:33.315Z

abc;circle;2021-05-27T21:12:13.350Z
ghi;triang;2021-05-27T21:15:31.135Z
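The difference between anchoring the window at the start of a batch and comparing each line with the one before it can be tried out with a small sketch. It is not the perl script above: it is a POSIX awk illustration that compares exact seconds instead of hour-truncated times, and to stay short it assumes all timestamps fall on the same day. The function name split_mode and its mode names anchor and gap are made up for this example, and the two rows at 16:45 and 20:00 are invented to match the scenario described above.

```shell
# split_mode MODE: copy stdin to stdout, printing a blank line before each new batch.
#   anchor: compare each line with the first line of the current batch
#   gap:    compare each line with the line before it
split_mode() {
    awk -F';' -v mode="$1" -v limit=7200 '
    {
        split($3, p, /[T:.]/)                   # p[2]=HH, p[3]=MM, p[4]=SS
        now = p[2] * 3600 + p[3] * 60 + p[4]    # seconds since midnight (same-day assumption)
        if (NR > 1 && now - ref > limit) { print ""; ref = now }
        if (NR == 1 || mode == "gap") ref = now
        print
    }'
}

# Sample rows from the question plus two invented entries at 16:45 and 20:00.
input='abc;square;2021-05-27T14:15:39.315Z
def;circle;2021-05-27T14:17:03.416Z
ghi;triang;2021-05-27T14:45:13.520Z
abc;circle;2021-05-27T15:25:47.624Z
jkl;square;2021-05-27T16:45:00.000Z
ghi;square;2021-05-27T17:59:33.641Z
def;triang;2021-05-27T18:15:33.315Z
jkl;circle;2021-05-27T20:00:00.000Z
abc;circle;2021-05-27T21:12:13.350Z
ghi;triang;2021-05-27T21:15:31.135Z'

echo "--- anchor ---"; printf '%s\n' "$input" | split_mode anchor
echo "--- gap ---";    printf '%s\n' "$input" | split_mode gap
```

With these ten rows the anchor mode yields three batches, while the gap mode never sees two neighbouring lines more than 2 hours apart and therefore leaves the input as a single batch.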
If you need the output in separate files, you could do this:
#!/usr/bin/perl
use strict;
use Date::Parse;
my $start;
# output-file counter
my $fc = 1;
my $outfile = "file.$fc.csv";
open (my $fh, ">", $outfile) || die "couldn't open $outfile for write: $!\n";
while(<>) {
    chomp;
    my $approx;
    my @F = split /;/;
    # approximate date/time to start of hour
    ($approx = $F[2]) =~ s/:\d\d:\d\d\.\d+Z$/:00:00/;
    my $now = str2time($approx);
    $start = $now if ($. == 1);
    if (($now - $start) > 7200) {
        $start = $now;
        close($fh);
        $fc++;
        $outfile = "file.$fc.csv";
        open ($fh, ">", $outfile) || die "couldn't open $outfile for write: $!\n";
    };
    print $fh "$_\n";
}
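If you want either version of the script to be more flexible about the time formats it can handle, use:
($approx = $F[2]) =~ s/:\d\d:\d\d(?:\.\d+)?Z?$/:00:00/;
This makes both the fractional part and the Z in the time string optional.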
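To see what the more permissive pattern accepts, here is a small sketch that ports it to sed -E (extended regular expressions) rather than Perl. The function name truncate_to_hour is only illustrative, and the sed pattern uses a plain group instead of Perl's non-capturing (?:...) because POSIX extended regular expressions have no such construct.

```shell
# Truncate a timestamp to the start of its hour; the fractional seconds
# and the trailing Z are both optional.
truncate_to_hour() {
    sed -E 's/:[0-9]{2}:[0-9]{2}(\.[0-9]+)?Z?$/:00:00/'
}

printf '%s\n' \
    '2021-05-27T14:15:39.315Z' \
    '2021-05-27T14:15:39Z' \
    '2021-05-27T14:15:39' | truncate_to_hour
```

All three input variants come out as 2021-05-27T14:00:00.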
Answer 3
Using GNU awk for gensub() and mktime():
$ cat tst.awk
BEGIN {
    FS = ";"
    maxSecs = 2 * 60 * 60
    prevTime = -(maxSecs + 1)
}
{
    split($3,dt,/[.]/)
    dateHMS = gensub(/[-T:]/," ","g",dt[1])
    # dt[2] is e.g. "315Z"; the trailing Z is ignored when the string is
    # converted to a number in the subtraction below
    currTime = mktime(dateHMS,1) "." dt[2]
    secsDelta = currTime - prevTime
    prevTime = currTime
}
secsDelta > maxSecs {
    close(out)
    out = "out" (++numOut)
}
{ print > out }
$ awk -f tst.awk file
$ head out?
==> out1 <==
abc;square;2021-05-27T14:15:39.315Z
def;circle;2021-05-27T14:17:03.416Z
ghi;triang;2021-05-27T14:45:13.520Z
abc;circle;2021-05-27T15:25:47.624Z
==> out2 <==
ghi;square;2021-05-27T17:59:33.641Z
def;triang;2021-05-27T18:15:33.315Z
==> out3 <==
abc;circle;2021-05-27T21:12:13.350Z
ghi;triang;2021-05-27T21:15:31.135Z
Answer 4
If all the dates in the file fall on the same day:
#!/usr/bin/awk -f
BEGIN {
    FS=OFS=";"
    ho = 1
}
{
    # Split the last field into date and time
    split($NF, a, "T")
    # Get the hour from the time and force it to a number,
    # so that the comparison below is numeric
    h = a[2]
    sub(/:.*$/, "", h)
    h += 0
    if (lh == 0) lh = h+ho
    if (h > lh) {
        lh = h+ho
        # blank line between batches
        print ""
    }
}1
You can edit ho (the hour offset) in the BEGIN block of the script to split the csv at other hour offsets.
#!/usr/bin/awk -f
BEGIN {
    FS=OFS=";"
    # Set here the hour offset
    hour_offset = 1
    # Get the hour value in seconds
    ho = 60 * 60 * hour_offset
}
{
    # Work on a copy so that the trailing Z stays in the printed record
    ts = $NF
    sub(/Z$/, "", ts)
    # Call /bin/date (one process per input line) and translate
    # the timestamp to epoch seconds.
    cmd="/bin/date -d " ts " +%s"
    epoch=((cmd | getline line) > 0 ? line : -1)
    close(cmd)
    if (epoch == -1) {
        print "date failed at line " NR
        exit 1
    }
    # If lh is not set yet, set it to the current epoch
    # time plus the chosen offset
    if (!lh) lh = epoch + ho
    # If the current time is more than one further offset past lh
    # (i.e. more than 2 * ho after the start of the batch), move lh
    # forward and print the separator
    if (epoch - lh > ho) {
        lh = epoch + ho
        print ""
    }
}1