How do I get fixed-width, vertically oriented key-value pairs into a CSV file?

I have a text file containing content of the following kind:

OPERATION_CONTEXT VMD1HTE1A71_ns:.oc.GJ_OAD2 alarm_object 1130 On director: VMD1HTE1A71_ns:.temip.VMD1HTE1A71_director AT Fri 18 Oct 2013 06:56:39 All Attributes

                         Identifier = 1130
                              State = Terminated
                     Problem Status = Closed
              Clearance Report Flag = True
                    Escalated Alarm = False
              Close User Identifier = "Auto-Clear"
        Termination User Identifier = "Auto-Clear"
                   Close Time Stamp = Fri 18 Oct 2013 05:01:46
             Termination Time Stamp = Fri 18 Oct 2013 05:01:46
                 Creation Timestamp = Fri 18 Oct 2013 04:37:29
               Clearance Time Stamp = Fri 18 Oct 2013 05:01:40
        Last Modification Timestamp = Fri 18 Oct 2013 05:01:46
                     Previous State = Outstanding
                     Managed Object = Alcatel_5529OAD VMD1HTE1A71_ns:.OAD2 MD "AMS" Node "INGJJMGRJMTSNB0001AG2OLT001"
                    Target Entities = { Alcatel_5529OAD VMD1HTE1A71_ns:.OAD2 MD "AMS" Node "INGJJMGRJMTSNB0001AG2OLT001" }
                         Alarm Type = CommunicationsAlarm
                         Event Time = Fri 18 Oct 2013 05:01:40
                     Probable Cause = Unknown
                  Specific Problems = {  }
            Notification Identifier = 160315
                             Domain = Domain VMD1HTE1A71_ns:.dm.GJ_OAD2
                       Alarm Origin = IncomingAlarm
                 Perceived Severity = Major
                    Additional Text = "
                                                                  nativeProbableCause: Attempt Threshold Crossed
                                                                  osTime: 20131018163727.250+0530
                                                                  neTime: 20131011174021.0+0530
                                                                  notificationId: AMS:160315
                                                                  portNumber:
                                                                  ftpNumber:
                                                                  meNm: INGJJMGRJMTSNB0001AG2OLT001
                                                                  mdNm: AMS
                                                                  objectType: OT_MANAGED_ELEMENT
                                                                  aliasValue: MGMT Security
                                      Access:INGJJMGRJMTSNB0001AG2OLT001:IP10.70.6.6.T0.S841 "
                  Original Severity = Major
                Original Event Time = Fri 11 Oct 2013 05:40:21

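I want to create a CSV file from this text file in which the column headers are the names that appear before the = on each line (Identifier, State, Problem Status, and so on), and each following row holds the corresponding values under those headers (1130, Terminated, Closed, and so on).

Another complication is that some fields contain line breaks, for example Additional Text. I want the entire value of Additional Text to end up in a single column under the "Additional Text" header.

As I am new to Linux/Unix, I have not been able to find a way to do this. What is the best way to do it?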

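Answer 1

Well, if every record always has the same number of fields and there is nothing between records (assumptions I am making from your post, which may or may not be correct), then you could go the awk route. This preserves column order and embedded newlines. Assume the following is in parse.awk: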

BEGIN {
    RS       = "( = |\n\\s+)";  # \s is a GNU awk (gawk) extension
    isHeader = 0;
    Sep      = "\",\"";
    Q        = "\"";
    # WinEOL   = "\r"; # enable this if your CSV will be used on Windows
    Headers  = Fields = Q;
}

function sanitise (Entry) {
    gsub(/(^[ "]*|[" \n]*$)/, "", Entry); # Trim leading/trailing double quotes and white space
    gsub(/"/, "\"\"", Entry); # Escape double quotes
    return Entry;
}

function addField (Field) {
    Fields    = Fields FieldsSep sanitise(Field);
    isHeader  = 1;
    FieldsSep = Sep;
    FieldCounter++
}

function addHeader (Header) {
    Headers = Headers HeadersSep sanitise(Header);
    isHeader = 0;
    HeadersSep = Sep;
}

1 == NR {                   # Special case of first header
    addHeader($0);
    next;
}

$0 == "\"" {                # Fields with newlines
    LongField    = $0;
    LongFieldSep = "";
    while (getline > 0) {
        LongField    = LongField LongFieldSep $0;
        LongFieldSep = "\n";
        if ($NF ~ /"$/) {
            addField(LongField);
            next;
        }
    }
}
{
    if (isHeader) {
        addHeader($0);
    }
    else {
        addField($0);
    }

    if (FieldsPerRecord == FieldCounter) {
        if (!HeadersPrinted) {
            print Headers Q WinEOL;
            HeadersPrinted = 1
        }
        print Fields Q WinEOL;
        Fields = FieldsSep = "";
        FieldCounter = 0
    }
}

You can then call it with FieldsPerRecord set on the command line:

$ awk -v FieldsPerRecord=26 -f parse.awk data.csv

This produces the following CSV-encoded data, which LibreOffice Calc seems to accept without complaint:

"Identifier","State","Problem Status","Clearance Report Flag","Escalated Alarm","Close User Identifier","Termination User Identifier","Close Time Stamp","Termination Time Stamp","Creation Timestamp","Clearance Time Stamp","Last Modification Timestamp","Previous State","Managed Object","Target Entities","Alarm Type","Event Time","Probable Cause","Specific Problems","Notification Identifier","Domain","Alarm Origin","Perceived Severity","Additional Text","Original Severity","Original Event Time"
"1130","Terminated","Closed","True","False","Auto-Clear","Auto-Clear","Fri 18 Oct 2013 05:01:46","Fri 18 Oct 2013 05:01:46","Fri 18 Oct 2013 04:37:29","Fri 18 Oct 2013 05:01:40","Fri 18 Oct 2013 05:01:46","Outstanding","Alcatel_5529OAD VMD1HTE1A71_ns:.OAD2 MD ""AMS"" Node ""INGJJMGRJMTSNB0001AG2OLT001","{ Alcatel_5529OAD VMD1HTE1A71_ns:.OAD2 MD ""AMS"" Node ""INGJJMGRJMTSNB0001AG2OLT001"" }","CommunicationsAlarm","Fri 18 Oct 2013 05:01:40","Unknown","{  }","160315","Domain VMD1HTE1A71_ns:.dm.GJ_OAD2","IncomingAlarm","Major","nativeProbableCause: Attempt Threshold Crossed
osTime: 20131018163727.250+0530
neTime: 20131011174021.0+0530
notificationId: AMS:160315
portNumber:
ftpNumber:
meNm: INGJJMGRJMTSNB0001AG2OLT001
mdNm: AMS
objectType: OT_MANAGED_ELEMENT
aliasValue: MGMT Security
Access:INGJJMGRJMTSNB0001AG2OLT001:IP10.70.6.6.T0.S841","Major","Fri 11 Oct 2013 05:40:21"

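Note that I have taken the approach of quoting everything. For me at least, this produces fewer surprises on import, but you can disable it by setting Q = "" and Sep = "," and removing or commenting out the two gsub() lines in sanitise().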
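The quote-everything rule is small enough to show in isolation. The following is only a sketch in Perl (csv_quote and csv_row are hypothetical helper names, not part of parse.awk). Unlike sanitise(), it trims nothing; it only doubles embedded double quotes and wraps each field:

```perl
use strict;
use warnings;

# Hypothetical helper (not part of parse.awk): quote one CSV field
# unconditionally, doubling any embedded double quotes.
sub csv_quote {
    my ($field) = @_;
    $field =~ s/"/""/g;
    return qq{"$field"};
}

# Join a list of raw values into one fully quoted CSV row.
sub csv_row {
    return join ',', map { csv_quote($_) } @_;
}

print csv_row('1130', 'Terminated', 'MD "AMS" Node'), "\n";
```

Quoting every field costs a few bytes, but it means commas, quotes and line breaks inside a value never need special-casing.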

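However, I am not convinced this is really a regular-expression problem. The data is fixed-width, so Perl's unpack looks like it might be the best approach. I have never managed to get my head around it, but this could be a good opportunity for me to learn, if anyone wants to show a way of doing this with unpack.

Update

I am no Perl Hacker™, but the following seems to work well. It makes no assumptions about the contents of multi-line fields, preserves field order and all the original spacing inside fields (though it strips leading whitespace from the headers), and looks pretty, for Perl, to my untrained eye: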

BEGIN{
    our (@headers, @fields);
    our $headers_printed = 0;
}
my ($header, $field) = unpack("A36x2A*", $_); # magic!

if ("" eq $header) {            # Fields with newlines
    $fields[$#fields] .= "\n" . $field;
    next;
}

push(@headers, $header =~ s/^\s*//gr);
push(@fields, $field);

if (26 == $#headers + 1) {      # Print complete record
    printf "%s\n", join ",", @headers  unless $headers_printed;
    $headers_printed = 1;
    printf "%s\n", join ",", @fields;
    @fields = @headers = ();
}

Simply invoke it with:

$ perl -nf /tmp/parse.pl /tmp/data.txt
Identifier,State,Problem Status,Clearance Report Flag,Escalated Alarm,Close User Identifier,Termination User Identifier,Close Time Stamp,Termination Time Stamp,Creation Timestamp,Clearance Time Stamp,Last Modification Timestamp,Previous State,Managed Object,Target Entities,Alarm Type,Event Time,Probable Cause,Specific Problems,Notification Identifier,Domain,Alarm Origin,Perceived Severity,Additional Text,Original Severity,Original Event Time
1130,Terminated,Closed,True,False,"Auto-Clear","Auto-Clear",Fri 18 Oct 2013 05:01:46,Fri 18 Oct 2013 05:01:46,Fri 18 Oct 2013 04:37:29,Fri 18 Oct 2013 05:01:40,Fri 18 Oct 2013 05:01:46,Outstanding,Alcatel_5529OAD VMD1HTE1A71_ns:.OAD2 MD "AMS" Node "INGJJMGRJMTSNB0001AG2OLT001",{ Alcatel_5529OAD VMD1HTE1A71_ns:.OAD2 MD "AMS" Node "INGJJMGRJMTSNB0001AG2OLT001" },CommunicationsAlarm,Fri 18 Oct 2013 05:01:40,Unknown,{  },160315,Domain VMD1HTE1A71_ns:.dm.GJ_OAD2,IncomingAlarm,Major,"
                            nativeProbableCause: Attempt Threshold Crossed
                            osTime: 20131018163727.250+0530
                            neTime: 20131011174021.0+0530
                            notificationId: AMS:160315
                            portNumber:
                            ftpNumber:
                            meNm: INGJJMGRJMTSNB0001AG2OLT001
                            mdNm: AMS
                            objectType: OT_MANAGED_ELEMENT
                            aliasValue: MGMT Security
Access:INGJJMGRJMTSNB0001AG2OLT001:IP10.70.6.6.T0.S841 ",Major,Fri 11 Oct 2013 05:40:21

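It would probably be better to use Text::CSV, but I was more interested in understanding how unpack works. For fixed-width data it seems far more readable and sane than regular expressions.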
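To make the template concrete, here is a minimal sketch that applies the same "A36x2A*" template to two hand-built lines. split_line is a hypothetical helper, and the sample lines are constructed with sprintf on the assumption that headers are right-aligned in 35 columns, as they appear to be in the question's data:

```perl
use strict;
use warnings;

# Hypothetical helper: split one fixed-width line into (header, value)
# using the template from the script above: a 36-column space-padded
# field, two skipped positions for "= ", then everything that remains.
sub split_line {
    my ($line) = @_;
    my ($header, $value) = unpack('A36x2A*', $line);
    $header =~ s/^\s+//;          # headers are right-aligned
    return ($header, $value);
}

# Sample lines with explicit column widths; the 35-character header
# width is an assumption chosen to match the template.
my $kv   = sprintf('%35s = %s', 'Problem Status', 'Closed');
my $cont = (' ' x 38) . 'osTime: 20131018163727.250+0530';

my ($h1, $v1) = split_line($kv);
my ($h2, $v2) = split_line($cont);

print "[$h1] [$v1]\n";    # key-value line
print "[$h2] [$v2]\n";    # continuation line: the header comes back empty
```

A36 reads 36 characters and drops trailing spaces, x2 steps over the "= " separator, and A* takes whatever is left, so a continuation line yields an empty header, which is the condition the script tests for.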

Answer 2

Alternatively, you can use Perl's regular-expression subroutines (named subpatterns):

my $grammar = qr!
    (?(DEFINE)
       (?<Identifier> [^=\n]+ )
       (?<Statement>
           (?: # Begin alternation
               " #Opening quotes
               [^"]+? # Any non-quotes (including a new line)
               " # Closing quotes
              | [^\n]+ # Or a single line
           )   # End alternation
        )   

   )

!x;

my $file = do { local $/; <> }; #Slurp file named on command line
my %columns;
while( $file =~ 
   m{ ((?&Identifier))[\t ]*=[ \t]*((?&Statement)) $grammar}xgc )
{ 
   my ($header,$value) = ($1,$2);

       # Remove leading spaces and quote variable if it contains commas:
   for($header,$value) { s/^\s+//mg; /,/ and s/^|$/"/g }

       # Substitute \n with \\n to make multi-line values single-line:
   for($value) { chomp; s/\n/\\n/g }

   $columns{$header}=$value
}

print join "," => sort keys %columns; # Print column headers
print "\n";
print join "," => map { $columns{$_} } sort keys %columns; # Column content
print "\n";

Invoke it like this:

[user@host]$ /path/to/script.pl /path/to/file.txt

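It prints the table in CSV format to standard output.

This assumes that multi-line statements contain no double quotes (") other than the opening and closing ones.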
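The reason for that assumption is easier to see in a reduced form of the grammar. The sketch below is not the script above: parse_pairs is a hypothetical helper and the sample text is made up. It shows how a quoted value is matched up to the first following double quote, so an embedded quote would end the value early:

```perl
use strict;
use warnings;

# Named subpatterns are declared once inside (?(DEFINE)...) and
# then called with (?&Name).
my $grammar = qr{
    (?(DEFINE)
        (?<Identifier> [^=\n]* [^=\s] )
        (?<Statement>  " [^"]* " | [^\n]+ )
    )
}x;

# Hypothetical helper: return a list of [key, value] pairs found in $text.
sub parse_pairs {
    my ($text) = @_;
    my @pairs;
    while ($text =~ m{ ^ [ \t]* ((?&Identifier)) [ \t]* = [ \t]* ((?&Statement)) $grammar }xmg) {
        push @pairs, [$1, $2];
    }
    return @pairs;
}

my $sample = qq{   State = Terminated\n    Text = "line one\n   line two"\n};
for my $pair (parse_pairs($sample)) {
    my ($key, $value) = @$pair;
    $value =~ s/\n\s*/\\n/g;       # flatten embedded line breaks
    print "$key => $value\n";
}
```

The quoted alternative is tried first and may span several lines; anything that does not start with a double quote falls through to the single-line alternative.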

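Answer 3

Well, this isn't pretty, but it does what you asked. I wrote a script in Perl that takes the file above, parses it, and then uses the Text::CSV module to convert it to CSV format.

Script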

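#!/usr/bin/env perl

use strict;
use warnings;
use Text::CSV;

# Read the whole input file into memory.
open(my $fh, '<', 'data.txt') or die "Cannot open data.txt: $!";
my @lines = <$fh>;
close($fh);

my (%csv, $name, $val);

foreach my $line (@lines) {
  if ($line =~ m/=/) {
    # "Key = value" line: strip the indentation and split on the first " = ".
    chomp($line);
    $line =~ s/^\s+//;
    ($name, $val) = split(/ = /, $line, 2);
    $val = '' unless defined $val;
    $val =~ s/^"$//;            # a lone quote opens a multi-line value
    $csv{$name} = $val;
  } else {
    # Continuation line: append it to the current value, writing the line
    # break as a literal \n. Lines before the first key are skipped.
    next unless defined $name;
    $line =~ s/^\s+//;
    $line =~ s/\s+$/\\n/;
    $line =~ s/ "\\n$//;        # drop the closing quote of the value
    $csv{$name} .= $line;
  }
}

# Emit the keys in sorted order, and the values in the same order.
my @names = sort keys %csv;
my @vals  = map { $csv{$_} } @names;

my $ccsv = Text::CSV->new();
$ccsv->combine(@names) or die "Cannot build the header row";
print $ccsv->string(), "\n";
$ccsv->combine(@vals) or die "Cannot build the data row";
print $ccsv->string(), "\n";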

Example

Just run it like this:

$ ./csv.pl
"Additional Text","Alarm Origin","Alarm Type","Clearance Time Stamp","Close Time Stamp","Creation Timestamp",Domain,"Event Time","Last Modification Timestamp","Managed Object","Notification Identifier","Original Event Time","Original Severity","Perceived Severity","Previous State","Probable Cause","Specific Problems","Target Entities","Termination Time Stamp"
"nativeProbableCause: Attempt Threshold Crossed\nosTime: 20131018163727.250+0530\nneTime: 20131011174021.0+0530\nnotificationId: AMS:160315\nportNumber:\nftpNumber:\nmeNm: INGJJMGRJMTSNB0001AG2OLT001\nmdNm: AMS\nobjectType: OT_MANAGED_ELEMENT\naliasValue: MGMT Security\nAccess:INGJJMGRJMTSNB0001AG2OLT001:IP10.70.6.6.T0.S841",IncomingAlarm,CommunicationsAlarm,"Fri 18 Oct 2013 05:01:40","Fri 18 Oct 2013 05:01:46","Fri 18 Oct 2013 04:37:29","Domain VMD1HTE1A71_ns:.dm.GJ_OAD2","Fri 18 Oct 2013 05:01:40","Fri 18 Oct 2013 05:01:46","Alcatel_5529OAD VMD1HTE1A71_ns:.OAD2 MD ""AMS"" Node ""INGJJMGRJMTSNB0001AG2OLT001""",160315,"Fri 11 Oct 2013 05:40:21",Major,Major,Outstanding,Unknown,"{  }","{ Alcatel_5529OAD VMD1HTE1A71_ns:.OAD2 MD ""AMS"" Node ""INGJJMGRJMTSNB0001AG2OLT001"" }","Fri 18 Oct 2013 05:01:46"

Let me know what you think, or if you have any trouble running it. If it does what you need, I will fill in the details of how it works.
