Merge columns in a file based on the column header

Question 1

If your input is tab-separated:

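awk -F'\t' '
# Header line: remember the header name of every column.
NR == 1 { NCOL = NF
          for (i = 1; i <= NF; i++) COL[i] = $i
        }
# Every line (header included): store non-empty fields under (line number, header name),
# so an empty duplicate column cannot overwrite a value found earlier on the line.
        { for (i = 1; i <= NF; i++) if ($i != "") OUT[NR, COL[i]] = $i
        }
# Print the lines in their original order, emitting each header name only once.
END     { for (n = 1; n <= NR; n++) {
              split("", DUP)                  # clear the per-line duplicate tracker
              for (i = 1; i <= NCOL; i++)
                  if (!DUP[COL[i]]++) printf "%s" FS, OUT[n, COL[i]]
              printf "%s", ORS
          }
        }
' file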
A   B   C   
1   5   4   
3   1   2   
2   2   1   
1       3   
3       2   
1       4

It saves the column headers for later use as partial indices and, for each line, collects the values in an array indexed by the line number and the header name. In the END section, it prints that array in the original line order, taking care of duplicate column headers.

Handling duplicates can become a considerable effort for more complex file structures.
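
The duplicate check in the END block relies on a common awk idiom: !DUP[key]++ is true only the first time a key is seen. Here is a minimal sketch of that idiom in isolation, applied to a single header line. The function name unique_headers and the sample header are made up for illustration and are not part of the answer.

```shell
# unique_headers: hypothetical helper, not part of the answer above.
# Reads tab-separated lines and prints each line's fields with repeats removed,
# keeping the order of first appearance.
unique_headers() {
    awk -F'\t' '{
        split("", seen)              # empty the array for every line
        out = ""
        n = 0
        for (i = 1; i <= NF; i++)
            if (!seen[$i]++)         # true only the first time $i occurs
                out = (n++ ? out FS : "") $i
        print out
    }'
}

# Made-up header with A and C repeated.
printf 'A\tB\tC\tA\tC\n' | unique_headers
```

For the sample header this prints A, B and C separated by tabs.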

Question 2

For tab-separated input.

Read the header names and their column numbers into an array, in the order in which they appear in the input file; then split the input file by column, writing each field to a file named headerName.txt, so that columns sharing a header end up in the same file. Finally, paste the files together; the column command is used to prettify the output.

awk -F'\t' '
    ## Header line: store every header name in the `h` array, keyed by
    ## column number. For example, if header "A" appears in columns 1 and 4,
    ## then h[1] and h[4] are both "A".
    NR==1{ while (++i<=NF) h[i]=$i; next; }

         { for (i=1; i<=NF; i++) {

    ## Append each non-empty field to the file named after its column header.
    ## For example, a field in column 1 (header "A") is written to "A.txt".
    ## The parentheses around the file name keep the redirection portable
    ## across awk implementations.
               if ($i!=""){ print $i > (h[i] ".txt") };
         }; }

    ## At the end, paste all the files together and prettify the output with
    ## `column`. The number of .txt files equals the number of unique headers.
END{ system("paste *.txt |column \011 -tn") }' infile

The same command without comments:

awk -F'\t' '
    NR==1{ while (++i<=NF) h[i]=$i; next; }
         { for (i=1; i<=NF; i++) {
               if ($i!=""){ print $i > (h[i] ".txt") };
         }; }
END{ system("paste *.txt |column \011 -tn") }' infile
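
One caveat of the command above is that paste *.txt also picks up any unrelated .txt files in the current directory, and files left over from an earlier run can be overwritten. The sketch below is one possible way around that: it writes the per-header files into a scratch directory created with mktemp -d and removes the directory afterwards. The function name merge_by_files and the sample data are hypothetical. It assumes header names are valid file names, and the columns come out in the shell's glob order, which is not necessarily header order.

```shell
# merge_by_files: hypothetical wrapper around the split-and-paste idea.
# Usage: merge_by_files INPUT   (INPUT is a tab-separated file with a header line)
merge_by_files() {
    input=$1
    tmpdir=$(mktemp -d) || return 1
    awk -F'\t' -v dir="$tmpdir" '
        NR == 1 { for (i = 1; i <= NF; i++) h[i] = $i; next }
        # Append every non-empty field to <dir>/<header>.txt
        { for (i = 1; i <= NF; i++)
              if ($i != "") print $i > (dir "/" h[i] ".txt") }
    ' "$input"
    # Paste only the files created above, one output column per unique header.
    (cd "$tmpdir" && paste ./*.txt)
    rm -rf "$tmpdir"
}

# Example run on a made-up sample in which header A appears twice.
sample=$(mktemp)
printf 'A\tB\tA\n1\t5\t\n\t1\t3\n' > "$sample"
merge_by_files "$sample"
rm -f "$sample"
```

The output can still be piped through column for alignment, as in the original command.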

Question 3

A slightly different approach that does not require buffering the whole file:

AWK script colmerge.awk:

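# Header line: remember each column's header and collect the unique headers
# in order of first appearance, then print the unique headers once.
FNR==1{
    for (i=1; i<=NF; i++)
    {
    hdr[i]=$i;
    if (map[$i]==0) {map[$i]=i; uniq_hdr[++u]=$i;}
    }
    for (i=1; i<=u; i++)
    {
    printf("%s",uniq_hdr[i]);
    if (i==u) printf("%s",ORS); else printf("%s",OFS);
    }
}

# Data lines: map each header to its non-empty value on this line,
# then print the values in unique-header order.
FNR>1{
    delete linemap;
    for (i=1; i<=NF; i++) if ($i!="") linemap[hdr[i]]=$i;
    for (i=1; i<=u; i++)
    {
    printf("%s",linemap[uniq_hdr[i]]);
    if (i==u) printf("%s",ORS); else printf("%s",OFS);
    }
}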

Use it as

awk -F'\t' -v OFS='\t' -f colmerge.awk file

This collects all headers and identifies the "unique" headers and their first occurrence on line 1; for each subsequent line it builds a map from headers to non-empty values, which is then printed in the order of the "unique" headers as determined while processing the first line.

However, this only works if your input file is tab-separated, because that is the only way to detect "empty" fields reliably: with the default whitespace splitting, awk treats a run of blanks as a single separator, so empty fields disappear.
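
To illustrate why the tab separator matters, here is a small check on a made-up line with an empty middle field. The variable names are arbitrary.

```shell
# A made-up line with three tab-separated fields, the middle one empty.
line=$(printf '1\t\t3')

# Default splitting collapses the two adjacent tabs into one separator.
default_nf=$(printf '%s\n' "$line" | awk '{ print NF }')

# With -F'\t' every tab delimits a field, so the empty field is kept.
tab_nf=$(printf '%s\n' "$line" | awk -F'\t' '{ print NF }')

echo "default splitting: $default_nf fields, tab splitting: $tab_nf fields"
```

With default splitting awk sees 2 fields, while with -F'\t' it sees 3, so only the second form can tell which column a value belongs to.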

Note also that the delete statement applied to the whole array linemap may not be supported by all awk implementations (it should work in gawk, mawk and nawk, however).
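
If whole-array delete is a concern, split("", linemap) is a widely used portable way to empty an array. The following is a sketch of the same streaming idea with that substitution, wrapped in a shell function. The function name colmerge_portable, the first array and the sample data are assumptions made for illustration, not part of the original script.

```shell
# colmerge_portable: hypothetical variant of colmerge.awk that clears the
# per-line map with split("", linemap) instead of "delete linemap".
# Reads tab-separated input from the named files, or from stdin if none.
colmerge_portable() {
    awk -F'\t' -v OFS='\t' '
    FNR == 1 {
        for (i = 1; i <= NF; i++) {
            hdr[i] = $i
            # Record each header name once, in order of first appearance.
            if (!($i in first)) { first[$i] = i; uniq_hdr[++u] = $i }
        }
        for (j = 1; j <= u; j++)
            printf "%s%s", uniq_hdr[j], (j < u ? OFS : ORS)
        next
    }
    {
        split("", linemap)               # portable way to empty the array
        for (i = 1; i <= NF; i++)
            if ($i != "") linemap[hdr[i]] = $i
        for (j = 1; j <= u; j++)
            printf "%s%s", linemap[uniq_hdr[j]], (j < u ? OFS : ORS)
    }' "$@"
}

# Example run on a made-up sample in which header A appears twice.
printf 'A\tB\tA\n1\t5\t\n\t1\t3\n' | colmerge_portable
```

For this sample the result has two columns, A and B, with the rows 1, 5 and 3, 1.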
