SQL operations on a CSV file using bash or shell

Question 1

Using csvkit:

$ csvsql -H --query 'SELECT a,min(b),max(c),d FROM file GROUP BY a' file.csv
a,min(b),max(c),d
164318,1449,1457,1922
841422,1221,1228,1860
842179,2115,2118,1485
846354,1512,1513,1590

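This loads the CSV data into a temporary database (SQLite by default, I believe) and then applies the given SQL query to it. By default the table has the same name as the input file (without the suffix), and because the data has no column headers, the default field names are alphabetical (a, b, c, d).

The -H option tells csvsql that the data has no column headers.

To drop the generated header from the output, pipe the result through sed '1d'.

To get zero-padded integers: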
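As a quick illustration of the header removal, here is a minimal sketch. The `csvsql_output` function is a hypothetical stand-in that only replays two lines of the sample output shown above; in real use the csvsql command itself would sit on the left of the pipe.

```shell
# Hypothetical stand-in for the csvsql command: replays the generated
# header plus two of the result rows shown above.
csvsql_output() {
    cat <<'EOF'
a,min(b),max(c),d
164318,1449,1457,1922
841422,1221,1228,1860
EOF
}

# sed '1d' deletes the first line (the generated header)
# and passes every other line through unchanged.
body=$(csvsql_output | sed '1d')
printf '%s\n' "$body"
```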

$ csvsql -H --query 'SELECT printf("%07d,%06d,%06d,%06d",a,min(b),max(c),d) FROM file GROUP BY a' file.csv
"printf(""%07d,%06d,%06d,%06d"",a,min(b),max(c),d)"
"0164318,001449,001457,001922"
"0841422,001221,001228,001860"
"0842179,002115,002118,001485"
"0846354,001512,001513,001590"

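Here each row is quoted because we are actually requesting only a single output field per result record, and that field contains commas. Another way to do it, which takes a bit more typing but does not produce the extra double quotes: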

$ csvsql -H --query 'SELECT printf("%07d",a),printf("%06d",min(b)),printf("%06d",max(c)),printf("%06d",d) FROM file GROUP BY a' file.csv
"printf(""%07d"",a)","printf(""%06d"",min(b))","printf(""%06d"",max(c))","printf(""%06d"",d)"
0164318,001449,001457,001922
0841422,001221,001228,001860
0842179,002115,002118,001485
0846354,001512,001513,001590

Again, pipe the result through sed '1d' to drop the header.

Question 2

Using csvkit:

csvsql -H --query "select a,min(b),max(c),d from file group by a,d" file.csv

Note that this strips the leading zeros.

Output:

a,min(b),max(c),d
164318,1449,1457,1922
841422,1221,1228,1860
842179,2115,2118,1485
846354,1512,1513,1590
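
If the zero padding matters, the stripped output can be re-padded afterwards. This is only a sketch: the widths 7, 6, 6, 6 are an assumption read off the padded sample output elsewhere on this page, and the `stripped` variable simply replays two rows of the output above instead of calling csvsql.

```shell
# Two rows of the query output above, leading zeros already stripped.
stripped='a,min(b),max(c),d
164318,1449,1457,1922
841422,1221,1228,1860'

# Skip the header (NR > 1) and re-pad each field to a fixed width.
# The widths 7,6,6,6 are assumed, not taken from the original input.
padded=$(printf '%s\n' "$stripped" |
    awk -F, 'NR > 1 { printf "%07d,%06d,%06d,%06d\n", $1, $2, $3, $4 }')
printf '%s\n' "$padded"
```

Because awk drops the header here as well, no separate sed '1d' step is needed in this variant.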

Question 3

Using Miller (http://johnkerl.org/miller/doc):

mlr --ocsv --quote-all --inidx --ifs , cat inputFile | \
mlr --ocsv --quote-none  --icsvlite stats1 -g '"1"' -a min,max,min -f '"2","3","4"' \
then cut -f '"1","2"_min,"3"_max,"4"_min' \
then label id,col2,col3,col4 | sed 's/"//g'

This gives you:

id,col2,col3,col4
0164318,001449,001457,001922
0842179,002115,002118,001485
0846354,001512,001513,001590
0841422,001221,001228,001860

Question 4

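You can break the SQL down into basic procedural operations and replicate them in a shell script.

Of course, this is not a good idea: one of the advantages of declarative languages (such as SQL) is that they hide the verbosity and complexity of procedural implementations from the developer, who can then focus on the data. (Optimization is a second big advantage of declarative languages, and it is lost when you replicate them with a procedural program.)
This approach is also problematic because processing text in a shell loop is generally considered bad practice.

However, here is an example of a shell script that relies on standard utilities preinstalled on many systems (except for the array construct, which is not specified by POSIX but is widely available, and certainly available to you since you asked about bash):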

#!/bin/bash

# The input file will be passed as the first argument
file="$1"

# For each distinct key:
# We take only the values of the first field, sort them, remove duplicates
for i in $(cut -d ',' -f 1 "$file" | sort -n -u); do

    # Resetting the array is not really needed; we do it for safety
    out=()

    # The first field of the output row is the key of the loop
    out[0]="$i"

    # We only consider the rows whose first field is equal
    # to the current key (grep) and...
    # The trailing ',' in the pattern anchors the match to the whole
    # first field, so a key never matches a longer key it is a prefix of.

    # ... we sort the values of the second field
    # in ascending order and take only the first one
    out[1]="$(grep "^${out[0]}," "$file" | cut -d ',' -f 2 | sort -n | head -n 1)"

    # ... we sort the values of the third field in
    # ascending order and take only the last one
    out[2]="$(grep "^${out[0]}," "$file" | cut -d ',' -f 3 | sort -n | tail -n 1)"

    # ... we sort the values of the fourth field in
    # ascending order and take only the first one
    out[3]="$(grep "^${out[0]}," "$file" | cut -d ',' -f 4 | sort -n | head -n 1)"

    # Finally we print out the output, separating fields with ','
    printf '%s,%s,%s,%s\n' "${out[@]}"

done

It is meant to be invoked as:

./script file

This script is equivalent to:

SELECT col1, MIN(col2), MAX(col3), MIN(col4)
FROM text
GROUP BY col1
ORDER BY col1
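
Since looping over text in the shell is discouraged, the same query can also be sketched as a single awk pass. This is not part of the script above, just one possible alternative; the function name `group_min_max` is made up for illustration, and it assumes a headerless, comma-separated file with four numeric fields and no quoted commas.

```shell
# Hypothetical helper: prints col1, MIN(col2), MAX(col3), MIN(col4) per col1.
# $1 is the path of a headerless CSV file with four numeric fields.
group_min_max() {
    awk -F, '
        # First row for a key: initialise all three aggregates.
        !($1 in min2) { min2[$1] = $2; max3[$1] = $3; min4[$1] = $4; next }
        # Later rows: compare numerically (+0) but store the original text,
        # so any zero padding in the input is preserved.
        {
            if ($2 + 0 < min2[$1] + 0) min2[$1] = $2
            if ($3 + 0 > max3[$1] + 0) max3[$1] = $3
            if ($4 + 0 < min4[$1] + 0) min4[$1] = $4
        }
        END { for (k in min2) printf "%s,%s,%s,%s\n", k, min2[k], max3[k], min4[k] }
    ' "$1" | sort -t , -k 1,1n   # ORDER BY col1 (awk array order is unspecified)
}
```

It would be called as group_min_max file. The file is read once, instead of three times per key as in the loop version.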

Answer