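I want to convert a PDF with some colored text and images into another PDF that is black and white only, to reduce its size. I also want the text to stay text, without page elements being converted to pictures. I tried the following command: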
convert -density 150 -threshold 50% input.pdf output.pdf
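I found it in another question, but it does something I don't want: the text in the output is turned into poor-quality raster images and is no longer selectable. I also tried Ghostscript: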
gs -sOutputFile=output.pdf \
-q -dNOPAUSE -dBATCH -dSAFER \
-sDEVICE=pdfwrite \
-dCompatibilityLevel=1.3 \
-dPDFSETTINGS=/screen \
-dEmbedAllFonts=true \
-dSubsetFonts=true \
-sColorConversionStrategy=/Mono \
-sColorConversionStrategyForImages=/Mono \
-sProcessColorModel=/DeviceGray \
$1
but it gives me the following error message:
./script.sh: 19: ./script.sh: output.pdf: not found
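Is there another way to create this file?
Answer 1
gs example
The command you ran above has a trailing $1, which is normally used to pass command-line arguments into a script. So I'm not sure what you actually tried, but I'm guessing you put that command into a script, script.sh: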
#!/bin/bash
gs -sOutputFile=output.pdf \
-q -dNOPAUSE -dBATCH -dSAFER \
-sDEVICE=pdfwrite \
-dCompatibilityLevel=1.3 \
-dPDFSETTINGS=/screen \
-dEmbedAllFonts=true \
-dSubsetFonts=true \
-sColorConversionStrategy=/Mono \
-sColorConversionStrategyForImages=/Mono \
-sProcessColorModel=/DeviceGray \
$1
and ran it like this:
$ ./script.sh: 19: ./script.sh: output.pdf: not found
I'm not sure how you set this script up, but it needs to be executable.
$ chmod +x script.sh
Something is definitely wrong with this script, though. When I tried it, I got this error instead:
Unrecoverable error: rangecheck in .putdeviceprops
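One thing that might be worth checking (this is a guess on my part, nothing in the question confirms it): a message like "output.pdf: not found" is what a shell prints when it tries to run output.pdf as a command, and that can happen when a line continuation is broken, for example by a space or a Windows line ending after a trailing backslash. A minimal sketch for spotting such lines; find_broken_continuations is a name made up for illustration:

```bash
#!/bin/bash
# Hypothetical helper: list lines where a backslash is followed by
# trailing whitespace (spaces, tabs or a carriage return), which stops
# the backslash from continuing the command onto the next line.
find_broken_continuations() {
    grep -n '\\[[:space:]][[:space:]]*$' -- "$1"
}
```

Running find_broken_continuations script.sh prints the offending lines with their line numbers, and prints nothing if every continuation is clean.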
An alternative
Instead of that script, I'd use this one from the SU question.
#!/bin/bash
gs \
-sOutputFile=output.pdf \
-sDEVICE=pdfwrite \
-sColorConversionStrategy=Gray \
-dProcessColorModel=/DeviceGray \
-dCompatibilityLevel=1.4 \
-dNOPAUSE \
-dBATCH \
"$1"
Then run it like this:
$ ./script.bash LeaseContract.pdf
GPL Ghostscript 8.71 (2010-02-10)
Copyright (C) 2010 Artifex Software, Inc. All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
Processing pages 1 through 2.
Page 1
Page 2
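For what it's worth, the same gs call can be wrapped in a function that checks its argument before doing anything, so a missing file name fails with a clear message instead of a Ghostscript error. This is only a sketch: the function name to_gray and the optional second argument are my own additions, and the gs options are exactly the ones from the script above.

```bash
#!/bin/bash
# Sketch only: to_gray and its optional output argument are illustrative
# names, not part of the original script.
to_gray() {
    if [ "$#" -lt 1 ]; then
        echo "usage: to_gray input.pdf [output.pdf]" >&2
        return 2
    fi
    local input=$1
    local output=${2:-output.pdf}   # same default name as the script above
    if [ ! -r "$input" ]; then
        echo "to_gray: cannot read '$input'" >&2
        return 1
    fi
    gs \
        -sOutputFile="$output" \
        -sDEVICE=pdfwrite \
        -sColorConversionStrategy=Gray \
        -dProcessColorModel=/DeviceGray \
        -dCompatibilityLevel=1.4 \
        -dNOPAUSE \
        -dBATCH \
        "$input"
}
```

After sourcing the file, to_gray LeaseContract.pdf LeaseContract-gray.pdf would write the grayscale copy under the second name.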
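Answer 2
I found a script here that can do this. It requires gs, which you seem to have, but also pdftk. You didn't mention your distribution, but on Debian-based systems you should be able to install it with: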
sudo apt-get install pdftk
You can find RPMs for it here.
Once pdftk is installed, save the script as graypdf.sh and run it like this:
./graypdf.sh input.pdf
It will create a file named input-gray.pdf. I am including the whole script here to avoid link rot:
#!/bin/bash
# convert pdf to grayscale, preserving metadata
# "AFAIK graphicx has no feature for manipulating colorspaces. " http://groups.google.com/group/latexusersgroup/browse_thread/thread/5ebbc3ff9978af05
# "> Is there an easy (or just standard) way with pdflatex to do a > conversion from color to grayscale when a PDF file is generated? No." ... "If you want to convert a multipage document then you better have pdftops from the xpdf suite installed because Ghostscript's pdf to ps doesn't produce nice Postscript." http://osdir.com/ml/tex.pdftex/2008-05/msg00006.html
# "Converting a color EPS to grayscale" - http://en.wikibooks.org/wiki/LaTeX/Importing_Graphics
# "\usepackage[monochrome]{color} .. I don't know of a neat automatic conversion to monochrome (there might be such a thing) although there was something in Tugboat a while back about mapping colors on the fly. I would probably make monochrome versions of the pictures, and name them consistently. Then conditionally load each one" http://newsgroups.derkeiler.com/Archive/Comp/comp.text.tex/2005-08/msg01864.html
# "Here comes optional.sty. By adding \usepackage{optional} ... \opt{color}{\includegraphics[width=0.4\textwidth]{intro/benzoCompounds_color}} \opt{grayscale}{\includegraphics[width=0.4\textwidth]{intro/benzoCompounds}} " - http://chem-bla-ics.blogspot.com/2008/01/my-phd-thesis-in-color-and-grayscale.html
# with gs:
# http://handyfloss.net/2008.09/making-a-pdf-grayscale-with-ghostscript/
# note - this strips metadata! so:
# http://etutorials.org/Linux+systems/pdf+hacks/Chapter+5.+Manipulating+PDF+Files/Hack+64+Get+and+Set+PDF+Metadata/
COLORFILENAME=$1
OVERWRITE=$2
FNAME=${COLORFILENAME%.pdf}
# NOTE: pdftk does not work with logical page numbers / pagination;
# gs kills it as well;
# so check for existence of 'pdfmarks' file in calling dir;
# if there, use it to correct gs logical pagination
# for example, see
# http://askubuntu.com/questions/32048/renumber-pages-of-a-pdf/65894#65894
PDFMARKS=
if [ -e pdfmarks ] ; then
  PDFMARKS="pdfmarks"
  echo "$PDFMARKS exists, using..."
  # convert to gray pdf - this strips metadata!
  gs -sOutputFile="$FNAME-gs-gray.pdf" -sDEVICE=pdfwrite \
    -sColorConversionStrategy=Gray -dProcessColorModel=/DeviceGray \
    -dCompatibilityLevel=1.4 -dNOPAUSE -dBATCH "$COLORFILENAME" "$PDFMARKS"
else # not really needed ?!
  gs -sOutputFile="$FNAME-gs-gray.pdf" -sDEVICE=pdfwrite \
    -sColorConversionStrategy=Gray -dProcessColorModel=/DeviceGray \
    -dCompatibilityLevel=1.4 -dNOPAUSE -dBATCH "$COLORFILENAME"
fi
# dump metadata from original color pdf
## pdftk $COLORFILENAME dump_data output $FNAME.data.txt
# also: pdfinfo -meta $COLORFILENAME
# grep to avoid BookmarkTitle/Level/PageNumber:
pdftk "$COLORFILENAME" dump_data output | grep 'Info\|Pdf' > "$FNAME.data.txt"
# "pdftk can take a plain-text file of these same key/value pairs and update a PDF's Info dictionary to match. Currently, it does not update the PDF's XMP stream."
pdftk "$FNAME-gs-gray.pdf" update_info "$FNAME.data.txt" output "$FNAME-gray.pdf"
# (http://wiki.creativecommons.org/XMP_Implementations : Exempi ... allows reading/writing XMP metadata for various file formats, including PDF ... )
# clean up
rm "$FNAME-gs-gray.pdf"
rm "$FNAME.data.txt"
if [ "$OVERWRITE" = "y" ] ; then
  echo "Overwriting $COLORFILENAME..."
  mv "$FNAME-gray.pdf" "$COLORFILENAME"
fi
# BUT NOTE:
# Mixing TEX & PostScript : The GEX Model - http://www.tug.org/TUGboat/Articles/tb21-3/tb68kost.pdf
# VTEX is a (commercial) extended version of TEX, sold by MicroPress, Inc. Free versions of VTEX have recently been made available, that work under OS/2 and Linux. This paper describes GEX, a fast fully-integrated PostScript interpreter which functions as part of the VTEX code-generator. Unless specified otherwise, this article describes the functionality in the freeware version of the VTEX compiler, as available on CTAN sites in systems/vtex.
# GEX is a graphics counterpart to TEX. .. Since GEX may exercise subtle influence on TEX (load fonts, or change TEX registers), GEX is optional in VTEX implementations: the default operation of the program is with GEX off; it is enabled by a command-line switch.
# \includegraphics[width=1.3in, colorspace=grayscale 256]{macaw.jpg}
# http://mail.tug.org/texlive/Contents/live/texmf-dist/doc/generic/FAQ-en/html/FAQ-TeXsystems.html
# A free version of the commercial VTeX extended TeX system is available for use under Linux, which among other things specialises in direct production of PDF from (La)TeX input. Sadly, it's no longer supported, and the ready-built images are made for use with a rather ancient Linux kernel.
# NOTE: another way to capture metadata; if converting via ghostscript:
# http://compgroups.net/comp.text.pdf/How-to-specify-metadata-using-Ghostscript
# first:
# grep -a 'Keywo' orig.pdf
# /Author(xxx)/Title(ttt)/Subject()/Creator(LaTeX)/Producer(pdfTeX-1.40.12)/Keywords(kkkk)
# then - copy this data in a file prologue.ini:
#/pdfmark where {pop} {userdict /pdfmark /cleartomark load put} ifelse
#[/Author(xxx)
#/Title(ttt)
#/Subject()
#/Creator(LaTeX with hyperref package + gs w/ prologue)
#/Producer(pdfTeX-1.40.12)
#/Keywords(kkkk)
#/DOCINFO pdfmark
#
# finally, call gs on the orig file,
# asking to process pdfmarks in prologue.ini:
# gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 \
# -dPDFSETTINGS=/screen -dNOPAUSE -dQUIET -dBATCH -dDOPDFMARKS \
# -sOutputFile=out.pdf in.pdf prologue.ini
# then the metadata will be in output too (which is stripped otherwise;
# note bookmarks are preserved, however).
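To see what the grep step in the script keeps, here is a small sketch run on made-up text. The sample lines only imitate the key names that pdftk dump_data prints (the values are invented), and keep_info_lines is a hypothetical name. It uses grep -E 'Info|Pdf', which should select the same lines as the 'Info\|Pdf' pattern in the script without relying on the GNU \| extension.

```bash
#!/bin/bash
# Hypothetical helper: keep only the Info*/Pdf* lines of a metadata dump
# read from standard input, dropping bookmark and page-count lines.
keep_info_lines() {
    grep -E 'Info|Pdf'
}

# Invented sample in the style of pdftk dump_data output.
sample_dump='InfoKey: Title
InfoValue: Lease Contract
PdfID0: 0123456789abcdef
NumberOfPages: 2
BookmarkTitle: Chapter 1
BookmarkLevel: 1
BookmarkPageNumber: 1'

printf '%s\n' "$sample_dump" | keep_info_lines
```

Only the InfoKey, InfoValue and PdfID0 lines come through. Note that the pattern matches anywhere in the line, so a bookmark whose title happens to contain "Info" or "Pdf" would also slip through.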
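Answer 3
I also had some scanned color PDFs and grayscale PDFs that I wanted to convert to black and white. I tried using gs with the code listed here, and the image quality is good with the PDF text still there. However, that gs code only converts to grayscale (as asked in the question) and the file size is still large. convert gives very poor results when used directly.
I wanted black-and-white PDFs with good image quality and a small file size. I would have tried terdon's solution, but I could not get pdftk on CentOS 7 using yum (at the time of writing).
My solution uses gs to extract grayscale bmp files from the PDF, convert to threshold those bmps to black and white and save them as tiff files, and then img2pdf to compress the tiff images and merge them all into one PDF.
I tried going directly from PDF to tiff, but the quality is not the same, so I save each page as a bmp. For a one-page PDF file, convert does a great job going from bmp to PDF. Example: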
gs -sDEVICE=bmpgray -dNOPAUSE -dBATCH -r300x300 \
-sOutputFile=./pdf_image.bmp ./input.pdf
convert ./pdf_image.bmp -threshold 40% -compress zip ./bw_out.pdf
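For multiple pages, gs can merge multiple PDF files into one, but img2pdf yields a smaller file size than gs. The tiff files must be uncompressed as input to img2pdf. Keep in mind that for a large number of pages, the intermediate bmp and tiff files tend to be large. pdftk or joinpdf would be better if they can merge the compressed PDF files from convert.
I imagine there is a more elegant solution. However, my method produces results with very good image quality and a much smaller file size. To get text back into the black-and-white PDF, run OCR again.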
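Since the whole point is a smaller file, it can help to put a number on the result. A rough sketch, assuming only POSIX wc and shell arithmetic; size_bytes and percent_saved are names made up for this example:

```bash
#!/bin/bash
# Hypothetical helpers for comparing file sizes before and after conversion.

# Print the size of a file in bytes.
size_bytes() {
    wc -c < "$1" | tr -d '[:space:]'
}

# Print how much smaller the second file is than the first, as a
# whole-number percentage (integer division, so the fraction is dropped).
percent_saved() {
    local before after
    before=$(size_bytes "$1")
    after=$(size_bytes "$2")
    if [ "$before" -eq 0 ]; then
        echo "percent_saved: '$1' is empty" >&2
        return 1
    fi
    echo $(( (before - after) * 100 / before ))
}
```

For example, percent_saved input.pdf bw_out.pdf prints the saving in percent; a negative number means the converted file came out larger.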
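My shell script uses gs, convert, and img2pdf. Change the parameters listed at the beginning (number of pages, scan dpi, threshold %, etc.) as needed, then run chmod +x ./pdf2bw.sh. Here is the full script (pdf2bw.sh):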
#!/bin/bash
num_pages=12
dpi_res=300
input_pdf_name=color_or_grayscale.pdf
bw_threshold=40%
output_pdf_name=out_bw.pdf
#-------------------------------------------------------------------------
gs -sDEVICE=bmpgray -dNOPAUSE -dBATCH -q -r$dpi_res \
-sOutputFile=./%d.bmp ./$input_pdf_name
#-------------------------------------------------------------------------
for file_num in `seq 1 $num_pages`
do
  convert ./$file_num.bmp -threshold $bw_threshold \
    ./$file_num.tif
done
#-------------------------------------------------------------------------
input_files=""
for file_num in `seq 1 $num_pages`
do
  input_files+="./$file_num.tif "
done
img2pdf -o ./$output_pdf_name --dpi $dpi_res $input_files
#-------------------------------------------------------------------------
# clean up bmp and tif files used in conversion
for file_num in `seq 1 $num_pages`
do
  rm ./$file_num.bmp
  rm ./$file_num.tif
done
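The script needs num_pages set by hand. If pdfinfo from poppler-utils happens to be installed (an assumption, the answer does not use it), the count can be read from its "Pages:" line instead. In this sketch, page_count_from_pdfinfo is a made-up name, and it reads pdfinfo output on standard input so it can be tried without a PDF at hand:

```bash
#!/bin/bash
# Hypothetical helper: extract the page count from pdfinfo output on stdin.
page_count_from_pdfinfo() {
    awk '/^Pages:/ { print $2; exit }'
}

# Invented sample shaped like pdfinfo output.
printf 'Title:          Scan\nPages:          12\nEncrypted:      no\n' |
    page_count_from_pdfinfo
```

In the script, the hard-coded value could then become num_pages=$(pdfinfo "./$input_pdf_name" | page_count_from_pdfinfo), placed after input_pdf_name is set.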
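Answer 4
RHEL6 and RHEL5, which both baseline Ghostscript at 8.70, could not use the forms of the command given above. Assuming a script or function that expects the PDF file as its first argument, "$1", the following should be more portable: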
gs \
-sOutputFile="grey_$1" \
-sDEVICE=pdfwrite \
-sColorConversionStrategy=Mono \
-sColorConversionStrategyForImages=/Mono \
-dProcessColorModel=/DeviceGray \
-dCompatibilityLevel=1.3 \
-dNOPAUSE -dBATCH \
"$1"
The output file will be prefixed with "grey_".
RHEL6 and 5 can both use CompatibilityLevel=1.4, which is much faster, but I was aiming for portability.
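One caveat with the grey_ prefix: if "$1" contains a directory, as in docs/a.pdf, then grey_$1 expands to grey_docs/a.pdf, which points into a directory that probably does not exist. A small sketch that prefixes only the file name; grey_path is a hypothetical helper, not part of the answer:

```bash
#!/bin/bash
# Hypothetical helper: put the "grey_" prefix on the file name only,
# keeping the output next to the input file.
grey_path() {
    printf '%s/grey_%s\n' "$(dirname -- "$1")" "$(basename -- "$1")"
}

grey_path docs/a.pdf   # prints docs/grey_a.pdf
grey_path a.pdf        # prints ./grey_a.pdf
```

With that in place, the output option could be written as -sOutputFile="$(grey_path "$1")".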