Getting hocr extraction output from only the first page of a TIFF

Question

EDIT:

I checked that image_to_pdf_or_hocr can take a filename instead of a Pillow Image object:

text = pt.image_to_pdf_or_hocr('D:\\input\\Best time to visit.tiff', extension='hocr', config=(r'--oem 3 --psm 6'), lang="eng")

So tesseract can run directly on the original tiff, and it converts all pages into a single hocr text.
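One way to check that the combined output really covers every page is to count the page elements in it. The sketch below assumes the hOCR marks each page with an element of class ocr_page, as the hOCR format describes; count_hocr_pages is a hypothetical helper name, not part of pytesseract.

```python
import re

# Matches the class attribute of a page element,
# e.g. <div class='ocr_page' id='page_1' ...>
_PAGE_RE = re.compile(rb"""class=['"]ocr_page['"]""")


def count_hocr_pages(hocr: bytes) -> int:
    """Return how many ocr_page elements appear in an hOCR document."""
    return len(_PAGE_RE.findall(hocr))


# Hypothetical usage with the call shown above (needs Tesseract installed):
#   import pytesseract as pt
#   hocr = pt.image_to_pdf_or_hocr(r'D:\input\Best time to visit.tiff',
#                                  extension='hocr')
#   print(count_hocr_pages(hocr))
```

If the count matches the number of pages in the tiff, the single call was enough and no page splitting is needed.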

Original answer:

I took the code from the link in my comment to you and built code that saves every page of the tiff to a separate file. It uses img.seek(page) to select the page, and it works with your file.

from PIL import Image
import os

folder = '/home/furas/Desktop'
filename = 'Best time to visit.tiff'

img = Image.open(os.path.join(folder, filename))

page = 0

while True:
    try:
        img.seek(page)

        filename = f'page-{page+1}.png'
        print('saving...', filename)

        img.save(os.path.join(folder, filename))

        page += 1
    except EOFError:
        # Not enough frames in img
        break
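As an alternative to the manual seek loop, Pillow also provides ImageSequence.Iterator, which walks over the frames without handling EOFError yourself. This is only a sketch: save_pages is a hypothetical helper name, and the page-N.png naming simply mirrors the loop above.

```python
import os

from PIL import Image, ImageSequence


def save_pages(tiff_path, out_folder):
    """Save every frame of a multi-page image as page-N.png; return the paths."""
    saved = []
    with Image.open(tiff_path) as img:
        # The iterator seeks to each frame in turn and stops after the last one.
        for number, frame in enumerate(ImageSequence.Iterator(img), start=1):
            out_path = os.path.join(out_folder, f"page-{number}.png")
            frame.save(out_path)
            saved.append(out_path)
    return saved


# Hypothetical usage with the paths from the loop above:
#   save_pages('/home/furas/Desktop/Best time to visit.tiff',
#              '/home/furas/Desktop')
```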

Something similar in your code works for me.

from PIL import Image
import pytesseract as pt
import os

pt.pytesseract.tesseract_cmd = r'C:\Users\admin\AppData\Local\Programs\Tesseract-OCR\tesseract.exe'
     
# path for the folder for getting the raw images
path = "D:\\input"

# path for the folder for getting the output
tempPath = "D:\\output"

# iterating the images inside the folder
for imageName in os.listdir(path):
 
    # only images   
    if imageName.lower().endswith(('.tiff', '.jpg', '.png')):
        print(imageName)
        
        inputPath = os.path.join(path, imageName)
        img = Image.open(inputPath)
    
        page = 0
        while True:
            try:
        
                img.seek(page)
                text = pt.image_to_pdf_or_hocr(img, extension='hocr', config=(r'--oem 3 --psm 6'), lang="eng")
        
                print('page...', page)
                page += 1
         
                fullTempPath = os.path.join(tempPath, f"time_{imageName}_{page}.hocr")
                #print(text)
        
                # saving the text for every image in a separate .hocr file
                with open(fullTempPath, "wb") as file1:
                    file1.write(text)
            except EOFError:
                # Not enough frames in img
                break

If you try to write multiple pages into a single .hocr file, the result may be broken, so each page has to be written to its own .hocr file.

To write all pages into one file, you would have to use plain text.
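A minimal sketch of the plain-text variant, assuming pytesseract.image_to_string is used to get plain text for each page. The helper name ocr_pages_to_text, the "--- page N ---" separator, and the optional ocr parameter are all hypothetical; the parameter only exists so the function can be tried with a stand-in when Tesseract is not installed.

```python
from PIL import Image, ImageSequence


def ocr_pages_to_text(tiff_path, out_path, ocr=None):
    """Run OCR on every page and write all results to one text file.

    Returns the number of pages processed.
    """
    if ocr is None:
        # Imported lazily so a stand-in OCR function can be passed instead.
        import pytesseract

        def ocr(page):
            return pytesseract.image_to_string(
                page, lang="eng", config="--oem 3 --psm 6"
            )

    count = 0
    with Image.open(tiff_path) as img, \
            open(out_path, "w", encoding="utf-8") as out:
        for count, frame in enumerate(ImageSequence.Iterator(img), start=1):
            out.write(f"--- page {count} ---\n")  # hypothetical separator
            out.write(ocr(frame))
            out.write("\n")
    return count


# Hypothetical usage with the folders from the code above:
#   ocr_pages_to_text(r'D:\input\Best time to visit.tiff',
#                     r'D:\output\Best time to visit.txt')
```

Because plain text has no per-page document structure, appending pages one after another does not produce a malformed file the way concatenated .hocr documents can.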

Answer 1