Counting English Words in a PDF with Python

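A previous article presented a way to batch-recognize the English text in PDFs; see "[Python Crawler] Batch-Recognize English in PDFs and Automatically Translate It into Chinese (Part 1)".

A follow-up covered automatically converting an English PDF into a Chinese document; see "[Python Crawler] Batch-Recognize English in PDFs and Automatically Translate It into Chinese (Part 2)".

This article shows how to count the words in a PDF with Python.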

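Contents
  1. The PDF document to count
  2. Recognizing the text in the PDF
  3. Defining a function that counts the words on a single PDF page
  4. Counting the words in the whole PDF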


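1. The PDF document to count

The PDF used here is a two-page English article.

To keep things simple and clear, this article counts the words in that two-page PDF; the code can be applied directly to English PDFs with any number of pages.

2. Recognizing the text in the PDF

Next, use the pdfplumber library to extract the text from the PDF. The code is as follows: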

import pdfplumber as plb

file_path = r'F:\公众号\74_pdf英文翻译\murphy1996.pdf'

# Extract and print the text of every page
with plb.open(file_path) as pdf:
    k = 1
    for page in pdf.pages:
        print(' ')
        print('Page', k)
        print(page.extract_text())
        k += 1

Notes on the code:

file_path: the path where the PDF file is stored.

plb.open: opens the PDF file.

page.extract_text(): returns the text content of that page.
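The loop above prints each page as it goes. If the page texts are needed later, they can be collected into a list first. The following is a minimal sketch, not the article's code: collect_page_texts and FakePage are hypothetical names, and the `or ""` guard assumes that, depending on the pdfplumber version, extract_text() may return None or an empty string for a page with no extractable text.

```python
def collect_page_texts(pages):
    """Return one string per page, substituting '' when a page yields no text."""
    return [(page.extract_text() or "") for page in pages]


class FakePage:
    """Stand-in for a pdfplumber page, used only to demonstrate the helper."""

    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


texts = collect_page_texts([FakePage("Hello world"), FakePage(None)])
print(texts)  # ['Hello world', '']
```

With pdfplumber itself, the call would be collect_page_texts(pdf.pages) inside the with plb.open(file_path) as pdf: block.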

Output:

Page 1
Medical and Pediatric Oncology 27:62-63 (1996)Ecthyma Gangrenosum Occurring at Sites of Iatrogenic Trauma in PediatricOncology Patients0.M urphy, MB, BCh, BAO, MRCPI, P.J. Marsh, BSC, MB, ChB, MRCPath,s.1.j. Gray, MB, ChB, MRCP, MARCPath, Pedler, MB, ChB, MRCPath, andj. Kernahan, MB, BS, FRcP(Ed) DCHWe report two cases of ecthyma gan- mary skin lesion. Both required prolongedgrenosum which occurred at sites of iatro- courses of antibiotics and one patient died.genic trauma. The first case developed due The different pathogenic mechanisms andto metastatic seeding with Pseudornonas outcomes associated with this condition areaeruginosa during an episode of septicaemia discussed. 01996 Wiley-Liss, Inc.and the second case occurred as a pri-Key words: ecthyma gangrenosum, Pseudomonas aeruginosa, iatrogenicINTRODUCTION ate. No further lesions developed during the remainder ofher treatment.Ecthyma gangrenosum (EG) is a well recognized cuta-neous manifestation of P.a eruginosa infections in immu- Case 2nocompromised patients [ 11. We report two cases of EGA 13-month-old girl was admitted for investigation ofoccurring at sites of iatrogenic trauma in pediatric oncol-pancytopenia. A diagnosis of aplastic anaemia was madeogy patients and demonstrate important pathogenic andfollowing left iliac crest marrow aspirate and trephineclinical features of this condition.bone biopsy. She became pyrexial on day 10 followingadmission but repeated blood cultures were negative. Onday 24, a 1 cm2 sloughing necrotic area surrounded byCASE REPORTS purplish erythema was noted at the bone marrow Sam-pling site. At this time her Hb was 6.6 g/dl and WCC wasCase 12.4 X 109/L( neutrophils 0.6 X 109/L). She was treatedA 2-year-old girl with acute lymphoblastic leukaemiaempirically with azlocillin and gentamicin. P. aerugi-was admitted with a fever, 2 weeks after a course ofnma was isolated from the lesion swab and a diagnosis ofchemotherapy which included intrathecal methotrexate.EG was made. 
Blood cultures remained sterile and radio-She was profoundly neutropenic (WCC 2.2 X lo9 /L, no logical examination did not reveal any evidence of bonyneutrophils). Physical, examination revealed a swollen,involvement. Despite prolonged antibiotic and topicalerythematous area with a central black eschar over thetherapy, the iliac crest lesion failed to improve. On daylumbar puncture site. She was commenced empirically32, she became pyrexial and Enterobacter sp. was iso-on imipenem-cilastatin and teicoplanin. Following isola-lated from two blood cultures. She was treated with intra-tion of P. aeruginosa from both blood cultures and lesionvenous gentamicin and ciprofloxacin. Throughout herswab, a diagnosis of EG was made and therapy wasillness she required numerous transfusions with plateletschanged to ceftazidime and amikacin. Radiological as-and red blood cells. A suitable bone marrow donor couldsessment of the lumbar spine did not reveal any evidenceof bony involvement. She became apyrexial on day 3 asher neutropenia began to recover. She did not requireFrom the Departments of Microbiology (O.M., P.J.M., J.G., S .J.P.),treatment with colony stimulating factors. Antimicrobialsand Child Health (J.K.), Royal Victoria Infirmary, Newcastle uponwere discontinued on day 17. Topical silver sulphadia- Tyne, UK.zine was continued for a further 4 weeks as the lesionReceived April 6, 1995; accepted August 21, 1995healed slowly by granulation from the base.Address reprint requests to 0. Murphy, M.B., B.Ch., B.A.O.,For subsequent chemotherapy, high dose intravenousM.R.C.P.I., Department of Microbiology, Royal Victoria Infirmary,methotrexate was substituted for intrathecal methotrex- Queen Victoria Road, Newcastle upon Tyne NEl 4LP, UK.0 1996 Wiley-Liss, Inc.

Page 2
EG at Sites of Iatrogenic Trauma 63not be found. 
A 2-week course of GMCSF was started on In case 1, we believe that seeding to an area of trauma-day 53 but no improvement in her haematological param- tised skin occurred during bacteraemia. Early recognitioneters was seen and her general condition continued to and aggressive treatment may have played a role in con-deteriorate. On day 85, she again became pyrexial and a trolling the primary septicaemia but recovery of the pa-1. O X 1.5 cm ulcer on her right labium majus was noted. tient’s bone marrow probably contributed more to theHer WCC was 0.4 X 109/L. P. aeruginosa was isolated long-term outcome. In case 2, repeated negative bloodfrom blood cultures for the first time. Despite aggressive cultures suggest that EG occurred as a primary lesion at aantibiotic and antifungal treatment, further lesions devel- site of prior skin trauma. Despite aggressive treatment,oped on her face and chest and she subsequently died. persistent profound neutropenia was associated with fail-ure of the lesion to resolve and the development of asecondary bacteraemia and further lesions.DISCUSSIONPaediatric oncology patients are frequently subject toAlthough not pathognomic, ecthyma gangrenosum is a invasive procedures involving minor skin trauma whichwell recognised manifestation of P. aeruginosa infection may predispose them to infection with various organismsin immunocompromised patients. Factors such as neutro- including P. aeruginosa. EG is an extremely difficultpenia, use of bread spectrum antibiotics, loss of skin condition to treat and a high index of suspicion in thisintegrity, and moist conditions have been shown to pre- at-risk population is required to ensure early diagnosisdispose to infection with P. aeruginosa and the develop- and optimum treatment.ment of EG [2]. Two possible pathogenic mechanisms inthe development of this condition have been postulated[2,3]. 
In classic or bacteraemic EG, the lesion is consid-ered to represent blood-borne metastatic seeding of P.aeruginosa to the skin. In non-bacteraemic or primary REFERENCESEG, the lesion is located at the site of entry of the organ-1. Dorff GJ, Geimer NF, Rosenthal DR, et al.: Pseudornonas septice-ism into the skin. In these cases the lesions have beenmia: illustrated evolution of its skin lesions. Arch Intern Med 128:found to occur more commonly in the distribution of 591, 1971.exocrine glands and secondary bacteraemia has rarely 2 El Baze P, Thyss A, Vinti H, Deville A, Dellamonica P, Ortonnebeen reported. Early diagnosis and aggressive therapy are J-P: A study of nineteen immunocompromised patients with exten-sive skin lesions caused by Pseudomonas aeruginosa with andimportant in the management of these patients. Althoughwithout bacteraemia. Acta Derm Venereol (Stockh) 71:411-415,patients with non-bacteraemic lesions have generally1991.been found to have a better prognosis than those with 3. Huminer D, Siegman-Igra Y, Morduchowicz G, Pitlik SD: Ec-bacteraemic EG [3,4], our experience of survival ulti- thyma gangrenosum without bacteraemia. Report of six cases and amately being determined by recovery of neutrophils con- review of the literature. Arch Intern Med 147:299-301, 1987.4. Fergie JE, Patrick CP, Lott L: Pseudomonas aeruginosa cellulitisfirms that of others [5].and ecthyma gangrenosum in imrnunocompromised children. Pedi-To our knowledge, these are the first reports of EG atr Infect Dis J 10:496-500, 1991.occurring at sites of iatrogenic trauma in paediatric oncol- 5. Greene SL, Daniel Su WP, Muller SA: Ecthyma gangrenosurn:ogy patients. The only previous report was in an adult report of clinical, histopathologic, and bacteriologic aspects ofwith AML who developed EG at the site of placement of eight cases. J Am Acad Dermatol 11:781-787, 1984.6. Klepflish A, Bembi A. Ecthyma gangrenosum caused by a rovingan ECG electrode [6]. 
In this case, skin trauma coincidedchest electrode in an acute myeloid leukaemia patient withwith a documented P. aeruginosa septicaemia and meta-Pseudomonas septicaemia [Letterl. J Am Acad Dermatol 18585-static seeding was felt to have occurred. 586, 1988.




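3. Defining a function that counts the words on a single PDF page
Use a regular expression to split the page content into a list, filter out the empty strings with the filter function, and then count the words on the page.

The code is as follows: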

import re

def wd_num(pg):
    # Split on the separators, drop the empty strings, and count what remains
    pg_wd_num = len(list(filter(None, re.split(r"[\n|\s|,|!|.|(|)|;|-|/|:]", pg))))
    return pg_wd_num

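Notes on the code:

re.split(r"[\n|\s|,|!|.|(|)|;|-|/|:]", pg): splits the content of pg into a list, using spaces, newlines, commas, periods, exclamation marks, and so on as separators. Inside the square brackets the | characters are literal rather than "or" operators, so a | in the text also acts as a separator; and because |-| is read as a range from | to |, the hyphen is not a separator.

filter(None, ...): removes the empty strings from the list.

len: returns the length of the list.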
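To see how the character class in this pattern behaves, the sketch below compares the article's pattern with a tidier one on a short string taken from the sample output. TIDY_PATTERN and the tokens helper are hypothetical names introduced only for illustration. The tidy pattern also splits on hyphens, so it would yield different totals from the ones reported in this article.

```python
import re

# The pattern exactly as used in the article
ARTICLE_PATTERN = r"[\n|\s|,|!|.|(|)|;|-|/|:]"
# Hypothetical alternative: no literal '|', and the hyphen (placed last) is a separator
TIDY_PATTERN = r"[\s,!.();/:-]+"


def tokens(pattern, text):
    """Split text on the pattern and drop the empty strings."""
    return [t for t in re.split(pattern, text) if t]


sample = "Oncology 27:62-63 (1996)"
print(tokens(ARTICLE_PATTERN, sample))  # ['Oncology', '27', '62-63', '1996']
print(tokens(TIDY_PATTERN, sample))     # ['Oncology', '27', '62', '63', '1996']
print(tokens(ARTICLE_PATTERN, "a|b"))   # ['a', 'b']
```

Whether "62-63" should count as one word or two is a design choice; the point is only that the pattern decides it.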

To make this easier to follow, the single-page word count is built up step by step, from the inside out.

First comes the re.split call:

pg = '''Ecthyma Gangrenosum Occurring at Sites of Iatrogenic Trauma in Pediatric
    Oncology Patients'''
re.split(r"[\n|\s|,|!|.]", pg)
Result:
['Ecthyma', 'Gangrenosum', 'Occurring', 'at', 'Sites', 'of', 'Iatrogenic', 'Trauma', 'in', 'Pediatric', '', '', '', '', 'Oncology', 'Patients']
The function splits the string into a list at each of the specified separators. The newline and the four spaces between 'Pediatric' and 'Oncology' are five consecutive separators, which leaves four empty strings in the list.
Next, filter the empty strings out of the list:
list(filter(None, re.split(r"[\n|\s|,|!|.|(|)|;|-|/|:]", pg)))
Result:
['Ecthyma', 'Gangrenosum', 'Occurring', 'at', 'Sites', 'of', 'Iatrogenic', 'Trauma', 'in', 'Pediatric', 'Oncology', 'Patients']
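As a small illustration of what filter(None, ...) does: it keeps only the truthy items, so empty strings are dropped, while a string containing a space would be kept. In the steps above no whitespace survives anyway, because re.split has already consumed it as a separator. The list below is made up for the demonstration.

```python
# Made-up list: one empty string and one string holding a single space
parts = ['Pediatric', '', ' ', 'Oncology']

kept = list(filter(None, parts))
print(kept)  # ['Pediatric', ' ', 'Oncology']

# An equivalent list comprehension
same = [p for p in parts if p]
print(kept == same)  # True
```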

Finally, take the length of this list, which is the number of words in the string:

len(list(filter(None, re.split(r"[\n|\s|,|!|.|(|)|;|-|/|:]", pg))))

Result:

12

Counting the words by hand gives the same number.
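An alternative to splitting and filtering is to match the words directly with re.findall. The sketch below is not the method used in this article. count_alpha_words is a hypothetical helper that counts runs of ASCII letters, optionally joined by hyphens or apostrophes, so numbers are ignored and the totals would differ from those reported here.

```python
import re


def count_alpha_words(text):
    """Count runs of ASCII letters, allowing internal hyphens or apostrophes."""
    return len(re.findall(r"[A-Za-z]+(?:[-'][A-Za-z]+)*", text))


pg = "Ecthyma Gangrenosum Occurring at Sites of Iatrogenic Trauma in Pediatric\n    Oncology Patients"
print(count_alpha_words(pg))  # 12

# Digits are skipped: this counts 'A', 'year-old', 'girl', 'WCC'
print(count_alpha_words("A 2-year-old girl, WCC 2.2"))  # 4
```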


4. Counting the words in the whole PDF
Finally, loop over the pages to count the words on each page and in the PDF as a whole:
with plb.open(file_path) as pdf:
    sum_wd_num = 0
    for k, page in enumerate(pdf.pages, start=1):
        print(' ')
        pg = page.extract_text()
        n = wd_num(pg)  # count once and reuse the value
        sum_wd_num += n
        print('Page', k, 'has', n, 'words')
print(' ')
print('Total:', sum_wd_num, 'words')
Output:

Page 1 has 611 words

Page 2 has 674 words

Total: 1285 words
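The per-page tally can also be separated from the PDF handling, which makes it easy to try on plain strings. This is a sketch under stated assumptions: count_words_per_page is a hypothetical helper, wd_num is restated with the article's pattern, and the `or ""` guard is an addition for pages whose text comes back as None.

```python
import re

# The article's pattern, unchanged
PATTERN = r"[\n|\s|,|!|.|(|)|;|-|/|:]"


def wd_num(pg):
    """Count the non-empty pieces left after splitting; None counts as empty text."""
    return len(list(filter(None, re.split(PATTERN, pg or ""))))


def count_words_per_page(page_texts):
    """Return the list of per-page counts and their total."""
    counts = [wd_num(text) for text in page_texts]
    return counts, sum(counts)


counts, total = count_words_per_page(["We report two cases.", None, "EG was made"])
print(counts, total)  # [4, 0, 3] 7
```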
That concludes this walkthrough of counting the words in a PDF with Python. Readers who need it can follow the code and try it for themselves.
