22f99965ecb4567cce7f389349a822477b0c4eaa,tests/test_basic.py,,test_looks_scanned,#,35
Before Change
CaseStudy_ACS.pdf contains a transparent image overlaying the entire page.
This overlaying transparent image fools TreeExtractor into thinking it is scanned.
output = pdftotree.parse("tests/input/CaseStudy_ACS.pdf", favor_figures="True")
assert output.count("ocrx_word") == 1  # single appearance in ocr-capabilities
output = pdftotree.parse("tests/input/CaseStudy_ACS.pdf", favor_figures="False")
assert output.count("ocrx_word") >= 1000
After Change
output = pdftotree.parse("tests/input/CaseStudy_ACS.pdf")
soup = BeautifulSoup(output)
assert len(soup.find_all(class_="ocrx_word")) >= 1000
assert len(soup.find_all("figure")) == 3
# Adapted from https://github.com/ocropus/hocr-tools/blob/v1.3.0/hocr-check
def get_prop(node: Tag, name: str) -> Optional[str]:
    title = node.get("title")
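The change above replaces substring counting on the raw output with counting parsed elements. The sketch below illustrates why the two can differ: a substring count also picks up mentions of the class name outside real elements, such as the ocr-capabilities entry noted in the "before" code. It assumes a small, hypothetical hOCR-like fragment (not actual pdftotree output), and it swaps in the standard library's `html.parser` for BeautifulSoup so it runs without extra dependencies; `ClassCounter` and `count_by_class` are illustrative names, not part of pdftotree.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags whose class attribute contains a given class name."""

    def __init__(self, class_name: str) -> None:
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_by_class(html: str, class_name: str) -> int:
    """Return how many elements in `html` carry `class_name`."""
    counter = ClassCounter(class_name)
    counter.feed(html)
    counter.close()
    return counter.count


# Hypothetical hOCR-like fragment, made up for illustration only.
SAMPLE = (
    "<html><head>"
    "<meta name='ocr-capabilities' content='ocr_page ocrx_word'/>"
    "</head><body>"
    "<span class='ocrx_word'>Hello</span> "
    "<span class='ocrx_word'>world</span>"
    "</body></html>"
)

# Substring counting also matches the mention inside the <meta> tag.
substring_hits = SAMPLE.count("ocrx_word")
# Element counting only matches tags that actually carry the class.
element_hits = count_by_class(SAMPLE, "ocrx_word")
```

On this fragment the substring count is one higher than the element count, because the `<meta>` tag mentions the class name without being a word element itself.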
In pattern: SUPERPATTERN
Frequency: 3
Non-data size: 3
Instances
Project Name: HazyResearch/pdftotree
Commit Name: 22f99965ecb4567cce7f389349a822477b0c4eaa
Time: 2020-10-06
Author: hiromu.hota@hal.hitachi.com
File Name: tests/test_basic.py
Class Name:
Method Name: test_looks_scanned
Project Name: kmike/pymorphy2
Commit Name: 585840d2a9a21c6a1a2f2dd9843d7a8e752bd1a2
Time: 2013-03-11
Author: kmike84@gmail.com
File Name: tests/test_analyzer.py
Class Name: TestHyphen
Method Name: test_no_hyphen_analyzer_for_known_prefixes
Project Name: pantsbuild/pants
Commit Name: e17dc893cbbcea4929c9d2315c588ad686be1934
Time: 2015-10-18
Author: john.sirois@gmail.com
File Name: tests/python/pants_test/engine/exp/test_scheduler.py
Class Name: SchedulerTest
Method Name: test_codegen_simple