4 PDF to Markdown

from pyhere import here
import sys
import os
from pathlib import Path
from openai import OpenAI
from IPython.display import display_markdown

sys.path.append(os.path.abspath('../..'))

api_key = os.getenv("OPENAI_API_KEY")
client = OpenAI(api_key=api_key)

4.1 Final Wrapper

from src.openai_tools import ocr_pdf_to_markdown

%%time
cochran_md = ocr_pdf_to_markdown(here("pdf/textbook/Cochran_1977_SamplingTechniques_Ch1.pdf"),
                                 model = "gpt-4o")

CPU times: user 1.24 s, sys: 57.9 ms, total: 1.3 s
Wall time: 3min 46s



file_path = Path(here("output/markdown/Cochran_1977_SamplingTechniques_Ch1.md"))
with open(file_path, 'w', encoding='utf-8') as file:
    # Write the string of text to the file
    file.write(cochran_md)

%%time
tirads_md = ocr_pdf_to_markdown(here("pdf/paper/TI-RADS_A User’s Guide.pdf"),
                                model = "gpt-4o-2024-08-06")

CPU times: user 484 ms, sys: 29.8 ms, total: 514 ms
Wall time: 1min 24s

file_path = Path(here("output/markdown/TI-RADS_A-User-Guide.md"))
with open(file_path, 'w', encoding='utf-8') as file:
    # Write the string of text to the file
    file.write(tirads_md)
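
Both save steps assume that the `output/markdown` folder already exists. A minimal sketch of a reusable helper that creates the folder when needed; `save_markdown` is a hypothetical name and is not part of `src.openai_tools`:

```python
from pathlib import Path


def save_markdown(md_text: str, file_path: Path) -> Path:
    """Write Markdown text to file_path, creating parent folders if needed."""
    file_path = Path(file_path)
    # Create the output directory first so the write does not fail on a fresh checkout
    file_path.parent.mkdir(parents=True, exist_ok=True)
    file_path.write_text(md_text, encoding="utf-8")
    return file_path
```

In this notebook it could be called as `save_markdown(tirads_md, file_path)`.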

4.2 Extract: PDF -> Base64 Image

from src.pdftools import pdf_to_base64_images

cochran_img_base64 = pdf_to_base64_images(here("pdf/textbook/Cochran_1977_SamplingTechniques_Ch1.pdf"))
cochran_pg14_img_base64 = pdf_to_base64_images(here("pdf/textbook/Cochran_1977_SamplingTechniques_Ch1_pg14.pdf"))
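
`pdf_to_base64_images` lives in `src.pdftools` and its implementation is not shown here. As a rough illustration of the encoding step only, and assuming each page has already been rendered to PNG bytes, the base64 string and the data URL used later in the OCR request could be built as follows (`image_bytes_to_base64` and `to_png_data_url` are hypothetical helpers):

```python
import base64


def image_bytes_to_base64(image_bytes: bytes) -> str:
    """Encode raw image bytes as an ASCII base64 string."""
    return base64.b64encode(image_bytes).decode("ascii")


def to_png_data_url(base64_image: str) -> str:
    """Wrap a base64 string in a PNG data URL, the form passed as image_url in the OCR request."""
    return f"data:image/png;base64,{base64_image}"


# The 8-byte PNG file signature, used here as stand-in image data
png_signature = b"\x89PNG\r\n\x1a\n"
encoded = image_bytes_to_base64(png_signature)
print(to_png_data_url(encoded))
```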

4.3 LLM: OCR to Markdown

def ocr_single_image_to_markdown(base64_image, 
                                 model = "gpt-4o",
                                 md_format = "Github-flavored markdown",
                                 heading_lv_max = "H2"
                                 ):
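    # Raw f-string: LaTeX backslashes (e.g. \times) are kept literally instead of
    # being read as escape sequences, while {md_format} is still interpolated
    system_prompt = rf"""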
    You are an advanced OCR-based data extraction tool designed to convert text, tables, and structured content from images into {md_format}. Ensure the output retains the original layout and information integrity as closely as possible. Include headers, bullet points, or tables where appropriate, and optimize for readability in Markdown syntax.
    
    **Heading level:** The highest level of heading is {heading_lv_max}. 
    **LaTeX Math expression**
    - Inline: surround the inline expression with dollar symbols, for example: $1+1 = 2$
    - Blocks: delimit the block expression with two dollar symbols, for example:
      $$
      E = m \times c^2 
      $$
    
    Return markdown text output without enclosing in code block. If the image is blank or no appropriate content can be extracted, return empty text string ("").  
    """
    
    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "system",
                "content": system_prompt
            },
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Convert data from this image to markdown text"},
                    {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{base64_image}", "detail": "high"}}
                ]
            }
        ],
        temperature=0.0,
    )
    return response.choices[0].message.content

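Because the system prompt contains LaTeX, its backslashes need care. In an ordinary (non-raw) string literal Python reads the `\t` in `\times` as a TAB, and unrecognized sequences such as `\p` raise a `SyntaxWarning` in recent Python versions. A small sketch of the difference, using a raw f-string so that `{...}` interpolation still works:

```python
md_format = "Github-flavored markdown"

plain = "E = m \times c^2"   # "\t" is parsed as a TAB character
raw = r"E = m \times c^2"    # raw string: the backslash is kept
# Raw f-string: interpolation and a literal backslash together
prompt = rf"Output {md_format}, e.g. $a \times b$"

print(repr(plain))
print(repr(raw))
print(repr(prompt))
```
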
4.4 Extract One Image

4.4.1 First Page

# Pass only the first page: cochran_img_base64 is a list of base64 strings, one per page
cochran_img_md_0 = ocr_single_image_to_markdown(cochran_img_base64[0])

print(cochran_img_md_0)  # show the raw Markdown text

## Sampling Techniques

*third edition*

**WILLIAM G. COCHRAN**  
*Professor of Statistics, Emeritus  
Harvard University*

**JOHN WILEY & SONS**  
New York • Chichester • Brisbane • Toronto • Singapore

4.4.2 Page with Equation & Table

cochran_pg14_md = ocr_single_image_to_markdown(cochran_pg14_img_base64)
print(cochran_pg14_md)  # show the raw Markdown text

## Table 1.1  
**Effect of a Bias $B$ on the Probability of an Error Greater than 1.96$\sigma$**

| $B/\sigma$ | Probability of Error < -1.96$\sigma$ | > 1.96$\sigma$ | Total  |
|------------|--------------------------------------|----------------|--------|
| 0.02       | 0.0228                               | 0.0262         | 0.0500 |
| 0.04       | 0.0228                               | 0.0274         | 0.0502 |
| 0.06       | 0.0217                               | 0.0287         | 0.0504 |
| 0.08       | 0.0197                               | 0.0314         | 0.0511 |
| 0.10       | 0.0170                               | 0.0341         | 0.0511 |
| 0.20       | 0.0154                               | 0.0392         | 0.0546 |
| 0.40       | 0.0052                               | 0.0594         | 0.0646 |
| 0.60       | 0.0025                               | 0.0869         | 0.0894 |
| 0.80       | 0.0029                               | 0.1200         | 0.1229 |
| 1.00       | 0.0015                               | 0.1685         | 0.1700 |
| 1.50       | 0.0003                               | 0.3228         | 0.3231 |

For the total probability of an error of more than 1.96$\sigma$, the bias has little effect provided that it is less than one tenth of the standard deviation. At this point, the total probability is 0.0511 instead of the 0.05 that we think it is. As the bias increases further, the disturbance becomes more serious. At $B = \sigma$, the total probability of error is 0.17, more than three times the presumed value.

The two tails are affected differently. With a positive bias, as in this example, the probability of an underestimate by more than 1.96$\sigma$ shrinks rapidly from the presumed 0.025 to become negligible when $B = \sigma$. The probability of the corresponding overestimate mounts steadily. In most applications, the total error is the primary interest, but occasionally we are particularly interested in errors in one direction.

As a working rule, the effect of bias on the accuracy of an estimate is negligible if the bias is less than one tenth of the standard deviation of the estimate. If we have a biased method of estimation for which $B/\sigma < 0.1$, where $B$ is the absolute value of the bias, it can be claimed that the bias is not an appreciable disadvantage of the method.

cochran_pg14_md2 = ocr_single_image_to_markdown(cochran_pg14_img_base64, 
                                                model = "gpt-4o-mini")
print(cochran_pg14_md2)  # show the raw Markdown text

## TABLE 1.1
### Effect of a Bias $B$ on the Probability of an Error Greater than 1.96σ

| $B/\sigma$ | Probability of Error |
|------------|----------------------|
|            | $< -1.96$           | $> 1.96$ | Total   |
| 0.02       | 0.028                | 0.026    | 0.050   |
| 0.04       | 0.028                | 0.027    | 0.050   |
| 0.06       | 0.021                | 0.023    | 0.044   |
| 0.08       | 0.017                | 0.034    | 0.051   |
| 0.10       | 0.015                | 0.039    | 0.054   |
| 0.20       | 0.005                | 0.069    | 0.074   |
| 0.40       | 0.002                | 0.086    | 0.088   |
| 0.60       | 0.0005              | 0.130    | 0.1305  |
| 1.00       | 0.00015             | 0.165    | 0.16515 |
| 1.50       | 0.0003              | 0.328    | 0.3283  |

For the total probability of an error of more than 1.96σ, the bias has little effect provided that it is less than one tenth of the standard deviation. At this point the total probability is 0.0511 instead of the 0.05 that we think it is. As the bias increases further, the disturbance becomes more serious. At $B = \sigma$, the total probability of error is 0.17, more than three times the presumed value.

The two tails are affected differently. With a positive bias, as in this example, the probability of an underestimate by more than 1.96σ shrinks rapidly from the presumed 0.025 to become negligibly when $B = \sigma$. The probability of the corresponding overestimate mounts steadily. In most applications, the total error is the primary interest, because usually we are particularly interested in errors in one direction.

As a working rule, the effect of bias on the accuracy of an estimate is negligible if the bias is less than one tenth of the standard deviation of the estimate. If we have a biased method of estimation for which $B/\sigma < 0.1$, where $B$ is the absolute value of the bias, it can be claimed that the bias is not an appreciable disadvantage of the error.
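
The two outputs differ in several table cells and in a few words of the prose, which is hard to spot by eye. One way to surface such differences is a line-by-line unified diff, sketched here with the standard-library `difflib` module (the helper `diff_markdown` is hypothetical and not part of this project):

```python
import difflib


def diff_markdown(md_a: str, md_b: str, name_a: str = "a", name_b: str = "b") -> str:
    """Return a unified diff of two Markdown strings, compared line by line."""
    diff = difflib.unified_diff(
        md_a.splitlines(),
        md_b.splitlines(),
        fromfile=name_a,
        tofile=name_b,
        lineterm="",
    )
    return "\n".join(diff)


# Short stand-in strings; in the notebook these would be cochran_pg14_md and cochran_pg14_md2
example_a = "first line\nsecond line"
example_b = "first line\nsecond LINE"
print(diff_markdown(example_a, example_b, "gpt-4o", "gpt-4o-mini"))
```

Lines prefixed with `-` come from the first text and lines prefixed with `+` from the second; identical inputs give an empty string.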

4.5 Extract Multiple Images

def ocr_image_to_markdown(base64_images: list[str] | str, **kwargs) -> str:
    """Convert one or multiple base64-encoded images to Markdown text."""
    # Single image: iterating over a string yields 1-character strings,
    # so this check is True for a str and False for a list of encoded images
    is_single_image = all([len(x) == 1 for x in base64_images])
    if is_single_image:
        md_text = ocr_single_image_to_markdown(base64_images, **kwargs)
        return md_text
    # Multiple images: OCR each page, then join the non-blank pages
    try:
        md_text_ls = [ocr_single_image_to_markdown(base64_image, **kwargs) for base64_image in base64_images]
        md_text_ls_rm_blank = list(filter(None, md_text_ls))  # Remove blank string ("")
        md_text = "\n\n---\n\n".join(md_text_ls_rm_blank)
        return md_text
    except Exception as e:
        print(f"Error extracting image: {e}")
        return ""  # Keep the return type a string, as on the success path

4.5.1 Execute

cochran_img_md_05 = ocr_image_to_markdown(cochran_img_base64[0:5], model = "gpt-4o-mini")

print(cochran_img_md_05)  # show the raw Markdown text

## Sampling Techniques

### third edition

**WILLIAM G. COCHRAN**  
Professor of Statistics, Emeritus  
Harvard University

---

**JOHN WILEY & SONS**  
New York • Chichester • Brisbane • Toronto • Singapore

---

## Copyright

Copyright © 1977, by John Wiley & Sons, Inc.  
All rights reserved. Published simultaneously in Canada.  

Reproduction or translation of any part of this work beyond that permitted by Sections 107 or 108 of the 1976 United States Copyright Act without the permission of the copyright owner is unlawful. Requests for permission or further information should be addressed to the Permissions Department, John Wiley & Sons, Inc.

## Library of Congress Cataloging in Publication Data

Cochran, William Gemmell, 1909-  
Sampling techniques.  

(Wiley series in probability and mathematical statistics)  
Includes bibliographical references and index.  

1. Sampling (Statistics) 1. Title.  
QA276.6.C6 1977 001.4222 77-728  
ISBN 0-471-16240-X  

Printed in the United States of America  

40 39 38 37 36  

---

to Betty

---

## Preface

As did the previous editions, this textbook presents a comprehensive account of sampling theory as it has been developed for use in sample surveys. It contains illustrations to show how the theory is applied in practice, and exercises to be worked by the student. The book will be useful both as a text for a course on sample surveys with the major emphasis on theory and for individual reading by the student.

The minimum mathematical equipment necessary to follow the great bulk of the material is a familiarity with algebra, especially relatively complicated algebraic expressions, plus a knowledge of probability for finite sample spaces, including combinatorial probabilities. The book presupposes an introductory statistics course that covers means and standard deviations, the normal, binomial, hypergeometric, and multinomial distributions, the central limit theorem, linear regression, and the simpler types of analyses of variance. Since much of classical sample survey theory deals with the distributions of estimators over the set of randomizations provided by the sampling plan, some knowledge of nonparametric methods is helpful.

The topics in this edition are presented in essentially the same order as in earlier editions. New sections have been included, or sections rewritten, primarily for one of three reasons: (1) to present introductions to topics (sampling plans or methods of estimation) relatively new in the field; (2) to cover further work done during the last 15 years on older methods, intended either to improve them or to learn more about their performance; and (3) to shorten, clarify, or simplify proofs given in previous editions.

New topics in this edition include the approximate methods developed for the difficult problem of attaching standard errors or confidence limits to nonlinear estimates made from the results of surveys with complex plans. These methods will be more and more needed as statistical analyses (e.g., regressions) are performed on the results. For surveys containing sensitive questions that some respondents are unlikely to be willing to answer truthfully, a new device is to represent the respondent with either the sensitive question or an innocuous question; the specific choice, made by randomization, is unknown to the interviewer. In some sampling problems it may seem economically attractive, or essential in countries with low sampling resources, to use two overlapping lists (for example, as they are called) to cover the complete population. The method of double sampling has been extended to cases where the objective is to compare the means.

from pathlib import Path

file_path = Path(here("output/markdown/Cochran_1977_SamplingTechniques_Ch1_1-5.md"))

with open(file_path, 'w', encoding='utf-8') as file:
    # Write the string of text to the file
    file.write(cochran_img_md_05)

4.5.2 HowTo

print(all([len(x) == 1 for x in cochran_img_base64[0]]))
print(all([len(x) == 1 for x in cochran_img_base64]))

True
False
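
The length-based check works because iterating over a string yields one-character strings. It has one edge case: for an empty list, `all(...)` is vacuously `True`, so an empty list would be treated as a single image. A possible alternative, shown here only as a sketch, is an explicit type check (`is_single_image` is a hypothetical helper name):

```python
def is_single_image(base64_images) -> bool:
    """Return True when a single base64 string (not a list of strings) is passed."""
    return isinstance(base64_images, str)


print(is_single_image("iVBORw0KGgo="))            # True: one encoded image
print(is_single_image(["iVBORw0KGgo=", "AAAA"]))  # False: a list of pages
print(is_single_image([]))                        # False: an empty list is not a single image
print(all([len(x) == 1 for x in []]))             # True: the length-based check on an empty list
```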

# Remove blank string
list(filter(None, ["A", "", ""]))

['A']
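
Putting the blank-page filter together with the page separator used in `ocr_image_to_markdown`, the joining step can be sketched in isolation (`join_pages` and `PAGE_SEPARATOR` are hypothetical names used for illustration):

```python
PAGE_SEPARATOR = "\n\n---\n\n"  # same separator string as in ocr_image_to_markdown


def join_pages(page_texts: list[str]) -> str:
    """Drop blank pages, then join the remaining pages with a horizontal rule."""
    # filter(None, ...) removes "" (and any other falsy item)
    non_blank = list(filter(None, page_texts))
    return PAGE_SEPARATOR.join(non_blank)


print(join_pages(["## Page 1", "", "Text on page 3"]))
```

If every page is blank, the result is an empty string.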