The pipeline’s single source of truth is an enriched JSON file per PDF that combines raw annotations with normalized bibliographic metadata.
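As a rough illustration of what such a per-PDF file could look like, the sketch below combines raw annotations with normalized metadata and writes one JSON file per PDF. The field names (`source_pdf`, `annotations`, `metadata`, `title`, `authors`, `year`) and the helper functions are assumptions made for this example, not the pipeline's actual schema.

```python
import json
from pathlib import Path


def normalize_metadata(raw: dict) -> dict:
    """Normalize a few bibliographic fields (hypothetical field names)."""
    year = raw.get("year")
    year_is_numeric = year is not None and str(year).strip().isdigit()
    return {
        # Collapse runs of whitespace in the title and author names.
        "title": " ".join(raw.get("title", "").split()),
        "authors": [" ".join(a.split()) for a in raw.get("authors", [])],
        # Store the year as an integer, or None when it is missing or malformed.
        "year": int(str(year).strip()) if year_is_numeric else None,
    }


def build_enriched_record(pdf_name: str, annotations: list, raw_metadata: dict) -> dict:
    """Combine raw annotations with normalized metadata into one record."""
    return {
        "source_pdf": pdf_name,
        "annotations": annotations,  # kept exactly as extracted
        "metadata": normalize_metadata(raw_metadata),
    }


def write_enriched_json(record: dict, out_dir: Path) -> Path:
    """Write one JSON file per PDF, named after the PDF's stem."""
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / (Path(record["source_pdf"]).stem + ".json")
    out_path.write_text(
        json.dumps(record, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return out_path
```

Keeping the annotations untouched and normalizing only the metadata means later stages can read a single file per PDF without re-running extraction.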
This Python script extracts text from fixed regions of PDF files, identified by predefined coordinates, and writes the results to a CSV file. The script is particularly useful for processing batches of PDF files ...
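A minimal sketch of coordinate-based extraction, assuming the words on a page have already been obtained from a PDF library as text plus bounding boxes. The `Word` type, the `REGIONS` table, its field names, and the coordinates are all hypothetical; the script's real regions and PDF backend are not shown here.

```python
import csv
from typing import Dict, List, NamedTuple, Tuple


class Word(NamedTuple):
    """One word on a page with its bounding box (hypothetical structure)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)

# Hypothetical field regions; real coordinates depend on the PDF layout.
REGIONS: Dict[str, Box] = {
    "reference": (400.0, 50.0, 560.0, 70.0),
    "date": (400.0, 75.0, 560.0, 95.0),
}


def text_in_region(words: List[Word], box: Box) -> str:
    """Join the words whose centre point falls inside the box."""
    x0, y0, x1, y1 = box
    hits = [
        w for w in words
        if x0 <= (w.x0 + w.x1) / 2 <= x1 and y0 <= (w.y0 + w.y1) / 2 <= y1
    ]
    # Approximate reading order: top to bottom, then left to right.
    hits.sort(key=lambda w: (round(w.y0), w.x0))
    return " ".join(w.text for w in hits)


def extract_fields(words: List[Word], regions: Dict[str, Box]) -> Dict[str, str]:
    """Return one value per named region."""
    return {name: text_in_region(words, box) for name, box in regions.items()}


def write_csv(rows: List[Dict[str, str]], fieldnames: List[str], out_path: str) -> None:
    """Write one CSV row per processed PDF."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```

For a batch, one would call `extract_fields` once per file, add a column identifying the file to each resulting row, and pass the collected rows to `write_csv`.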