Workflow & Data

Here you can learn about the procedure we have developed as a possible solution for detecting biblical citations (not only) in the press of the First Republic. The description offered here attempts to make this topic as comprehensible as possible without getting too technical.

You can find the complete script and solution on GitHub.


Periodicals

As mentioned on the project homepage, our project focuses on the First Republic press between 1925 and 1939. The specific periodicals were chosen to include a diverse range of topics for which different results in detecting biblical citations might be expected. However, the limited time period, as well as the periodicals chosen, also reflect the availability of data and their chronological overlap. In particular, it is important to mention the availability of certain periodicals during specific years – the Catholic Čech was available only between 1925 and 1935, although it was published until 1937. As of the creation of this project, Polední list (The Midday Paper) was only available between 1930 and 1933.

  • Čech: politický týdeník katolický (The Bohemian: Catholic Political Weekly) was a Catholic newspaper published daily since 1869 (between 1897 and 1903 under the name Katolické listy), representing the oldest political party in the Czech lands, the National Party of Old Bohemia.
  • Věstník katolického duchovenstva (The Catholic Clergy Gazette) was a monthly journal published between 1900 and 1942 (between 1900 and 1920, twice a month). As its subtitles indicate – The ecclesiastical-political and interest body of the Catholic clergy in the Lands of the Bohemian Crown, later of the Catholic clergy in the Czechoslovak state – it dealt to a large extent with Catholic topics.
  • Moravský hospodář (The Moravian Farmer) succeeded Ústřední list rolnictva moravského (The Central Journal of Moravian Agriculture). It was first published in 1898 and shut down in 1941. This periodical, published once every two weeks, focused on agriculture and economics.
  • Moravský večerník (The Moravian Evening Post) was published in Olomouc between 1922 and 1945 at various intervals, during the reference period as a daily newspaper. Each issue focused on politics as well as short news from the region, cultural tips, sports news, and advertisements.
  • Přítomnost: nezávislý týdeník (The Presence: an Independent Weekly) was published between 1924 and 1939, and between 1942 and 1945. In the 1930s, it was one of the most respected newspapers covering politics.
  • Venkov: orgán České strany agrární (The Countryside: Journal of the Czech Agrarian Party) was the main medium of the Czech Agrarian Party, published between 1906 and 1938 and issued since 1916 by the central press authority of the Agrarian Party.
  • Studentský časopis (The Student Magazine) was first published in 1922 and continued till 1942. It was a youth magazine and its content was designed accordingly. During the reference period it was published ten times a year.
  • Československý zemědělec (The Czechoslovak Farmer) was published weekly between 1919 and 1939. This periodical regularly contained several supplements dealing with agriculture and the economy in the broadest sense.
  • Český učitel: věstník Zemského ústředního spolku jednot učitelských v Čechách (The Czech Teacher: Journal of the Provincial Central Association of Teachers’ Units in Bohemia) was first published in 1897 as a weekly newspaper (except during the holidays), was discontinued during the First World War, and resumed in 1919. This journal ceased to exist in 1941. It focused mostly on topics concerning education and teaching.
  • Posel záhrobní: spiritistický časopis věnovaný záhadám duševním (The Messenger from Beyond the Grave: a Spiritualist Magazine Dealing with the Mysteries of the Spirit) was a prominent medium of the Czech spiritualists associated with Karel Sezemský. It was published four times a year between 1900 and 1940 and contained not only translations from foreign magazines and mediums’ speeches, but also announced upcoming spiritualist lectures and conventions.
  • Polední list (The Midday Paper) was a non-political magazine published daily by the Tempo journalistic concern between 1929 and 1945.

Below, you can see a chart that provides a general overview of the scope of the dataset under study.

The Bible

The nature of the available biblical dataset greatly influences our options for detecting individual verses across the searched corpus (see below). The server obohu.cz provided us with four translations – namely the Kralice Bible (Old and New Testament), the translation by Jan Hejčl (Old Testament and Deuterocanonical books), the translation by Jan Ladislav Sýkora (New Testament), and the translation by František Žilka (New Testament). Moreover, Pavel Kosek from Masaryk University provided us with some parts of the St. Wenceslas Bible. In the following chart, you can see summary statistics on the extent of the biblical dataset.

The Question of Data

It might seem easy and straightforward to find a biblical verse in any text. However, many complications are encountered during such a process, stemming from the nature of both the biblical text and the texts in which we search for the quotations. Below we list the most important obstacles that stand in the way of a simple full-text search.

The Question of the Biblical Text

The Bible is available in Czech in many translations. The situation in the 1930s was no different: multiple translations were used in parallel. As mentioned above, the availability of the translations in use is unfortunately very limited, and many of them are still subject to copyright law. What is more, the studied texts often contain various paraphrases or allusions that may be considered biblical quotations but do not correspond to any of the translations available. For example, a well-known part of the verse Matt 22:37, “Milovati budeš Pána Boha svého z celého srdce svého, a ze vší duše své, a ze vší mysli své.” (“You shall love the Lord your God with all your heart, with all your soul, and with all your mind.”), was also found in the form “Milovati budeš Pána Boha svého … nade všecko” (“You shall love the Lord your God … above all”).

Moreover, the biblical verses are often quoted only in part. This means that it was necessary to divide verses into smaller parts – called sub-verses – for the purpose of this project. However, automated division into smaller parts causes further complications, as such divisions do not always create meaningful parts of verses. This may lead to distorted results.

Further difficulties are caused by verses and sub-verses that are too general and which, although they may be fully present in the text being searched, are not biblical citations. This issue is resolved by using a previously prepared list of such verses and sub-verses, which are then ignored in the resulting data processing. This, however, requires manual work by the scholar, which may not always be 100% accurate (the corresponding verses were selected only by a search), and some general verses may turn out to be actual biblical citations. Exod 20:13, the commandment “Thou shalt not kill,” is an example of this. Whether or not a particular citation is actually biblical can only be discerned from its context. This level of detection would require a different, significantly more complex approach.

The final obstacle mentioned here is the internal intertextuality of the Bible. Some verses and sub-verses appear in more than one place in the Bible, so it is not always clear to which verse a given citation should be attributed. An example of this phenomenon is the famous idiom “hlas volajícího na poušti” (“The voice of one crying in the wilderness”), found in several parts of the Bible (Matt 3:3, John 1:23 and Mark 1:3 refer to Isa 40:3).

The Question of Periodicals

But it is not just the biblical text that may be problematic. The periodicals in which we search for the verses also have their shortcomings. The first of these is their availability, which limits our choice. The second and most important problem is the quality of the OCR. This obstacle sometimes makes working with the data completely impossible. Fortunately, in most cases the OCR quality is sufficient to detect the citations within a certain tolerance.

""" An example of a completely misrecognised text. """
bad_ocr_example = ':1$yé^7nisn^rátu>" .tam\'.^hilcòlnr\' v\'mmt^ferjèh^Vojenskí; ^viětíek, pochybu je, д^е Iry pbíipsl tra ^гогкаг!е vlc/zastřelení j rukojmí řgeUibferai Rozkaž\'byl tak‘e|z!r»ičeň, ^zaie^oyalo •*Se jen usnesení\' 1.; jptíníU, žiSající i zaStřek: ní rukojmí, к nětmuž^Eglhofer .poznamenal: “Jsem; srbz(Шпеп.'

""" An example of an imperfectly recognised text that contains a biblical citation. """
poor_ocr_with_bible_quote = '“Mrlovati budeš bližního svého jako sebe samého”, je teprve druhé ze dvou přikázání lásky, podobné prvému :“Мдlovati budeš Pána Boha svého ... nade všecko.” Proč dnešní láska blíženecká je namnoze tak neplodná, pouhá slova?'

The Process

To overcome the aforementioned limitations, we divided the search into several steps. Below, the search process is demonstrated in one particular case. You can find the detailed process, including the code, on the project's GitHub, especially in the Jupyter Notebook describing the search process. The whole project was built in Python 3.9.9, using the external packages imported below.

import pandas as pd
import os
import joblib

from os import listdir as os_listdir
from os.path import isdir as os_path_isdir
from os.path import exists as os_exists
from os import remove as os_remove
from json import load as json_load
from re import sub as re_sub
from re import split as re_split
from os.path import join as join_path
from unidecode import unidecode
from time import time
from Levenshtein import distance
from gensim import corpora
from collections import defaultdict
from nltk import word_tokenize, sent_tokenize
from math import ceil, isnan

Biblical Dataset Preparation

The first step is to create a biblical dataset, which then allows searching using arbitrary data. The biblical dataset contains all the verses in all the translations available. The creation of the dataset is illustrated below using Luke 20:25/Mark 12:17 in Ladislav Sýkora’s translation:

Ježíš pak řekl jim: „Dávejte tedy, co je císařovo, císaři; a co je Božího, Bohu!“ I divili se mu.
And Jesus answered and said to them, Render to Caesar the things that are Caesar's, and to God the things that are God's. And they marveled at Him.

Each verse is first divided into sub-verses:

verse_text = 'Ježíš pak řekl jim: „Dávejte tedy, co jest císařovo, císaři, a co jest Božího, Bohu.” I divili se mu.'
subverses = ['Ježíš pak řekl jim Dávejte tedy',
        'Dávejte tedy co jest císařovo',
        'co jest císařovo císaři',
        'císaři a co jest Božího',
        'a co jest Božího Bohu.” I divili se mu.',
        'Bohu.” I divili se mu.']
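As a rough illustration of this splitting, overlapping word windows can be produced as follows. The window size is a hypothetical parameter here, and the project's actual splitting (described in the Jupyter Notebook) is punctuation-aware rather than fixed-width:

```python
def split_into_subverses(verse, size=5):
    """Overlapping windows of `size` words (a hypothetical parameter).
    The project's actual splitting is punctuation-aware; see the Jupyter
    Notebook for the real implementation."""
    words = verse.split()
    step = max(size // 2, 1)
    parts = []
    for i in range(0, len(words), step):
        parts.append(' '.join(words[i:i + size]))
        if i + size >= len(words):
            break
    return parts
```

The overlap between consecutive windows mirrors how the sample sub-verses above share words at their boundaries.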

Each of these sub-verses is then divided into character n-grams of at most four characters. The choice of four characters reflects the results of many experiments. This division of the sub-verse into smaller parts mitigates the OCR problems mentioned above. You may notice that some words are marked as stop-words, i.e. words that are ignored in the search – in this case, for example, the words “and,” “also” and “with.”
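As an illustration, this n-gramming step might be sketched as below. The stop-word list here is a placeholder (the project maintains its own Czech list), and the fallback diacritics-stripper merely stands in for the unidecode package listed in the imports above:

```python
import re
import unicodedata

try:
    from unidecode import unidecode  # used in the project's imports above
except ImportError:
    def unidecode(text):
        # Minimal stand-in: strip diacritics via Unicode decomposition.
        return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode()

# Hypothetical stop-word list; the project maintains its own Czech list.
STOPWORDS = {'a', 'i', 's', 'pak'}

def char_ngrams(subverse, n=4):
    """Strip diacritics and punctuation, drop stop-words, and split each
    remaining word into character n-grams of at most `n` characters."""
    text = re.sub(r'[^\w\s]', '', unidecode(subverse).lower())
    grams = []
    for word in text.split():
        if word in STOPWORDS:
            continue
        if len(word) <= n:
            grams.append(word)  # short words are kept whole
        else:
            grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams

char_ngrams('Ježíš pak řekl jim Dávejte tedy')
# → ['jezi', 'ezis', 'rekl', 'jim', 'dave', 'avej', 'vejt', 'ejte', 'tedy']
```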

subverse_1 = ['jezi', 'ezis', 'rekl', 'jim', 'dave', 'avej', 'vejt', 'ejte', 'tedy']
subverse_2 = ['dave', 'avej', 'vejt', 'ejte', 'tedy', 'co', 'jest', 'cisa', 'isar', 'saro', 'arov', 'rovo'] 
subverse_3 = ['co', 'jest', 'cisa', 'isar', 'saro', 'arov', 'rovo', 'cisa', 'isar', 'sari']
subverse_4 = ['cisa', 'isar', 'sari', 'co', 'jest', 'bozi', 'ozih', 'ziho'] 
subverse_5 = ['co', 'jest', 'bozi', 'ozih', 'ziho', 'bohu', 'divi', 'ivil', 'vili', 'mu'] 
subverse_6 = ['bohu', 'divi', 'ivil', 'vili', 'mu']

Each n-grammed sub-verse is then converted into a vector that records the number of occurrences of each n-gram in the sub-verse. The n-grams are thus replaced by numbers, and a dictionary is gradually built that preserves the assignment of n-grams to numbers. The Python package gensim was used for this purpose.
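A minimal plain-Python sketch of the dictionary and vectorisation step (the project uses gensim's corpora.Dictionary and doc2bow; gensim builds its dictionary over the whole corpus, which is why the ids in the examples on this page differ from the simple order-of-first-appearance ids produced here):

```python
from collections import Counter

def build_vocab(subverse_ngrams):
    """Assign an integer id to each n-gram in order of first appearance
    (the project uses gensim's corpora.Dictionary for this role)."""
    vocab = {}
    for grams in subverse_ngrams:
        for gram in grams:
            vocab.setdefault(gram, len(vocab))
    return vocab

def to_bow(grams, vocab):
    """Bag-of-words vector: sorted (n-gram id, count) pairs, analogous to
    gensim's doc2bow."""
    counts = Counter(vocab[gram] for gram in grams if gram in vocab)
    return tuple(sorted(counts.items()))

vocab = build_vocab([['dave', 'avej', 'tedy'], ['tedy', 'co', 'jest']])
to_bow(['co', 'jest', 'co'], vocab)  # → ((3, 2), (4, 1))
```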

subverse_1 = ((10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (21, 1), (22, 1), (23, 1), (24, 1))
subverse_2 = ((0, 1), (1, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1))
subverse_3 = ((0, 1), (1, 1), (15, 2), (16, 2), (17, 1), (18, 1), (19, 1), (20, 1))
subverse_4 = ((0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (15, 1), (16, 1), (20, 1))
subverse_5 = ((0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1))
subverse_6 = ((5, 1), (6, 1), (7, 1), (8, 1), (9, 1))

This procedure produces a complete dataset that preserves the assignment of verse labels (e.g., Luke 20:25/Mark 12:17) to all vectors that correspond to individual sub-verses across translations.

Searching in the Press

Once we have created a biblical dataset, we can start searching arbitrary data – in this case, the First Republic press. We first divide each document into smaller parts. Through experimentation, we arrived at parts of six sentences each, always with at least one sentence of overlap (in case a quoted passage spans two sentences and falls on the boundary between two parts of the searched document).
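This splitting can be sketched as follows; the project tokenises sentences with nltk's sent_tokenize, for which a naive regex split stands in here:

```python
import re

def split_document(text, part_len=6, overlap=1):
    """Split a document into overlapping parts of `part_len` sentences.
    The project tokenises sentences with nltk's sent_tokenize; a naive
    regex split stands in for it here."""
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text) if s]
    step = part_len - overlap
    parts = []
    for i in range(0, len(sentences), step):
        parts.append(' '.join(sentences[i:i + part_len]))
        if i + part_len >= len(sentences):
            break
    return parts
```

With the project's parameters (six sentences, one overlapping), consecutive parts always share their boundary sentence.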

Each of these smaller parts of the document is then divided into n-grams and vectorised using the same dictionary that was created while building the biblical dataset. The resulting vector is then compared with the vectors of the individual sub-verses. If at least 70% of a sub-verse's n-grams appear in the extract, a preliminary match is recorded.
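The 70% rule can be sketched as a coverage measure over the (id, count) pairs; the two vectors used here are taken from the worked example shown below:

```python
def subverse_coverage(subverse_bow, passage_bow):
    """Fraction of the sub-verse's n-gram occurrences that also occur in the
    vectorised passage; a preliminary match is recorded at 70% or more."""
    passage = dict(passage_bow)
    total = sum(count for _, count in subverse_bow)
    matched = sum(min(count, passage.get(ngram_id, 0))
                  for ngram_id, count in subverse_bow)
    return matched / total

# Vectors from the worked example ("Dávejte tedy co jest císařovo"):
subverse_vector = ((0, 1), (1, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1),
                   (15, 1), (16, 1), (17, 1), (18, 1), (19, 1))
passage_vector = ((0, 2), (1, 2), (2, 1), (3, 1), (4, 1), (5, 1), (10, 1),
                  (11, 1), (12, 1), (13, 1), (15, 2), (16, 2), (17, 1),
                  (18, 1), (19, 1), (20, 1), (23, 1))

subverse_coverage(subverse_vector, passage_vector)  # → 11/12 ≈ 0.9166
```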

""" An example of an excerpt from the press that includes a biblical citation. """
journal_passage = 'Věrnost církvi, věrnost vlasti, věrnost národa. Sám Kristus řekl určitě a jasné: Dávejte co jest císařovo císaři a co jest Božího Bohu. Jiný apoštolský výrok: Není mocnosti leč od Boha a které jsou, od Boha zřízeny jsou. Stát bez Boba byí by na chybném podkladu.'

""" vectorised part of an excerpt containing a biblical citation """
vectorized_passage = ((0, 2), (1, 2), (2, 1), (3, 1), (4, 1), (5, 1), (10, 1), (11, 1), (12, 1), (13, 1), (15, 2), (16, 2), (17, 1), (18, 1), (19, 1), (20, 1), (23, 1))

""" a comparison of the sub-verse and the excerpt: """
subverse_text = 'Dávejte tedy co jest císařovo'
subverse_vector = ((0, 1), (1, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1))

overlap = ((0, 2), (1, 2), (10, 1), (11, 1), (12, 1), (13, 1), (15, 2), (16, 2), (17, 1), (18, 1), (19, 1))

""" the only word from the sub-verse that was not detected is "tedy" ("therefore") """
unmatched = ((14, 1),)

""" Here, the overlap is 91.66%. """
hit = 0.9166

Subsequently, each sub-verse preliminarily detected in this way is individually checked for its actual presence in the text. Because of potential issues with the OCR and the limitations of the available Bible translations, a certain level of tolerance is set to allow a match to be found even if the verse does not appear in the exact wording. For this purpose, the Levenshtein distance algorithm was used. In short, this ‘distance’ indicates how many characters have to be changed to make two strings identical. Through experimentation, we concluded that at least 85% of the total length of the substring should be identical. Below you can see the correspondence of the individual sub-verses in the sample passage.

""" sample of the Levenshtein distance measurement """
from Levenshtein import distance

subverse_string = 'davejte tedy co jest cisarovo'
query_string = 'davejte co jest cisarovo cisa'

tolerance = len(subverse_string)*(1-0.85)

if distance(subverse_string, query_string) <= tolerance:
    match = True

In this case, the word “tedy” (“therefore”) prevents a match: not only because of the word itself, but also because its presence extends the compared passage by the string “cisa.” Thus, a total of 10 characters must be modified to make the compared strings identical, and the match is therefore only 65.5%. In the case of the verse under comparison, however, the sub-verses “co jest císařovo císaři” and “císaři a co jest Božího” appear in full in the extract searched.
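For illustration, a plain-Python implementation of the Levenshtein distance confirms the 10-edit figure (the project itself uses the Levenshtein package imported above):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insertions, deletions
    and substitutions each cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

levenshtein('davejte tedy co jest cisarovo',
            'davejte co jest cisarovo cisa')  # → 10
```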

This sample also demonstrates the issue of dividing verses into sub-verses. In many cases, this division does not make much sense, but it turns out to be quite useful in practice. A more ‘intelligent’ division of the verses would require a more complex approach. For more information on how the verse splitting is done, see the Jupyter Notebook describing the search process.

Result Cleaning

Citations found using the procedure described above are fundamentally imperfect. Set tolerances that allow citations to be found even in poorly recognised text lead to many false positives. In addition, many verses and sub-verses have been detected that, while matching the text, are not really biblical citations. Therefore, the preliminary results must still be cleaned. The entire process of cleaning the results, which includes a preliminary evaluation of overall compliance, can be explored in the Jupyter Notebook, which describes the evaluation. In addition to filtering out duplicate results, the following two critical steps must be taken:

  1. Filtering the lone ‘stop-subverses.’ Out of the detected results, those sub-verses are selected that have no clear biblical relevance on their own. These may include citations of an agricultural/pastoral nature, e.g. Num 32:4 “máme velmi mnoho dobytka” (“we have a lot of cattle”), references to law and justice, e.g. Heb 11:33 “vykonali spravedlnost” (“worked righteousness”), or other general verses. However, during the selection, it is good to remember that some verses that look ordinary may in fact be biblical citations, so there are obviously some losses when using this filtering method. Stop-subverses may nevertheless also include those sub-verses which, while having a clear biblical connotation, are not biblical quotations, but common phrases within a Christian setting, such as “a Pána Ježíše Krista” (“and The Lord Jesus Christ”). A total of 2,692 stop-subverses were identified.
    In addition to the stop-subverses, some verses were also selected for which 100% agreement is required for them to be recognised as a result. Tolerating typos in verses such as “Nezabiješ” (“Thou shalt not kill”) results in too many false positives.
  2. Dealing with Multiple Attributions. As mentioned above, some verses are duplicated or have a very similar meaning. Generally, the citation is attributed to the verse with the better match. When the matches are equally complete, the citation is marked as a duplicate and all matching verses are retained.
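A minimal sketch of the stop-subverse filtering described in step 1, with hypothetical field names (the project's actual result tables and cleaning notebook differ):

```python
def filter_results(results, stop_subverses, exact_only):
    """Drop lone stop-subverses, and keep verses on the exact-match list
    only when the match is 100%. The dict keys here are hypothetical."""
    kept = []
    for hit in results:
        if hit['subverse'] in stop_subverses:
            continue
        if hit['subverse'] in exact_only and hit['match'] < 1.0:
            continue
        kept.append(hit)
    return kept
```

For example, with “Nezabiješ” on the exact-match list, a 90% match to it is discarded while a 100% match is kept.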

Results filtered in this way can be subjected to further manual checking or additional filters. For example, it is possible to select only those citations that contain the entire verse and are therefore highly likely to be genuine quotes (‘sure citations’ depicted in purple in the chart below). While this significantly reduces the number of false positive citations, of course many less clear citations are lost.

In the following chart, you can see how the total number of citations per year evolves after each cleaning step. The data presented in the charts across this project are based on the filtered multiple attributions shown in orange. These results may still include a certain number of false positives, but at the same time, the filtering does not introduce an unnecessary number of false negatives.

You can find all the results here, especially in this CSV file.

Limitations

The entire procedure is designed to allow researchers to set different tolerance levels at their discretion, tailored to the corpus being searched. The solution proposed here is still fairly imperfect and can only serve for the pre-selection of excerpts which the researcher can subsequently study in detail. Statistical outputs should be perceived as indicative only.

Possible Improvements

Given the imperfections of the proposed solution, it is of course possible to continue working on it and make improvements. The first area of improvement could be the speed of the whole process. You can see summaries of the time needed for searching below. During the creation of this workflow, significant speed improvements have already been made.

Faster results can be achieved in several ways: by improving the script itself, porting it to another programming language (Rust was among those recommended), or using more computationally powerful machines.

That being said, it is more important for development to focus on improving search accuracy. One possible route is using machine learning (ML) methods, but this assumes that a clean dataset is already available for training purposes. Thus, the solution proposed here can provide preliminary data which, when cleaned, can be used as input for ML solutions.

It may be simpler to detect the occurrence of keywords around the found citation. If words such as “Jesus”, “Bible” and the like occur in the vicinity, the chance that it is a biblical citation is much higher. However, this approach is of course of limited predictive value, and is also limited by the current state of the OCR of searched documents, see above.
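A simple sketch of such a keyword check; the keyword list is purely illustrative:

```python
# Illustrative keyword list; a production list would need careful curation.
KEYWORDS = {'ježíš', 'kristus', 'bible', 'evangelium', 'písmo'}

def has_religious_context(text, start, end, window=200):
    """True if any keyword occurs within `window` characters of a detected
    citation located at text[start:end]."""
    vicinity = text[max(start - window, 0):end + window].lower()
    return any(keyword in vicinity for keyword in KEYWORDS)
```

As the surrounding text notes, such a check raises confidence in a detection but cannot confirm it on its own, and it is itself vulnerable to OCR errors in the vicinity of the citation.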

Trying to improve the searched dataset, in this case the First Republic press, may also be worth considering. Newer and more accurate OCR programmes can be applied to the searched dataset; it is also possible to try to improve its existing state using correctors such as the LINDAT Corrector.

Time Requirements

This chart shows the average time (in seconds) necessary to process each page (pagination according to the periodical, not standard pages). The results mainly reflect the total length of the analysed pages (e.g. the periodical Venkov has the longest analysis times primarily because it has the longest pages), but the expected number of detected citations is also reflected – the processing time increases substantially especially when verifying the Levenshtein distance (see above). During the creation of the script, the required times were gradually improved (the times shown here refer to the latest version published on GitHub). For the latest version, the total search time was approximately 220.5 hours.

Additional time was necessary for subsequent data cleaning, where most of the time was devoted to the (manual) selection of stop-subverses.

This audiovisual map has been created as part of a project of the programme to support applied research and development in the field of national and cultural identity (NAKI II, Ministry of Culture of the Czech Republic) No. DG20P02OVV002 entitled ‘DL4DH – developing tools for the effective utilisation and mining of data from digital libraries to reinforce digital humanities research’.
Recommended citation format: Válek, František; Vozár, Zdenko; Zbíral, David; Bežová, Michaela; Hrzinová, Jana; Novák, David; 2022. A Religious Studies Map of Literary Meanings: Biblical Citations in the Press during the First Republic [online].