Biblical Dataset Preparation
The first step is to build a biblical dataset against which arbitrary text can later be searched. The dataset contains every verse in every available translation. Its construction is illustrated below using Luke 20:25/Mark 12:17 in Ladislav Sýkora’s translation:
“Ježíš pak řekl jim: „Dávejte tedy, co je císařovo, císaři; a co je Božího, Bohu!“ I divili se mu.”
“And Jesus answered and said to them, Render to Caesar the things that are Caesar's, and to God the things that are God's. And they marveled at Him.”
Each verse is first divided into sub-verses:
verse_text = 'Ježíš pak řekl jim: ,Dávejte tedy, co jest císařovo, císaři, a co jest Božího, Bohu.” I divili se mu.'
subverses = ['Ježíš pak řekl jim Dávejte tedy',
'Dávejte tedy co jest císařovo',
'co jest císařovo císaři',
'císaři a co jest Božího',
'a co jest Božího Bohu.” I divili se mu.',
'Bohu.” I divili se mu.']
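The splitting rule itself is not spelled out here, so the following is only a minimal sketch of one rule that reproduces the example: the verse is cut at commas, colons and semicolons, and each resulting segment is joined with its successor so that neighbouring sub-verses overlap. The function name `split_into_subverses` and the choice of delimiters are assumptions made for illustration.

```python
import re

def split_into_subverses(verse_text):
    """Cut a verse at commas, colons and semicolons, then join neighbouring segments."""
    # Assumption: these three delimiters reproduce the example; the real rule may differ.
    segments = [part.strip() for part in re.split(r"[,:;]", verse_text)]
    segments = [part for part in segments if part]  # drop empty pieces
    # Pair each segment with its successor so that neighbouring sub-verses overlap.
    subverses = [f"{left} {right}" for left, right in zip(segments, segments[1:])]
    if segments:
        subverses.append(segments[-1])  # the final segment also stands alone
    return subverses

verse_text = 'Ježíš pak řekl jim: ,Dávejte tedy, co jest císařovo, císaři, a co jest Božího, Bohu.” I divili se mu.'
subverses = split_into_subverses(verse_text)
for subverse in subverses:
    print(subverse)
```

Because the segments overlap, a quotation that straddles a punctuation mark should still fall entirely inside at least one sub-verse.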
Each of these sub-verses is then divided into n-grams of at most four characters. The choice of four characters reflects the results of many experiments. Splitting the sub-verse into smaller parts guards against the OCR problems mentioned above. As the listing below shows, the text is also lowercased and stripped of diacritics and punctuation, and words shorter than four characters are kept whole. Note that some words are treated as stop-words, i.e. words that are ignored in the search. In this case, these are, for example, the words “and,” “also” and “with.”
subverse_1 = ['jezi', 'ezis', 'rekl', 'jim', 'dave', 'avej', 'vejt', 'ejte', 'tedy']
subverse_2 = ['dave', 'avej', 'vejt', 'ejte', 'tedy', 'co', 'jest', 'cisa', 'isar', 'saro', 'arov', 'rovo']
subverse_3 = ['co', 'jest', 'cisa', 'isar', 'saro', 'arov', 'rovo', 'cisa', 'isar', 'sari']
subverse_4 = ['cisa', 'isar', 'sari', 'co', 'jest', 'bozi', 'ozih', 'ziho']
subverse_5 = ['co', 'jest', 'bozi', 'ozih', 'ziho', 'bohu', 'divi', 'ivil', 'vili', 'mu']
subverse_6 = ['bohu', 'divi', 'ivil', 'vili', 'mu']
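The n-gram step can be sketched as follows. This is an illustration inferred from the listing above rather than the original implementation: the helper names are hypothetical, and the `STOP_WORDS` set contains only the words that are visibly dropped in this one example, not the full stop-word list.

```python
import re
import unicodedata

# Hypothetical stop-word set: only the words dropped in the example above.
STOP_WORDS = {"a", "i", "pak", "se"}

def strip_diacritics(text):
    """Remove combining marks, e.g. 'ž' becomes 'z' and 'í' becomes 'i'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def word_ngrams(word, n=4):
    """Sliding character n-grams of one word; words of n or fewer characters stay whole."""
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def subverse_to_ngrams(subverse, n=4):
    """Lowercase, strip diacritics and punctuation, drop stop-words, emit n-grams."""
    normalized = strip_diacritics(subverse.lower())
    # After stripping diacritics, Czech words consist of plain a-z letters.
    words = re.findall(r"[a-z]+", normalized)
    return [gram for word in words if word not in STOP_WORDS
            for gram in word_ngrams(word, n)]

ngrams = subverse_to_ngrams('Ježíš pak řekl jim Dávejte tedy')
print(ngrams)
```

Since the n-grams are taken inside each word, a single misread character affects at most four n-grams of that word and leaves the rest of the sub-verse intact.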
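Each n-grammed sub-verse is then converted into a vector that records how many times each n-gram occurs in it. To this end, the n-grams are replaced by numeric ids, and a dictionary that preserves the mapping between n-grams and ids is built up gradually. Every pair below therefore gives an n-gram id followed by its count; the counts of 2 in subverse_3, for instance, correspond to 'cisa' and 'isar', which each occur twice there. The Python package gensim was used for this purpose.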
subverse_1 = ((10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (21, 1), (22, 1), (23, 1), (24, 1))
subverse_2 = ((0, 1), (1, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1))
subverse_3 = ((0, 1), (1, 1), (15, 2), (16, 2), (17, 1), (18, 1), (19, 1), (20, 1))
subverse_4 = ((0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (15, 1), (16, 1), (20, 1))
subverse_5 = ((0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1))
subverse_6 = ((5, 1), (6, 1), (7, 1), (8, 1), (9, 1))
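The conversion can be illustrated with a dependency-free sketch. gensim offers this mapping through `corpora.Dictionary` and its `doc2bow` method; the small `NgramDictionary` class below is a hypothetical stand-in written only to show the idea, so the ids it assigns need not match the ones listed above.

```python
from collections import Counter

class NgramDictionary:
    """Hypothetical stand-in for a dictionary that grows as new n-grams appear."""

    def __init__(self):
        self.token2id = {}

    def doc2bow(self, ngrams):
        """Return sorted (id, count) pairs, assigning fresh ids to unseen n-grams."""
        counts = Counter(ngrams)
        for token in counts:
            if token not in self.token2id:
                self.token2id[token] = len(self.token2id)
        return sorted((self.token2id[token], count) for token, count in counts.items())

dictionary = NgramDictionary()
subverse_3 = ['co', 'jest', 'cisa', 'isar', 'saro', 'arov', 'rovo', 'cisa', 'isar', 'sari']
subverse_6 = ['bohu', 'divi', 'ivil', 'vili', 'mu']
bow_3 = dictionary.doc2bow(subverse_3)
bow_6 = dictionary.doc2bow(subverse_6)
print(bow_3)
print(bow_6)
```

Only n-grams that actually occur are stored, so each vector stays short even when the dictionary grows to cover every translation.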
This procedure produces a complete dataset that preserves the assignment of verse labels (e.g., Luke 20:25/Mark 12:17) to all vectors that correspond to individual sub-verses across translations.
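One possible way to keep the link between labels and vectors is sketched below. The record layout, the field names and the `add_verse` helper are assumptions chosen for illustration; the vectors are two of those listed above.

```python
# Hypothetical layout: one record per sub-verse, carrying the verse label,
# the translation name and the sparse (id, count) vector.
dataset = []

def add_verse(dataset, verse_label, translation, subverse_vectors):
    """Append one record per sub-verse vector, all sharing the same label."""
    for vector in subverse_vectors:
        dataset.append({
            "label": verse_label,
            "translation": translation,
            "vector": vector,
        })

add_verse(dataset, "Luke 20:25/Mark 12:17", "Sýkora", [
    ((0, 1), (1, 1), (15, 2), (16, 2), (17, 1), (18, 1), (19, 1), (20, 1)),
    ((5, 1), (6, 1), (7, 1), (8, 1), (9, 1)),
])

# Any vector found by a later search can be traced back to its verse label.
print(dataset[1]["label"])
```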