The application can use three sources of words to generate crosswords:
1. SQLite databases
2. Text files
3. Simple in-memory lists
As you will see in the following sections, these source types can be used in combination, with as many sources in each category as you like.
The most efficient and recommended source type is SQLite databases. This type has a number of advantages:
- searching is fast and flexible, relying on the SQLite database engine, which is designed for exactly this purpose
- the database engine can handle very large datasets with millions of entries without noticeable memory or performance overhead
- an SQLite database is contained in a single *.db file
- no additional software or drivers need to be installed; the SQLite driver is built into Python
- databases can be edited to add, change or remove individual words using the built-in Word Source Manager or an external tool (such as the free DB Browser for SQLite)
- you can also operate the databases directly using SQL queries (in fact, that's what pycrossword does under the hood)
The internal structure of database sources is shown in the figure below.
Fig. 3.9.1.1. Internal structure of an SQLite database word source
As you can see, every database contains two tables:
1. Parts of speech (tpos)
2. Words (twords)
The parts-of-speech table is simply a reference of all parts of speech that a word may belong to. A typical tpos table doesn't depend on the language and contains the common parts of speech (their abbreviated and full names):
| id | pos | posdesc |
|----|-----|---------|
| 1 | N | noun |
| 2 | V | verb |
| 3 | ADV | adverb |
| 4 | ADJ | adjective |
| 5 | P | participle |
| 6 | PRON | pronoun |
| 7 | I | interjection |
| 8 | C | conjunction |
| 9 | PREP | preposition |
| 10 | PROP | proposition |
| 11 | MISC | miscellaneous / other |
| 12 | NONE | no POS |
The words table (twords) is the main dataset used to search for words. It contains words and their part-of-speech reference (link to the tpos table). An example extract from this table is shown below:
| id | word | idpos |
|----|------|-------|
| 84043 | capsizal | 1 |
| 84044 | capsize | 2 |
| 84045 | capsized | 4 |
| 84046 | capsomer | 1 |
| 84047 | capsomere | 1 |
| 84048 | capstan | 1 |
The rightmost field in this table (idpos) is internally linked to the unique id field of the part-of-speech table (tpos). In this extract, that link identifies four nouns, one verb and one adjective.
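The two-table layout described above can be tried out with Python's built-in sqlite3 module. The following is a minimal sketch, not pycrossword's actual code: the table and column names come from the description above, while the column types, the foreign-key reference, the helper functions (`build_demo_db`, `count_by_pos`, `find_words`) and the use of `LIKE` with `_` as a single-letter wildcard are assumptions made for illustration.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with the tpos / twords layout (column types are assumed)."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE tpos (
            id INTEGER PRIMARY KEY,
            pos TEXT NOT NULL,
            posdesc TEXT
        );
        CREATE TABLE twords (
            id INTEGER PRIMARY KEY,
            word TEXT NOT NULL,
            idpos INTEGER REFERENCES tpos(id)
        );
    """)
    # Only the parts of speech needed by the sample words are inserted here.
    con.executemany(
        "INSERT INTO tpos (id, pos, posdesc) VALUES (?, ?, ?)",
        [(1, "N", "noun"), (2, "V", "verb"), (4, "ADJ", "adjective")],
    )
    con.executemany(
        "INSERT INTO twords (id, word, idpos) VALUES (?, ?, ?)",
        [(84043, "capsizal", 1), (84044, "capsize", 2), (84045, "capsized", 4),
         (84046, "capsomer", 1), (84047, "capsomere", 1), (84048, "capstan", 1)],
    )
    con.commit()
    return con


def count_by_pos(con):
    """Count words per part of speech by joining twords.idpos to tpos.id."""
    rows = con.execute("""
        SELECT tpos.posdesc, COUNT(*)
        FROM twords JOIN tpos ON twords.idpos = tpos.id
        GROUP BY tpos.posdesc
    """).fetchall()
    return dict(rows)


def find_words(con, pattern, pos=None):
    """Return words matching an SQL LIKE pattern, optionally limited to one part of speech."""
    sql = ("SELECT twords.word FROM twords "
           "JOIN tpos ON twords.idpos = tpos.id "
           "WHERE twords.word LIKE ?")
    params = [pattern]
    if pos is not None:
        sql += " AND tpos.pos = ?"
        params.append(pos)
    sql += " ORDER BY twords.word"
    return [row[0] for row in con.execute(sql, params)]


con = build_demo_db()
print(count_by_pos(con))              # word counts keyed by posdesc
print(find_words(con, "caps____"))    # eight-letter words starting with "caps"
print(find_words(con, "caps%", "V"))  # verbs starting with "caps"
```

The join in `count_by_pos` is the same link between idpos and the tpos id field that the extract above relies on; a real word source would of course be opened from a *.db file rather than built in memory.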
Read further in Database Sources to learn how you can easily populate SQLite databases from publicly available word lists, such as spell checker dictionaries.
Simple text files can also be used as word sources. Such a text file contains a list of words with or without part-of-speech data, where each word occupies one line. Below is an example extract from a text file containing Spanish words:
abarrar
abarrederas
abarrenar
abarrocado/GS
abarrotadamente
abarrotamiento/S
abarrotar/RED
abarse
abasí/S
abastardar/RED
As the extract shows, some words carry part-of-speech data (the marker after the slash, such as /GS or /RED), while others don't. This is not a problem for pycrossword: the application stores unmarked words with a special NONE part-of-speech attribute.
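The handling of marked and unmarked lines can be sketched as follows. This is only an illustration, not pycrossword's actual import code: the `parse_line` and `parse_lines` helpers are hypothetical, and they assume that the part-of-speech marker is whatever follows the first slash on a line and that unmarked words receive the NONE marker.

```python
def parse_line(line, delimiter="/", default_pos="NONE"):
    """Split one line into (word, pos); the assumed format is 'word' or 'word/POS'."""
    text = line.strip()
    if not text:
        return None  # skip blank lines
    word, sep, pos = text.partition(delimiter)
    if not word:
        return None  # skip lines that have a marker but no word
    # Fall back to the default marker when no delimiter or no marker is present.
    return (word, pos if sep and pos else default_pos)


def parse_lines(lines):
    """Parse an iterable of lines, dropping the ones that yield no word."""
    parsed = (parse_line(line) for line in lines)
    return [entry for entry in parsed if entry is not None]


sample = ["abarrar", "abarrocado/GS", "abasí/S", "abastardar/RED", ""]
for word, pos in parse_lines(sample):
    print(word, pos)
```

To read an actual file, the same function could be applied to each line of a file opened with an explicit encoding (for example UTF-8, so that words such as abasí are read correctly).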
Lists of words (with or without part-of-speech data) can be fed into pycrossword directly, without importing from a file. With simple lists, you can also state once, for the whole list, whether the words carry part-of-speech data; this saves memory.
Simple lists are best suited to small word sets (a word list should contain fewer than 10,000 entries to avoid bloating memory).
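To illustrate why simple lists suit small word sets, here is a minimal sketch of an in-memory source with a single part-of-speech flag for the whole list. The `ListSource` class, its parameters and the `_` blank character are hypothetical and do not reflect pycrossword's actual API. Every search scans the whole list, which is acceptable for a few thousand words but scales poorly compared with a database.

```python
import re


class ListSource:
    """Hypothetical in-memory word source (not pycrossword's actual class)."""

    def __init__(self, words, has_pos=False, delimiter="/"):
        self.words = list(words)
        self.has_pos = has_pos      # one flag for the whole list, not one per word
        self.delimiter = delimiter

    def entries(self):
        """Yield (word, pos) pairs; words without a marker get 'NONE'."""
        for item in self.words:
            if self.has_pos:
                word, _, pos = item.partition(self.delimiter)
                yield word, pos or "NONE"
            else:
                yield item, "NONE"

    def find(self, pattern, blank="_"):
        """Return words matching a pattern in which `blank` stands for any one letter."""
        regex = re.compile(
            "".join("." if ch == blank else re.escape(ch) for ch in pattern),
            re.IGNORECASE,
        )
        # Linear scan: every entry is tested against the pattern.
        return [word for word, _ in self.entries() if regex.fullmatch(word)]


source = ListSource(["capsize/V", "capstan/N", "capsized/ADJ", "cat"], has_pos=True)
print(source.find("caps___"))  # seven-letter words starting with "caps"
```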