pycrossword
User Guide

3.9.1. Word source types

 
The application can use three sources of words to generate crosswords:
 
1. SQLite databases
2. Text files
3. Simple in-memory lists
 
As you will see in the following sections, these source types can be used in combination, with as many sources in each category as you like.
 

Database Sources

 
SQLite databases are the most efficient and the recommended source type. This type has a number of advantages:
 
 
The internal structure of database sources is shown in the figure below.
Fig. 3.9.1.1. Internal structure of an SQLite database word source
 
As you can see, every database contains two tables:
 
1. Parts of speech (tpos)
2. Words (twords)
 
The parts-of-speech table is simply a reference of all parts of speech that a word may belong to. A typical tpos table doesn't depend on the language and contains the common parts of speech (their abbreviated and full names):
 
id | pos  | posdesc
---|------|----------------------
1  | N    | noun
2  | V    | verb
3  | ADV  | adverb
4  | ADJ  | adjective
5  | P    | participle
6  | PRON | pronoun
7  | I    | interjection
8  | C    | conjunction
9  | PREP | preposition
10 | PROP | proposition
11 | MISC | miscellaneous / other
12 | NONE | no POS
 
The words table (twords) is the main dataset used to search for words. It contains words and their part-of-speech reference (link to the tpos table). An example extract from this table is shown below:
 
id    | word      | idpos
------|-----------|------
84043 | capsizal  | 1
84044 | capsize   | 2
84045 | capsized  | 4
84046 | capsomer  | 1
84047 | capsomere | 1
84048 | capstan   | 1
 
The rightmost field in this table (idpos) references the unique id field of the parts-of-speech table (tpos). This extract therefore contains four nouns (idpos 1), one verb (idpos 2), and one adjective (idpos 4).
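
The link between the two tables can be illustrated with a minimal sketch using Python's standard sqlite3 module. The table and column names follow the tables above, but the column types, the constraints, and the helper function words_by_pos are assumptions made for this example and are not taken from pycrossword itself.

```python
import sqlite3

# Hypothetical schema: names follow the tables shown above,
# while the column types and constraints are assumptions.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE tpos (
        id      INTEGER PRIMARY KEY,
        pos     TEXT NOT NULL UNIQUE,
        posdesc TEXT
    );
    CREATE TABLE twords (
        id    INTEGER PRIMARY KEY,
        word  TEXT NOT NULL,
        idpos INTEGER NOT NULL REFERENCES tpos(id)
    );
""")

# A few rows copied from the example tables above.
con.executemany(
    "INSERT INTO tpos VALUES (?, ?, ?)",
    [(1, "N", "noun"), (2, "V", "verb"), (4, "ADJ", "adjective")],
)
con.executemany(
    "INSERT INTO twords VALUES (?, ?, ?)",
    [
        (84043, "capsizal", 1), (84044, "capsize", 2), (84045, "capsized", 4),
        (84046, "capsomer", 1), (84047, "capsomere", 1), (84048, "capstan", 1),
    ],
)

def words_by_pos(connection, pos):
    """Return the words whose part of speech has the given abbreviation."""
    rows = connection.execute(
        "SELECT w.word FROM twords AS w "
        "JOIN tpos AS p ON p.id = w.idpos "
        "WHERE p.pos = ? ORDER BY w.id",
        (pos,),
    )
    return [row[0] for row in rows]

nouns = words_by_pos(con, "N")
```

Here the join on idpos resolves each word's part of speech through tpos, so a query can filter by the abbreviation (for example N) instead of the numeric id.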
 
Read further in Database Sources to learn how you can easily populate SQLite databases from publicly available word lists, such as spell checker dictionaries.
 

Text Files

 
Simple text files can also be used as word sources. Such a text file contains a list of words with or without part-of-speech data, where each word occupies one line. Below is an example extract from a text file containing Spanish words:
 
abarrar
abarrederas
abarrenar
abarrocado/GS
abarrotadamente
abarrotamiento/S
abarrotar/RED
abarse
abasí/S
abastardar/RED
 
You can see above that some words carry part-of-speech data (the tag after the slash, marked in red in the original), while others do not. This is not a problem for pycrossword: the application stores such untagged words with the special NONE part-of-speech attribute.
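
As a rough illustration of how such lines might be read, the sketch below splits each line into a word and an optional tag. The function name parse_word_line, the "/" delimiter (inferred from the extract above), and the fallback to NONE are assumptions for this example. How pycrossword itself parses these files and maps tags to parts of speech is not shown here.

```python
def parse_word_line(line, delimiter="/", default_pos="NONE"):
    """Split one line into (word, tag); untagged words get default_pos.

    The delimiter is an assumption based on the sample extract.
    """
    text = line.strip()
    word, sep, tag = text.partition(delimiter)
    # Use the tag only if a delimiter was found and something follows it.
    return (word, tag if sep and tag else default_pos)

sample = ["abarrar", "abarrocado/GS", "abarrotar/RED", "abasí/S"]
parsed = [parse_word_line(s) for s in sample]
```

With this approach, tagged and untagged lines can be mixed freely in one file, because every word ends up with some part-of-speech value.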
 

Simple Lists

 
Lists of words (with or without part-of-speech data) can be fed into pycrossword directly, without importing them from a file. With simple lists, you can also declare once, for the whole list, whether its words carry part-of-speech data, which saves memory.
 
Simple lists work best for small word sets: to avoid bloating memory, a word list should contain fewer than 10,000 entries.