Bases: object
Extract word and phone intervals from a TextGrid.
These word and phone intervals contain the words and phones themselves, as well as their respective start and end times in the audio recording.
Parse TextGrid phone intervals.
This method parses the phone intervals in a TextGrid to extract each phone and each phone’s start and end times in the audio recording. For each phone, it instantiates the class Phone(), with the phone and its start and end times as attributes of that class instance.
Parse TextGrid word intervals.
This method parses the word intervals in a TextGrid to extract each word and each word’s start and end times in the audio recording. For each word, it instantiates the class Word(), with the word and its start and end times as attributes of that class instance. Further, it appends to the class instance’s ‘phones’ attribute each phone that occurs in that word. (It does this by checking which phones’ start and end times are subsumed by the start and end times of the word.)
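As an illustration of the subsumption check described above, the sketch below attaches each phone to the word whose interval contains it. The Word and Phone containers here are simplified stand-ins with assumed field names, not the classes defined in this module.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Phone:
    string: str
    start: float
    end: float

@dataclass
class Word:
    string: str
    start: float
    end: float
    phones: List[Phone] = field(default_factory=list)

def attach_phones(words, phones):
    # A phone belongs to a word if its interval is subsumed by the word's interval.
    for word in words:
        word.phones = [p for p in phones
                       if p.start >= word.start and p.end <= word.end]

words = [Word("cat", 0.0, 0.5), Word("dog", 0.6, 1.0)]
phones = [Phone("K", 0.0, 0.15), Phone("AE1", 0.15, 0.35), Phone("T", 0.35, 0.5),
          Phone("D", 0.6, 0.75), Phone("AO1", 0.75, 0.9), Phone("G", 0.9, 1.0)]
attach_phones(words, phones)  # words[0].phones now holds K, AE1, T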
Bases: object
Bases: object
The VFClust package is designed to generate clustering analyses for transcriptions of semantic and phonemic verbal fluency test responses. In a verbal fluency test, the subject is given a set amount of time (usually 60 seconds) to name as many words as he or she can that correspond to a given specification. For a phonemic test, subjects are asked to name words that begin with a specific letter. For a semantic fluency test, subjects are asked to provide words of a certain category, e.g. animals. VFClust groups words in responses based on phonemic or semantic similarity, as described below. It then calculates metrics derived from the discovered groups and returns them as a CSV file or Python dict.
Helper function for printing tables to screen.
Parameters: table (list of tuples) – List of tuples where each tuple contains the contents of a row, and each entry in the tuple is the contents of a cell in that row.
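A minimal sketch of such a helper is shown below; the exact column formatting used by the package is an assumption.

def print_table(table):
    # Pad every column to the width of its longest cell, then print row by row.
    if not table:
        return
    widths = [max(len(str(row[col])) for row in table)
              for col in range(len(table[0]))]
    for row in table:
        print("  ".join(str(cell).ljust(width)
                        for cell, width in zip(row, widths)))

print_table([("measure", "value"), ("total_words", 34)])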
Computes the mean of a list of numbers.
Parameters: list_in_question (list) – list of numbers
Returns: mean of the list of numbers
Return type: float
Bases: exceptions.Exception
Custom exception class – better than using asserts.
Dummy class to hold argument properties.
Class to hold a sequence of one or more adjacent words with the same stem, or a compound word. A Unit may be:
- a single word (dog, cat)
- a lemmatized/compound word (polar_bear)
- several adjacent words with the same root (follow/followed/following)
Initialization of Unit object.
Return type: Unit object
Implements a representation of a subject response, along with methods for parsing it.
ParsedResponse is a list-like class that contains a list of Unit objects and properties relevant to the type of clustering being performed by VFClust. It implements methods for simplifying the list of Units, removing repetitions, creating compound words, removing irrelevant response tokens, etc.
Initializes a ParsedResponse object.
Fills the ParsedResponse object with a list of words/tokens originally from a .csv file.
Parameters: token_list (list) – List of strings corresponding to words in the subject response.
Modifies self.timing_included: .csv files do not include timing information.
If the type is ‘SEMANTIC’, the words in these units are automatically lemmatized and made into compound words where appropriate.
Fills the ParsedResponse object with a list of TextGrid.Word objects originally from a .TextGrid file.
Parameters: word_list (list) – List of TextGrid.Word objects corresponding to words/tokens in the subject response.
Modifies self.timing_included: .TextGrid files include timing information.
If the type is ‘SEMANTIC’, the words in these units are automatically lemmatized and made into compound words where appropriate.
Lemmatize all Units in self.unit_list.
This method lemmatizes all inflected variants of permissible words to those words’ respective canonical forms. This is done to ensure that each instance of a permissible word will correspond to a term vector with which semantic relatedness to other words’ term vectors can be computed. (Term vectors were derived from a corpus in which inflected words were similarly lemmatized, meaning that, e.g., ‘dogs’ will not have a term vector to use for semantic relatedness computation.)
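For illustration, the sketch below lemmatizes a few plural nouns with NLTK’s WordNet lemmatizer; whether VFClust uses this particular lemmatizer is an assumption, so treat it as a sketch of the general step rather than the package’s implementation.

# Requires: pip install nltk; then nltk.download('wordnet') once.
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
words = ["dogs", "mice", "geese"]
print([lemmatizer.lemmatize(w, pos="n") for w in words])  # ['dog', 'mouse', 'goose']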
Tokenizes all multiword names in the list of Units.
This is done because many names may be composed of multiple words, e.g., ‘grizzly bear’. In order to count the number of permissible words generated, and also to compute semantic relatedness between these multiword names and other names, each multiword name must be reduced to a single token.
Combines two Units in self.unit_list to make a compound word token.
Modifies the first Unit in the compound word: changes its .text property to include the .text properties of subsequent Units, separated by underscores, modifies its .original_text property to record each component word separately, and modifies its .end_time property to be the .end_time of the final unit in the compound word. Finally, after extracting the text and timing information, it removes all units in the compound word except for the first.
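The sketch below mirrors the behaviour described above on a simplified dictionary-based unit (the field names text, original_text and end_time follow the description, but the data structure itself is an assumption):

def make_compound_word(unit_list, start_index, how_many):
    # Fold `how_many` adjacent units starting at `start_index` into the first one.
    parts = unit_list[start_index:start_index + how_many]
    texts = [u["text"] for u in parts]
    first = parts[0]
    first["text"] = "_".join(texts)             # e.g. "polar_bear"
    first["original_text"] = texts              # each component word recorded separately
    first["end_time"] = parts[-1]["end_time"]   # end time of the final component
    del unit_list[start_index + 1:start_index + how_many]

units = [{"text": "polar", "original_text": ["polar"], "end_time": 1.4},
         {"text": "bear", "original_text": ["bear"], "end_time": 1.8}]
make_compound_word(units, 0, 2)  # -> one unit: text "polar_bear", end_time 1.8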
Removes the unit at the given index in self.unit_list. Does not modify any other units.
Combines adjacent words with the same stem into a single unit.
Parameters: index (int) – Index of Unit in self.unit_list to be combined with the subsequent Unit.
Modifies the Unit at the given index: changes its .end_time property to be the .end_time of the next Unit, as Units with the same stem are treated as a single Unit in clustering. Finally, after extracting the text and timing information, it removes the unit at index+1.
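A minimal sketch of the same-stem merge, using NLTK’s Porter stemmer (referenced elsewhere in this module) on the same simplified unit representation as above:

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

def combine_same_stem(unit_list, index):
    # Merge the next unit into this one if the two share a Porter stem.
    current, following = unit_list[index], unit_list[index + 1]
    if stemmer.stem(current["text"]) == stemmer.stem(following["text"]):
        current["end_time"] = following["end_time"]  # treated as a single unit in clustering
        del unit_list[index + 1]

units = [{"text": "sled", "end_time": 2.1}, {"text": "sledding", "end_time": 2.9}]
combine_same_stem(units, 0)  # -> one unit with end_time 2.9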
Pretty-prints the ParsedResponse to the screen.
Returns a generated phonetic representation for a word.
Parameters: word (str) – a word to be phoneticized.
Returns: A list of phonemes representing the phoneticized word.
This method is used for words for which there is no pronunciation entry in the CMU dictionary. The function generates a pronunciation for the word in the standard CMU format. This can then be converted to a compact phonetic representation using modify_phonetic_representation().
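For context, the sketch below looks a word up in the CMU Pronouncing Dictionary via NLTK and falls back to a deliberately naive letter-by-letter guess when the word is missing. The fallback rules are purely illustrative; they are not the generation rules this method actually uses.

# Requires: pip install nltk; then nltk.download('cmudict') once.
from nltk.corpus import cmudict

CMU = cmudict.dict()

# Illustrative only: a crude letter-to-phoneme map, not VFClust's rules.
NAIVE_LETTER_MAP = {"a": "AE1", "b": "B", "c": "K", "d": "D", "e": "EH1"}

def phonetic_representation(word):
    # Return a CMU-style phoneme list, guessing when the word is out of vocabulary.
    entries = CMU.get(word.lower())
    if entries:
        return entries[0]  # first listed pronunciation
    return [NAIVE_LETTER_MAP.get(ch, ch.upper()) for ch in word.lower()]

print(phonetic_representation("hello"))  # ['HH', 'AH0', 'L', 'OW1']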
Returns a compact phonetic representation given a CMUdict-formatted representation.
Parameters: phonetic_representation (list) – a phonetic representation in standard CMUdict formatting, i.e. a list of phonemes like [‘HH’, ‘EH0’, ‘L’, ‘OW1’]
Returns: A string representing a custom phonetic representation, where each phoneme is mapped to a single ascii character.
Changing the phonetic representation from a list to a string is useful for calculating phonetic similarity scores.
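A sketch of the phoneme-to-character compaction described above; the particular character assignments are arbitrary here and not the package’s actual mapping.

import string

# The 39 CMUdict phonemes, each mapped to one ASCII letter (assignment is arbitrary).
CMU_PHONEMES = ["AA", "AE", "AH", "AO", "AW", "AY", "B", "CH", "D", "DH",
                "EH", "ER", "EY", "F", "G", "HH", "IH", "IY", "JH", "K",
                "L", "M", "N", "NG", "OW", "OY", "P", "R", "S", "SH",
                "T", "TH", "UH", "UW", "V", "W", "Y", "Z", "ZH"]
PHONE_TO_CHAR = {p: string.ascii_letters[i] for i, p in enumerate(CMU_PHONEMES)}

def compact(phonetic_representation):
    # Strip lexical stress digits, then map each phoneme to its single character.
    return "".join(PHONE_TO_CHAR[p.rstrip("012")] for p in phonetic_representation)

print(compact(["HH", "EH0", "L", "OW1"]))  # a 4-character string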
Removes any Units that are not applicable given the current semantic or phonetic category.
It does this by either combining units to make compound words, combining units with the same stem, or eliminating units altogether if they do not conform to the category.
If the type is phonetic, this method also generates phonetic clusters for all Unit objects in self.unit_list.
Units are removed if they do not conform to the category (i.e., the word starts with the wrong letter, or is not an animal). The Porter Stemmer (http://www.nltk.org/_modules/nltk/stem/porter.html) is used for determining whether stems are the same.
In the case of PHONETIC clustering, compute the phonetic representation of each unit.
Bases: object
Class used for encapsulating clustering methods and data.
Initialize for VFClust analysis of a verbal phonetic or semantic fluency test response.
Parameters:
- target_file_path – file to which VFClust CSV output will be written
- collection_types – list of “cluster” or “chain”: what the measures should be calculated over
- similarity_measures – list of types of similarity measures to use between words; at this point “phone” and “biphone” are supported
This method does the following:
- parses input arguments
- loads the data required for clustering, i.e. permissible words, LSA feature vectors, a dictionary of English words, etc.
- parses the subject response, generating a parsed_response object
- performs clustering
- produces a .csv file with clustering results
The self.measures dictionary is used to hold all measures derived from the analysis. The actual collections produced are printed to screen, but only the measures derived from clustering are output to the .csv file.
Note
Both clusters and chains are implemented as collection types. Because there is more than one type, the word “collection” is used throughout to refer to both clusters and chains. However, “clustering” is still used to mean the process of discovering these groups.
Note
At this point, the only category of semantic clustering available is “animals.”
Loads a dictionary from disk that maps permissible words to their LSA term vectors.
Helper function for computing similarity measures.
Helper function for determining what the clusters/chains/other collections are.
Helper function for calculating measurements derived from clusters/chains/collections
Determines counts for unique words, repetitions, etc. using the raw text response. The counts derived include:
- the number of words produced by the subject. Filled pauses, silences, coughs, breaths, words by the interviewer, etc. are all excluded from this count.
- the number of words that count as a valid response according to the clustering criteria. Compound words are counted as a single word in SEMANTIC clustering, but as two words in PHONETIC clustering. This is implemented by tokenizing SEMANTIC clustering responses in the __init__ method before calling the current method.
- the number of exact repetitions, i.e. words that repeat a word occurring earlier in the response. Responses in SEMANTIC clustering are lemmatized before this function is called, so slight variations (dog, dogs) may be counted as exact repetitions.
- the number of stem repetitions, i.e. words whose stem matches that of a word occurring earlier in the response, according to the Porter Stemmer. For example, ‘sled’ and ‘sledding’ have the same stem (‘sled’), and ‘sledding’ would be counted as a stem repetition. (A counting sketch follows this list.)
- the number of words spoken by the interviewer, which begin with “E_” in .TextGrid files.
- the number of filled pauses, which begin with “FILLEDPAUSE_” in the .TextGrid file.
- the number of word fragments, which end with “-” in the .TextGrid file.
- the number of asides, i.e. words that do not start with the appropriate letter or that do not represent an animal.
- the number of unique words, excluding stem repetitions and exact repetitions.
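The sketch below shows one way the exact- and stem-repetition tallies could be computed over a token list; VFClust’s own bookkeeping may differ in detail.

from nltk.stem.porter import PorterStemmer

def count_repetitions(words):
    # Count words repeating an earlier word exactly, and words sharing only a stem.
    stemmer = PorterStemmer()
    exact = stem = 0
    seen_words, seen_stems = set(), set()
    for word in words:
        if word in seen_words:
            exact += 1
        elif stemmer.stem(word) in seen_stems:
            stem += 1  # same stem as an earlier, different word
        seen_words.add(word)
        seen_stems.add(stemmer.stem(word))
    return exact, stem

print(count_repetitions(["sled", "dog", "sledding", "dog"]))  # (1, 1)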
Returns the similarity score between two words.
The type of similarity scoring method used depends on the currently active method and clustering type.
Returns: Number indicating degree of similarity of the two input words. The maximum value is 1, and a higher value indicates that the words are more similar.
Return type: float
The similarity method used depends both on the type of test being performed (SEMANTIC or PHONETIC) and the similarity method currently assigned to the self.current_similarity_measure property of the VFClustEngine object. The similarity measures used are the following:
- PHONETIC/“phone”: the phonetic similarity score (PSS) is calculated between the phonetic representations of the input units. It is equal to 1 minus the Levenshtein distance between two strings, normalized to the length of the longer string. The strings should be compact phonetic representations of the two words. (This method is a modification of a Levenshtein distance function available at http://hetland.org/coding/python/levenshtein.py.) A minimal sketch of this computation follows this list.
- PHONETIC/“biphone”: the binary common-biphone score (CBS) depends on whether two words share their initial and/or final biphone (i.e., set of two phonemes). A score of 1 indicates that two words have the same initial and/or final biphone; a score of 0 indicates that two words have neither the same initial nor final biphone. This is also calculated using the phonetic representation of the two words.
- SEMANTIC/“lsa”: a semantic relatedness score (SRS) is calculated as the cosine similarity between the respective term vectors for the first and second word in an LSA space of the specified clustering_parameter. Unlike the PHONETIC methods, this method uses the .text property of the input Unit objects.
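As referenced in the “phone” item above, here is a minimal sketch of that score: 1 minus the Levenshtein distance between the two compact phonetic strings, normalized to the length of the longer string. The Levenshtein routine here is a generic implementation, not the modified one the package ships.

def levenshtein(a, b):
    # Classic dynamic-programming edit distance between two strings.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                 # deletion
                               current[j - 1] + 1,              # insertion
                               previous[j - 1] + (ca != cb)))   # substitution
        previous = current
    return previous[-1]

def phonetic_similarity_score(p1, p2):
    # 1 - (edit distance normalized to the longer string's length).
    return 1.0 - levenshtein(p1, p2) / max(len(p1), len(p2))

print(phonetic_similarity_score("pkl", "pkr"))  # 0.666...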
Produce a list of similarity scores for each contiguous pair in a response.
Calls compute_similarity_score method for every adjacent pair of words. The results are not used in clustering; this is merely to provide a visual representation to print to the screen.
Produces a list of similarity scores between adjacent words. At this point this list is never used outside of this method.
Finds the collections (clusters, chains) that exist in parsed_response. For each collection type, this method records:
- the number of units belonging to each collection
- the indices of each element of each collection
- the Unit objects belonging to each collection
There are two types of collections currently implemented:
- cluster: every entry in a cluster is sufficiently similar to every other entry
- chain: every entry in a chain is sufficiently similar to adjacent entries
Similarity between words is calculated using the compute_similarity_score method. Scores between words are then thresholded and binarized using empirically-derived thresholds (see: ???). Overlap of clusters is allowed (a word can be part of multiple clusters), but overlapping chains are not possible, as any two adjacent words with a lower-than-threshold similarity break the chain. Clusters subsumed by other clusters are not counted. Singletons, i.e., clusters of size 1, are included in this analysis.
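As a sketch of the chain discovery described above (cluster discovery, which allows overlap, needs a more involved search), adjacent similarity scores are compared against a threshold and a chain breaks wherever a pair falls below it. The similarity function and threshold here are placeholders, not the empirically derived ones.

def find_chains(words, similarity, threshold):
    # Split `words` into chains of adjacent entries whose similarity meets `threshold`.
    chains, current = [], [words[0]]
    for previous, word in zip(words, words[1:]):
        if similarity(previous, word) >= threshold:
            current.append(word)    # chain continues
        else:
            chains.append(current)  # below threshold: the chain breaks here
            current = [word]
    chains.append(current)          # singletons are kept as chains of size 1
    return chains

def toy_similarity(a, b):
    # Placeholder: fraction of shared letters, not a VFClust measure.
    return len(set(a) & set(b)) / len(set(a) | set(b))

print(find_chains(["cat", "rat", "dog"], toy_similarity, threshold=0.4))
# [['cat', 'rat'], ['dog']]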
Computes the average pairwise similarity score between all pairs of Units.
The pairwise similarity is calculated as the sum of similarity scores for all pairwise word pairs in a response – except any pair composed of a word and itself – divided by the total number of words in an attempt. I.e., the mean similarity for all pairwise word pairs.
Computes summaries of measures using the discovered collections.
Parameters: no_singletons – if True, omits collections of length 1 from all measures and includes “no_singletons_” in the measure name.
Adds the following measures to the self.measures dictionary, prefaced by COLLECTION_(similarity_measure)_(collection_type)_ (a computation sketch follows the list):
- count: number of collections
- size_mean: mean size of collections
- size_max: size of largest collection
- switch_count: number of changes between clusters
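As referenced above, the sketch below derives those summaries from a list of discovered collections. The switch-count definition used here (transitions between successive collections) is an assumption.

def collection_measures(collections, no_singletons=False):
    # Summarize a list of collections (each collection is a list of words/Units).
    if no_singletons:
        collections = [c for c in collections if len(c) > 1]
    sizes = [len(c) for c in collections]
    return {
        "count": len(collections),
        "size_mean": sum(sizes) / len(sizes) if sizes else 0,
        "size_max": max(sizes) if sizes else 0,
        "switch_count": max(len(collections) - 1, 0),  # assumed definition
    }

print(collection_measures([["cat", "rat"], ["dog"], ["goat", "stoat", "shoat"]]))
# {'count': 3, 'size_mean': 2.0, 'size_max': 3, 'switch_count': 2}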
Helper function for computing measures derived from timing information.
These are only computed if the response is a TextGrid with timing information.
All times are in seconds.
Computes mean vowel duration in entire response.
Parameters: prefix (str) – Prefix for the key entry in self.measures.
Adds to self.measures the mean vowel duration of all vowels in the response.
Computes mean duration for continuants in response.
Parameters: prefix (str) – Prefix for the key entry in self.measures.
Adds to self.measures the mean duration of all continuants in the response.
Calculates the mean duration of the intervals between collections.
Parameters: prefix (str) – Prefix for the key entry in self.measures.
Negative intervals (for overlapping clusters) are counted as 0 seconds. Intervals are calculated as being the difference between the ending time of the last word in a collection and the start time of the first word in the subsequent collection.
Note that these intervals are not necessarily silences, and may include asides, filled pauses, words from the examiner, etc.
Adds to self.measures the average interval duration separating clusters.
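The sketch below computes these intervals from per-word timing, clamping negative (overlapping) gaps to 0 seconds as described; collections are represented simply as lists of (start_time, end_time) tuples.

def between_collection_intervals(collections):
    # collections: list of lists of (start_time, end_time) tuples, in response order.
    intervals = []
    for current, following in zip(collections, collections[1:]):
        gap = following[0][0] - current[-1][1]  # next collection's start minus this one's end
        intervals.append(max(gap, 0.0))         # overlap counted as 0 seconds
    return intervals

gaps = between_collection_intervals(
    [[(0.0, 0.5), (0.6, 1.0)], [(0.9, 1.4)], [(2.0, 2.6)]])
print(gaps, sum(gaps) / len(gaps))  # approximately [0.0, 0.6] and mean 0.3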
Calculates mean between-word duration WITHIN collections.
Parameters: prefix (str) – Prefix for the key entry in self.measures.
Calculates the mean time between the end of each word in the collection and the beginning of the next word. Note that these times do not necessarily reflect pauses, as collection members could be separated by asides or other noises.
Computes the mean duration of vowels from Units within clusters.
Computes the mean duration of continuants from Units within clusters.
Outputs the final list of measures to the screen and to a .csv file.
The .csv file created has the same name as the input file, with “vfclust_TYPE_CATEGORY” appended to the filename, where TYPE indicates the type of task performed (SEMANTIC or PHONETIC) and CATEGORY indicates the category requirement of the stimulus (i.e. ‘f’ or ‘animals’ for phonetic and semantic fluency tests, respectively).
Parses input arguments and runs clustering algorithm.
Returns: A dictionary of measures derived by clustering the input response.
Makes sure arguments are valid, specified files exist, etc.
To set up (from terminal):
$ cd /path/to/vfclust/download
$ python setup.py install
>>> import vfclust
For arguments and default values, see the README.md file.