vfclust package

Submodules

vfclust.TextGridParser module

class vfclust.TextGridParser.Phone(phone, start, end)

Bases: object

class vfclust.TextGridParser.TextGrid(textgrid)

Bases: object

parse_phones()

Parse TextGrid phone intervals.

This method parses the phone intervals in a TextGrid to extract each phone and each phone’s start and end times in the audio recording. For each phone, it instantiates the class Phone(), with the phone and its start and end times as attributes of that class instance.

parse_words()

Parse TextGrid word intervals.

This method parses the word intervals in a TextGrid to extract each word and each word’s start and end times in the audio recording. For each word, it instantiates the class Word(), with the word and its start and end times as attributes of that class instance. Further, it appends to the class instance’s ‘phones’ attribute each phone that occurs in that word. (It does this by checking which phones’ start and end times are subsumed by the start and end times of the word.)
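The subsumption check described above can be sketched as follows; the function name and tuple format are illustrative, not the actual TextGridParser API (which stores Phone objects):

```python
def phones_in_word(word_start, word_end, phones):
    """Return the phones whose intervals fall inside the word's interval.

    `phones` is a list of (label, start, end) tuples; a phone belongs to
    a word when its start and end times are subsumed by the word's.
    """
    return [p for p in phones if p[1] >= word_start and p[2] <= word_end]
```

For a word spanning 0.0 to 0.5 seconds, a phone ending at 0.6 seconds would be excluded.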

class vfclust.TextGridParser.Word(word, start, end)

Bases: object

vfclust.vfclust module

The VFClust package is designed to generate clustering analyses for transcriptions of semantic and phonemic verbal fluency test responses. In a verbal fluency test, the subject is given a set amount of time (usually 60 seconds) to name as many words as he or she can that correspond to a given specification. For a phonemic test, subjects are asked to name words that begin with a specific letter. For a semantic fluency test, subjects are asked to provide words of a certain category, e.g. animals. VFClust groups words in responses based on phonemic or semantic similarity, as described below. It then calculates metrics derived from the discovered groups and returns them as a CSV file or Python dict.

Run with:

  source /Volumes/Data/virtualenv/vfclust/bin/activate
  python vfclust.py --similarity-file data/similarity/similarity_rand.txt --threshold 0.5 example/EXAMPLE_sem_custom.TextGrid
  python vfclust.py --threshold .99 -p s example/EXAMPLE.TextGrid

class vfclust.vfclust.Args

Dummy class to hold argument properties.

class vfclust.vfclust.ParsedResponse(response_type, letter_or_category, quiet=False, cmudict=None, english_words=None, lemmas=None, names=None, permissible_words=None)

Implements a representation of a subject response, along with methods for parsing it.

ParsedResponse is a list-like class that contains a list of Unit objects and properties relevant to type of clustering being performed by VFClust. It implements methods for simplifying the list of Units, removing repetitions, creating compound words, removing irrelevant response tokens, etc.

clean()

Removes any Units that are not applicable given the current semantic or phonetic category.

Modifies:
  • self.unit_list: Removes Units from this list that do not fit into the clustering category. It does this by either combining units to make compound words, combining units with the same stem, or eliminating units altogether if they do not conform to the category.

    If the type is phonetic, this method also generates phonetic clusters for all Unit objects in self.unit_list.

This method performs three main tasks:
  1. Removes words that do not conform to the clustering category (i.e. start with the wrong letter, or are not an animal).
  2. Combines adjacent words with the same stem into a single unit. The NLTK Porter Stemmer is used to determine whether stems are the same. http://www.nltk.org/_modules/nltk/stem/porter.html
  3. In the case of PHONETIC clustering, computes the phonetic representation of each unit.
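Task 2 can be sketched as below. To keep the sketch self-contained, a trivial suffix-stripping function stands in for the NLTK Porter stemmer:

```python
def naive_stem(word):
    # Stand-in for nltk.stem.PorterStemmer().stem(); strips a few
    # common suffixes. The real stemmer is far more sophisticated.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def merge_same_stem(tokens):
    """Collapse adjacent tokens sharing a stem into one slash-joined unit."""
    merged = []
    for tok in tokens:
        if merged and naive_stem(tok) == naive_stem(merged[-1].split("/")[-1]):
            merged[-1] += "/" + tok
        else:
            merged.append(tok)
    return merged
```

With this sketch, a run like follow/followed/following collapses to one unit while unrelated neighbors are left alone.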

combine_same_stem_units(index)

Combines adjacent words with the same stem into a single unit.

Parameters:index (int) – Index of Unit in self.unit_list to be combined with the subsequent Unit.
Modifies:
  • self.unit_list: Modifies the .original_text property of the Unit corresponding to the index. Changes the .end_time property to be the .end_time of the next Unit, as Units with the same stem are treated as a single Unit in clustering. Finally, after extracting the text and timing information, it removes the unit at index+1.

create_from_csv(token_list)

Fills the ParsedResponse object with a list of words/tokens originally from a .csv file.

Parameters:token_list (list) – List of strings corresponding to words in the subject response.
Modifies:
  • self.timing_included: csv files do not include timing information

  • self.unit_list: fills it with Unit objects derived from the token_list argument.

    If the type is ‘SEMANTIC’, the words in these units are automatically lemmatized and made into compound words where appropriate.

create_from_textgrid(word_list)

Fills the ParsedResponse object with a list of TextGrid.Word objects originally from a .TextGrid file.

Parameters:word_list (list) – List of TextGrid.Word objects corresponding to words/tokens in the subject response.
Modifies:
  • self.timing_included: TextGrid files include timing information

  • self.unit_list: fills it with Unit objects derived from the word_list argument.

    If the type is ‘SEMANTIC’, the words in these units are automatically lemmatized and made into compound words where appropriate.

display()

Pretty-prints the ParsedResponse to the screen.

generate_phonetic_representation(word)

Returns a generated phonetic representation for a word.

Parameters:word (str) – a word to be phoneticized.
Returns:A list of phonemes representing the phoneticized word.

This method is used for words for which there is no pronunciation entry in the CMU dictionary. The function generates a pronunciation for the word in the standard CMU format. This can then be converted to a compact phonetic representation using modify_phonetic_representation().
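A real implementation of this step would use a grapheme-to-phoneme model; the toy fallback below only illustrates the shape of the output (a list of CMU-style phones) and is not the algorithm the package uses:

```python
# Purely illustrative letter-to-phone map; the real CMU phone set is
# larger, and real grapheme-to-phoneme conversion is context-sensitive.
LETTER_TO_PHONE = {"a": "AE", "b": "B", "c": "K", "d": "D", "e": "EH",
                   "g": "G", "o": "OW", "s": "S", "t": "T"}

def crude_phoneticize(word):
    """Map each letter to a CMU-style phone, skipping unmapped letters."""
    return [LETTER_TO_PHONE[ch] for ch in word.lower() if ch in LETTER_TO_PHONE]
```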

lemmatize()

Lemmatize all Units in self.unit_list.

Modifies:
  • self.unit_list: converts the .text property into its lemmatized form.

This method lemmatizes all inflected variants of permissible words to those words’ respective canonical forms. This is done to ensure that each instance of a permissible word will correspond to a term vector with which semantic relatedness to other words’ term vectors can be computed. (Term vectors were derived from a corpus in which inflected words were similarly lemmatized, meaning that, e.g., ‘dogs’ will not have a term vector to use for semantic relatedness computation.)

make_compound_word(start_index, how_many)

Combines two Units in self.unit_list to make a compound word token.

Parameters:
  • start_index (int) – Index of first Unit in self.unit_list to be combined
  • how_many (int) – Number of Units in self.unit_list to be combined.
Modifies:
  • self.unit_list: Modifies the Unit corresponding to the first word in the compound word. Changes the .text property to include the .text properties of subsequent Units, separated by underscores. Modifies the .original_text property to record each component word separately. Modifies the .end_time property to be the .end_time of the final unit in the compound word. Finally, after extracting the text and timing information, it removes all units in the compound word except for the first.
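The underscore-joining behavior can be sketched on plain strings (a hypothetical helper, not the actual method, which operates on Unit objects):

```python
def make_compound(tokens, start, how_many):
    """Join `how_many` tokens starting at `start` into one
    underscore-separated compound, removing the merged tokens."""
    tokens = list(tokens)
    tokens[start:start + how_many] = ["_".join(tokens[start:start + how_many])]
    return tokens
```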

modify_phonetic_representation(phonetic_representation)

Returns a compact phonetic representation given a CMUdict-formatted representation.

Parameters:phonetic_representation (list) – a phonetic representation in standard CMUdict formatting, i.e. a list of phonemes like [‘HH’, ‘EH0’, ‘L’, ‘OW1’]
Returns:A string representing a custom phonetic representation, where each phoneme is mapped to a single ascii character.

Changing the phonetic representation from a list to a string is useful for calculating phonetic similarity scores.
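A sketch of such a compaction, with a made-up phoneme-to-character table built on the fly (the package presumably uses a fixed mapping):

```python
def compact_phones(phones):
    """Map each CMU phoneme to a single ASCII character and join.

    Stress digits (0/1/2) are stripped, so EH0 and EH1 map to the same
    character. The character assignment here is illustrative only.
    """
    mapping = {}
    out = []
    for ph in phones:
        base = ph.rstrip("012")          # drop CMU stress digits
        if base not in mapping:
            mapping[base] = chr(ord("a") + len(mapping))
        out.append(mapping[base])
    return "".join(out)
```

A single-character-per-phoneme string lets string edit distance act directly as a phoneme-level edit distance.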

remove_unit(index)

Removes the unit at the given index in self.unit_list. Does not modify any other units.

tokenize()

Tokenizes all multiword names in the list of Units.

Modifies:
  • (indirectly) self.unit_list, by combining words into compound words.

This is done because many names may be composed of multiple words, e.g., ‘grizzly bear’. In order to count the number of permissible words generated, and also to compute semantic relatedness between these multiword names and other names, multiword names must each be reduced to a respective single token.

class vfclust.vfclust.Unit(word, format, type, index_in_timed_response=None)

Class to hold a sequence of 1 or more adjacent words with the same stem, or a compound word.

A Unit may represent:
  • single words (dog, cat)
  • lemmatized/compound words (polar_bear)
  • several adjacent words with the same root (follow/followed/following)
The object also includes:
  • phonetic/semantic representation of the FIRST word, if more than one
  • The start time of the first word and the ending time of the final word
class vfclust.vfclust.VFClustEngine(response_category, response_file_path, target_file_path=None, collection_types=['cluster', 'chain'], similarity_measures=['phone', 'biphone', 'lsa'], clustering_parameter=91, quiet=False, similarity_file=None, threshold=None)

Bases: object

Class used for encapsulating clustering methods and data.

compute_between_collection_interval_duration(prefix)

Calculates BETWEEN-collection intervals for the current collection and measure type and takes their mean.

Parameters:prefix (str) – Prefix for the key entry in self.measures.

Negative intervals (for overlapping clusters) are counted as 0 seconds. Intervals are calculated as being the difference between the ending time of the last word in a collection and the start time of the first word in the subsequent collection.

Note that these intervals are not necessarily silences, and may include asides, filled pauses, words from the examiner, etc.

Adds the following measures to the self.measures dictionary:
  • TIMING_(similarity_measure)_(collection_type)_between_collection_interval_duration_mean: average interval duration separating clusters
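The interval computation described above can be sketched on (start, end) time pairs standing in for timed collections, with negative (overlapping) intervals clamped to 0:

```python
def between_collection_intervals(collections):
    """Mean gap between consecutive collections, clamping overlaps to 0.

    `collections` is a list of (start_time, end_time) pairs in response
    order; this is a schematic stand-in for lists of timed Unit objects.
    """
    gaps = [max(0.0, nxt[0] - cur[1])
            for cur, nxt in zip(collections, collections[1:])]
    return sum(gaps) / len(gaps) if gaps else 0.0
```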

compute_collection_measures(no_singletons=False)

Computes summaries of measures using the discovered collections.

Parameters:no_singletons – if True, omits collections of length 1 from all measures and includes “no_singletons_” in the measure name.

Adds the following measures to the self.measures dictionary, prefaced by COLLECTION_(similarity_measure)_(collection_type)_:

  • count: number of collections
  • size_mean: mean size of collections
  • size_max: size of largest collection
  • switch_count: number of changes between clusters
compute_collections()

Finds the collections (clusters,chains) that exist in parsed_response.

Modifies:
  • self.collection_sizes: populated with a list of integers indicating the number of units belonging to each collection
  • self.collection_indices: populated with a list of strings indicating the indices of each element of each collection
  • self.collection_list: populated with a list of lists, each containing the Unit objects belonging to one collection

There are two types of collections currently implemented:
  • cluster: every entry in a cluster is sufficiently similar to every other entry
  • chain: every entry in a chain is sufficiently similar to adjacent entries

Similarity between words is calculated using the compute_similarity_score method. Scores between words are then thresholded and binarized using empirically-derived thresholds (see: ???). Overlap of clusters is allowed (a word can be part of multiple clusters), but overlapping chains are not possible, as any two adjacent words with a lower-than-threshold similarity breaks the chain. Clusters subsumed by other clusters are not counted. Singletons, i.e., clusters of size 1, are included in this analysis.
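Chain discovery in particular can be sketched as follows; `similarity` stands in for the compute_similarity_score method and `threshold` for the empirically derived cutoff. By construction, any adjacent pair scoring below threshold breaks the chain, so chains cannot overlap:

```python
def find_chains(words, similarity, threshold):
    """Split a response into chains: maximal runs in which every
    adjacent pair scores at or above `threshold`."""
    chains, current = [], [words[0]]
    for prev, word in zip(words, words[1:]):
        if similarity(prev, word) >= threshold:
            current.append(word)
        else:
            chains.append(current)
            current = [word]
    chains.append(current)
    return chains
```

Singleton chains (length 1) fall out naturally whenever a word is insufficiently similar to both neighbors.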

compute_duration_measures()

Helper function for computing measures derived from timing information.

These are only computed if the response is a TextGrid with timing information.

All times are in seconds.

compute_pairwise_similarity_score()

Computes the average pairwise similarity score between all pairs of Units.

The pairwise similarity is calculated as the sum of similarity scores over all word pairs in a response (excluding any pair composed of a word and itself), divided by the total number of pairs; i.e., it is the mean similarity over all word pairs.

Adds the following measures to the self.measures dictionary:
  • COLLECTION_collection_pairwise_similarity_score_mean: mean of pairwise similarity scores
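One schematic reading of this measure, the mean over all distinct word pairs, can be sketched as:

```python
from itertools import combinations

def mean_pairwise_similarity(words, similarity):
    """Mean similarity over all distinct word pairs; a word is never
    paired with itself. `similarity` stands in for the engine's
    compute_similarity_score method."""
    pairs = list(combinations(words, 2))
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)
```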
compute_response_continuant_duration(prefix)

Computes mean duration for continuants in response.

Parameters:prefix (str) – Prefix for the key entry in self.measures.
Adds the following measures to the self.measures dictionary:
  • TIMING_(similarity_measure)_(collection_type)_response_continuant_duration_mean: average continuant duration over all continuants in the response.

compute_response_vowel_duration(prefix)

Computes mean vowel duration in entire response.

Parameters:prefix (str) – Prefix for the key entry in self.measures.
Adds the following measures to the self.measures dictionary:
  • TIMING_(similarity_measure)_(collection_type)_response_vowel_duration_mean: average vowel duration of all vowels in the response.

compute_similarity_score(unit1, unit2)

Returns the similarity score between two words.

The type of similarity scoring method used depends on the currently active method and clustering type.
Parameters:
  • unit1 (Unit) – Unit object corresponding to the first word.
  • unit2 (Unit) – Unit object corresponding to the second word.
Returns:

Number indicating degree of similarity of the two input words. The maximum value is 1, and a higher value indicates that the words are more similar.

Return type: float

The similarity method used depends both on the type of test being performed (SEMANTIC or PHONETIC) and the similarity method currently assigned to the self.current_similarity_measure property of the VFClustEngine object. The similarity measures used are the following:

  • PHONETIC/“phone”: the phonetic similarity score (PSS) is calculated between the phonetic representations of the input units. It is equal to 1 minus the Levenshtein distance between the two strings, normalized to the length of the longer string. The strings should be compact phonetic representations of the two words. (This method is a modification of a Levenshtein distance function available at http://hetland.org/coding/python/levenshtein.py.)

  • PHONETIC/“biphone”: the binary common-biphone score (CBS) depends on whether two words share their initial and/or final biphone (i.e., set of two phonemes). A score of 1 indicates that two words have the same initial and/or final biphone; a score of 0 indicates that two words have neither the same initial nor the same final biphone. This is also calculated using the phonetic representation of the two words.

  • SEMANTIC/“lsa”: a semantic relatedness score (SRS) is calculated as the cosine of the respective term vectors for the first and second word in an LSA space of the specified clustering_parameter. Unlike the PHONETIC methods, this method uses the .text property of the input Unit objects.
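The three measures can be sketched as follows, assuming compact strings for the phonetic representations (as produced by modify_phonetic_representation), phoneme lists for the biphone check, and plain numeric lists for the LSA term vectors:

```python
import math

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def phonetic_similarity(p1, p2):
    """PSS: 1 minus the edit distance normalized by the longer string."""
    return 1.0 - levenshtein(p1, p2) / max(len(p1), len(p2))

def common_biphone_score(p1, p2):
    """CBS: 1 if two phoneme lists share their initial and/or final
    biphone (first/last two phonemes), else 0."""
    return 1 if (p1[:2] == p2[:2] or p1[-2:] == p2[-2:]) else 0

def cosine(u, v):
    """SRS: cosine of the angle between two term vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms
```

Identical words score 1.0 under PSS and SRS; the CBS is binary by design.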

compute_similarity_scores()

Produce a list of similarity scores for each contiguous pair in a response.

Calls compute_similarity_score method for every adjacent pair of words. The results are not used in clustering; this is merely to provide a visual representation to print to the screen.

Modifies:
  • self.similarity_scores: Fills the list with similarity scores between adjacent words. At this point this list is never used outside of this method.

compute_within_collection_continuant_duration(prefix, no_singletons=False)

Computes the mean duration of continuants from Units within clusters.

Parameters:
  • prefix (str) – Prefix for the key entry in self.measures
  • no_singletons (bool) – If True, excludes collections of length 1 from calculations and adds “no_singletons” to the prefix
Adds the following measures to the self.measures dictionary:
  • TIMING_(similarity_measure)_(collection_type)_within_collection_continuant_duration_mean
compute_within_collection_interval_duration(prefix)

Calculates mean between-word duration WITHIN collections.

Parameters:prefix (str) – Prefix for the key entry in self.measures.

Calculates the mean time between the end of each word in the collection and the beginning of the next word. Note that these times do not necessarily reflect pauses, as collection members could be separated by asides or other noises.

Adds the following measures to the self.measures dictionary:
  • TIMING_(similarity_measure)_(collection_type)_within_collection_interval_duration_mean
compute_within_collection_vowel_duration(prefix, no_singletons=False)

Computes the mean duration of vowels from Units within clusters.

Parameters:
  • prefix (str) – Prefix for the key entry in self.measures
  • no_singletons (bool) – If True, excludes collections of length 1 from calculations and adds “no_singletons” to the prefix
Adds the following measures to the self.measures dictionary:
  • TIMING_(similarity_measure)_(collection_type)_within_collection_vowel_duration_mean
get_collection_measures()

Helper function for calculating measurements derived from clusters/chains/collections

get_collections()

Helper function for determining what the clusters/chains/other collections are.

get_raw_counts()

Determines counts for unique words, repetitions, etc using the raw text response.

Adds the following measures to the self.measures dictionary:
  • COUNT_total_words: count of words (i.e. utterances with semantic content) spoken by the subject. Filled pauses, silences, coughs, breaths, words by the interviewer, etc. are all excluded from this count.
  • COUNT_permissible_words: number of words spoken by the subject that qualify as a valid response according to the clustering criteria. Compound words are counted as a single word in SEMANTIC clustering, but as two words in PHONETIC clustering. This is implemented by tokenizing SEMANTIC clustering responses in the __init__ method before calling the current method.
  • COUNT_exact_repetitions: number of words which repeat words spoken earlier in the response. Responses in SEMANTIC clustering are lemmatized before this function is called, so slight variations (dog, dogs) may be counted as exact repetitions.
  • COUNT_stem_repetitions: number of words with stems identical to words uttered earlier in the response, according to the Porter Stemmer. For example, ‘sled’ and ‘sledding’ have the same stem (‘sled’), so ‘sledding’ would be counted as a stem repetition.
  • COUNT_examiner_words: number of words uttered by the examiner. These start with “E_” in .TextGrid files.
  • COUNT_filled_pauses: number of filled pauses uttered by the subject. These begin with “FILLEDPAUSE_” in the .TextGrid file.
  • COUNT_word_fragments: number of word fragments uttered by the subject. These end with “-” in the .TextGrid file.
  • COUNT_asides: words spoken by the subject that do not adhere to the test criteria, i.e. words that do not start with the appropriate letter or that do not represent an animal.
  • COUNT_unique_permissible_words: number of words spoken by the subject, less asides, stem repetitions and exact repetitions.
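The repetition counts can be sketched as follows; `stem` stands in for the Porter stemmer, and the function is a hypothetical helper rather than the actual method:

```python
def count_repetitions(words, stem):
    """Count exact repetitions and stem repetitions in a word sequence.

    A word already seen verbatim counts as an exact repetition; a new
    word whose stem was seen before counts as a stem repetition.
    """
    exact = stemmed = 0
    seen_words, seen_stems = set(), set()
    for w in words:
        if w in seen_words:
            exact += 1
        elif stem(w) in seen_stems:
            stemmed += 1
        seen_words.add(w)
        seen_stems.add(stem(w))
    return exact, stemmed
```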

get_similarity_measures()

Helper function for computing similarity measures.

load_lsa_information()

Loads a dictionary from disk that maps permissible words to their LSA term vectors.

print_output()

Outputs the final list of measures to the screen and to a .csv file.

The .csv file created has the same name as the input file, with “vfclust_TYPE_CATEGORY” appended to the filename, where TYPE indicates the type of task performed (SEMANTIC or PHONETIC) and CATEGORY indicates the category requirement of the stimulus (i.e. ‘f’ or ‘animals’ for phonetic and semantic fluency tests, respectively).

exception vfclust.vfclust.VFClustException

Bases: exceptions.Exception

Custom exception class – better than using asserts.

vfclust.vfclust.get_duration_measures(source_file_path, output_path=None, phonemic=False, semantic=False, quiet=False, similarity_file=None, threshold=None)

Parses input arguments and runs clustering algorithm.

Parameters:
  • source_file_path – Required. Location of the .csv or .TextGrid file to be analyzed.
  • output_path – Path to which to write the resultant csv file. If left None, path will be set to the source_file_path. If set to False, no file will be written.
  • phonemic – The letter used for phonetic clustering. Note: should be False if semantic clustering is being used.
  • semantic – The word category used for semantic clustering. Note: should be False if phonetic clustering is being used.
  • quiet – Set to True if you want to suppress output to the screen during processing.
  • similarity_file (optional) – When doing semantic processing, this is the path of a file containing custom term similarity scores that will be used for clustering. If a custom file is used, the default LSA-based clustering will not be performed.
  • threshold (optional) – When doing semantic processing, this threshold is used in conjunction with a custom similarity file. The value is used as a semantic similarity cutoff in clustering. This argument is required if a custom similarity file is specified. It can also be used to override the built-in cluster/chain thresholds.
Return data:

A dictionary of measures derived by clustering the input response.

vfclust.vfclust.get_mean(list_in_question)

Computes the mean of a list of numbers.

Parameters:list_in_question (list) – list of numbers
Returns:mean of the list of numbers

Return type: float

vfclust.vfclust.main(test=False)
vfclust.vfclust.print_table(table)

Helper function for printing tables to screen.

Parameters:table (List of tuples.) – List of tuples where each tuple contains the contents of a row, and each entry in the tuple is the contents of a cell in that row.
vfclust.vfclust.test_script()
vfclust.vfclust.validate_arguments(args)

Makes sure arguments are valid, specified files exist, etc.

Module contents

To set up (from a terminal):

  $ cd /path/to/vfclust/download
  $ python setup.py install

>>> import vfclust

For arguments and default values, see the README.md file.
