align.calculate_baseline_alignment

align.calculate_baseline_alignment(input_files, surrogate_file_directory, output_file_directory, semantic_model_input_file, pretrained_input_file, high_sd_cutoff=3, low_n_cutoff=1, id_separator='\\-', condition_label='cond', dyad_label='dyad', all_surrogates=True, keep_original_turn_order=True, delay=1, maxngram=2, use_pretrained_vectors=True, ignore_duplicates=True, add_stanford_tags=False, input_as_directory=True)

Calculate baselines for lexical, syntactic, and conceptual alignment between speakers.
Given a directory of individual .txt files and the vocabulary list generated by the prepare_transcripts preparation stage, return multi-level alignment scores, with turn-by-turn and conversation-level metrics, for surrogate baseline conversations.
Parameters:
input_files : str (directory name) or list of str (file names)
Cleaned files to be analyzed. How this value is read also depends on the input_as_directory parameter.
surrogate_file_directory : str
Name of directory where raw surrogate data will be saved.
output_file_directory : str
Name of directory where output for individual surrogate conversations will be saved.
semantic_model_input_file : str
Name of file to be used for creating the semantic model. A compatible file will be saved as an output of prepare_transcripts().
pretrained_input_file : str or None
Name of the pretrained vector model to use when creating the semantic model, or None if no pretrained model is used. This parameter works together with use_pretrained_vectors.
high_sd_cutoff : int, optional (default: 3)
High-frequency cutoff (in SD over the mean) for lexical items when creating the semantic model.
low_n_cutoff : int, optional (default: 1)
Low-frequency cutoff (in raw frequency) for lexical items when creating the semantic model. Items with a frequency less than or equal to this number will be removed. To disable the low-frequency cutoff, set it to 0.
id_separator : str, optional (default: '\\-')
Character separator between the dyad and condition IDs in original data file names.
condition_label : str, optional (default: 'cond')
String preceding ID for each unique condition. Anything after this label will be identified as a unique condition ID.
dyad_label : str, optional (default: 'dyad')
String preceding ID for each unique dyad. Anything after this label will be identified as a unique dyad ID.
all_surrogates : boolean, optional (default: True)
Specify whether to generate all possible surrogates across original dataset (True) or to generate only a subset of surrogates equal to the real sample size drawn randomly from all possible surrogates (False).
keep_original_turn_order : boolean, optional (default: True)
Specify whether to retain original turn ordering when pairing surrogate dyads (True) or to pair surrogate partners’ turns in random order (False).
delay : int, optional (default: 1)
Delay (or lag) at which to calculate similarity. A lag of 1 (default) considers only adjacent turns.
maxngram : int, optional (default: 2)
Maximum n-gram size for calculations. Similarity scores for n-grams from unigrams to the maximum size specified here will be calculated.
use_pretrained_vectors : boolean, optional (default: True)
Specify whether to use a pretrained gensim model for word2vec analysis. If True, the file name of a valid model must be provided to the pretrained_input_file parameter.
ignore_duplicates : boolean, optional (default: True)
Specify whether to remove exact duplicates when calculating part-of-speech similarity scores. By default, perfectly mimicked lexical items are ignored in the POS similarity calculation.
add_stanford_tags : boolean, optional (default: False)
Specify whether to return part-of-speech similarity scores based on the Stanford POS tagger (in addition to those from the Penn POS tagger).
input_as_directory : boolean, optional (default: True)
Specify whether the value passed to the input_files parameter should be read as a directory (True) or as a list of files to be processed (False).
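The two frequency cutoffs described under high_sd_cutoff and low_n_cutoff can be illustrated with a small sketch. This is not ALIGN's implementation: the helper filter_by_frequency is hypothetical, and the use of the population standard deviation and of a strict "greater than" test for the upper bound are assumptions made only for illustration.

```python
from collections import Counter
from statistics import mean, pstdev


def filter_by_frequency(tokens, high_sd_cutoff=3, low_n_cutoff=1):
    """Return the set of words that survive both frequency cutoffs.

    Hypothetical helper, written only to illustrate the parameter
    descriptions; it is not part of the align package.
    """
    counts = Counter(tokens)
    freqs = list(counts.values())
    # Assumption: the upper bound is mean + k * (population SD) of word counts.
    upper = mean(freqs) + high_sd_cutoff * pstdev(freqs)
    return {
        word
        for word, n in counts.items()
        # Drop words at or below the low cutoff and words above the upper bound.
        if n > low_n_cutoff and n <= upper
    }


tokens = ["the"] * 50 + ["dog"] * 3 + ["cat"] * 2 + ["ran"] * 2 + ["zebra"]
kept = filter_by_frequency(tokens, high_sd_cutoff=1, low_n_cutoff=1)
print(sorted(kept))
```

In this toy corpus, "the" is far more frequent than the other words and "zebra" occurs only once, so both are dropped when high_sd_cutoff=1 and low_n_cutoff=1. Setting low_n_cutoff=0 keeps "zebra", matching the description that 0 disables the low-frequency cutoff.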
Returns:
surrogate_final_turn_df : Pandas DataFrame
A dataframe of lexical, syntactic, and conceptual alignment scores between turns at the specified delay for surrogate partners. NaN values will be returned for turns in which the speaker produced only words that were removed from the corpus (e.g., words that were too rare or too common) or words that were present in the corpus but not in the semantic model.
surrogate_final_convo_df : Pandas DataFrame
A dataframe of lexical, syntactic, and conceptual alignment scores between surrogate partners across the entire conversation.
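A call might look like the sketch below. Every path and file name in it is a hypothetical placeholder, and the assumption that the two dataframes come back as a tuple in the order listed under Returns is inferred from this page rather than confirmed. The import is deferred into the helper so that the argument dictionary can be inspected without the package installed.

```python
import os

# Hypothetical project layout; none of these paths come from the documentation.
BASE = "my_project"

baseline_kwargs = {
    "input_files": os.path.join(BASE, "prepped_transcripts"),
    "surrogate_file_directory": os.path.join(BASE, "surrogates"),
    "output_file_directory": os.path.join(BASE, "surrogate_output"),
    # Placeholder name for the file saved by prepare_transcripts().
    "semantic_model_input_file": os.path.join(BASE, "semantic_model_input.txt"),
    # No pretrained vectors in this sketch, so the file is None
    # and use_pretrained_vectors is switched off to match.
    "pretrained_input_file": None,
    "use_pretrained_vectors": False,
    "all_surrogates": False,           # random subset equal to the real sample size
    "keep_original_turn_order": True,
    "delay": 1,                        # compare adjacent turns only
    "maxngram": 2,                     # unigrams and bigrams
    "input_as_directory": True,        # input_files is a directory name
}


def run_baseline(kwargs):
    """Call calculate_baseline_alignment with the given keyword arguments."""
    import align  # deferred import: requires the ALIGN package

    # Assumption: the two dataframes are returned as a tuple in documented order.
    turn_df, convo_df = align.calculate_baseline_alignment(**kwargs)
    return turn_df, convo_df
```

With real data in place, run_baseline(baseline_kwargs) would be expected to give the turn-level and conversation-level surrogate dataframes described above.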