align.calculate_alignment

align.calculate_alignment(input_files, output_file_directory, semantic_model_input_file, pretrained_input_file, high_sd_cutoff=3, low_n_cutoff=1, delay=1, maxngram=2, use_pretrained_vectors=True, ignore_duplicates=True, add_stanford_tags=False, input_as_directory=True)

Calculate lexical, syntactic, and conceptual alignment between speakers.

Given a directory of individual .txt files and the vocabulary list, both generated by prepare_transcripts(), return multi-level alignment scores as turn-by-turn and conversation-level metrics.

Parameters:

input_files : str (directory name) or list of str (file names)

Cleaned files to be analyzed. Whether this value is read as a directory name or as a list of file names is controlled by the input_as_directory parameter.

output_file_directory : str

Name of directory where output for individual conversations will be saved.

semantic_model_input_file : str

Name of file to be used for creating the semantic model. A compatible file will be saved as an output of prepare_transcripts().

pretrained_input_file : str or None

If using pretrained vectors to create the semantic model, provide the file name of the pretrained model here; otherwise, use None. Whether this value is used is controlled by the use_pretrained_vectors parameter.

high_sd_cutoff : int, optional (default: 3)

High-frequency cutoff (in standard deviations above the mean frequency) for lexical items when creating the semantic model.

low_n_cutoff : int, optional (default: 1)

Low-frequency cutoff (in raw frequency) for lexical items when creating the semantic model. Items with a frequency less than or equal to the number provided here will be removed. To remove the low-frequency cutoff, set to 0.
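
The two frequency cutoffs can be illustrated with a minimal sketch. This is not the library's implementation: the helper name filter_vocabulary is hypothetical, the population standard deviation is an assumption, and so is the treatment of words that fall exactly on the upper bound.

```python
from collections import Counter
from statistics import mean, pstdev


def filter_vocabulary(tokens, high_sd_cutoff=3, low_n_cutoff=1):
    """Return the set of words that survive both frequency cutoffs.

    Hypothetical helper for illustration only; the library may compute
    the bounds differently.
    """
    freqs = Counter(tokens)
    counts = list(freqs.values())
    # Upper bound: mean frequency plus high_sd_cutoff standard deviations
    # (population SD is an assumption of this sketch).
    upper = mean(counts) + high_sd_cutoff * pstdev(counts)
    # Words with frequency <= low_n_cutoff are dropped as too rare;
    # words with frequency above the upper bound are dropped as too common.
    return {
        word
        for word, n in freqs.items()
        if n > low_n_cutoff and n <= upper
    }


tokens = ["the"] * 50 + ["dog"] * 3 + ["cat"] * 2 + ["bird"] * 2 + ["emu"]
kept = filter_vocabulary(tokens, high_sd_cutoff=1, low_n_cutoff=1)
kept_without_low_cutoff = filter_vocabulary(tokens, high_sd_cutoff=1, low_n_cutoff=0)
```

In this toy corpus, "the" exceeds the high-frequency bound and "emu" occurs only once, so both are removed under the first call; setting low_n_cutoff to 0 keeps "emu".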

delay : int, optional (default: 1)

Delay (or lag) at which to calculate similarity. A lag of 1 (default) considers only adjacent turns.
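
As a rough illustration of what a delay means, the sketch below pairs each turn with the turn that follows it at the given lag. The helper name pair_turns is hypothetical, and the library's internal pairing logic may differ in detail.

```python
def pair_turns(turns, delay=1):
    """Pair each turn with the turn that occurs `delay` positions later.

    Hypothetical helper for illustration; not part of the library.
    """
    if delay < 1:
        raise ValueError("delay must be a positive integer")
    # zip stops at the shorter sequence, so the final `delay` turns
    # have no partner and produce no pair.
    return list(zip(turns, turns[delay:]))


turns = ["A1", "B1", "A2", "B2"]
adjacent_pairs = pair_turns(turns, delay=1)
lag_two_pairs = pair_turns(turns, delay=2)
```

With delay=1 every pair consists of adjacent turns; with delay=2 each turn is compared with the turn two positions later.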

maxngram : int, optional (default: 2)

Maximum n-gram size for calculations. Similarity scores for n-grams from unigrams to the maximum size specified here will be calculated.
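
To make the range of n-gram sizes concrete, here is a small sketch that builds every n-gram from unigrams up to maxngram for one tokenized turn. The helper name ngrams_up_to is hypothetical and is not taken from the library.

```python
def ngrams_up_to(tokens, maxngram=2):
    """Map each n in 1..maxngram to the list of n-grams found in tokens.

    Hypothetical helper for illustration; not part of the library.
    """
    return {
        n: [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        for n in range(1, maxngram + 1)
    }


grams = ngrams_up_to(["i", "like", "tea"], maxngram=2)
```

A turn shorter than n tokens simply yields an empty list for that n.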

use_pretrained_vectors : boolean, optional (default: True)

Specify whether to use a pretrained gensim model for word2vec analysis (True) or to construct a new model from the provided corpus (False). If True, the file name of a valid model must be provided to the pretrained_input_file parameter.

ignore_duplicates : boolean, optional (default: True)

Specify whether to remove exact duplicates when calculating part-of-speech similarity scores (True) or to retain perfectly mimicked lexical items for POS similarity calculation (False).

add_stanford_tags : boolean, optional (default: False)

Specify whether to return part-of-speech similarity scores based on the Stanford POS tagger in addition to the Penn POS tagger (True) or to return only POS similarity scores from the Penn tagger (False).

input_as_directory : boolean, optional (default: True)

Specify whether the value passed to the input_files parameter should be read as a directory (True) or as a list of files to be processed (False).
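
The two input modes can be sketched as follows. The helper name resolve_input_files is hypothetical, and matching on a "*.txt" pattern is an assumption based on the description of the inputs as .txt files; the library may collect files differently.

```python
import glob
import os
import tempfile


def resolve_input_files(input_files, input_as_directory=True):
    """Return the list of file paths to process.

    Hypothetical helper for illustration; not part of the library.
    """
    if input_as_directory:
        # Treat the value as a directory and collect its .txt files.
        return sorted(glob.glob(os.path.join(input_files, "*.txt")))
    # Otherwise the caller already supplied the file names.
    return list(input_files)


# Demonstration with a throwaway directory and made-up file names.
with tempfile.TemporaryDirectory() as folder:
    for name in ("dyad1.txt", "dyad2.txt", "notes.md"):
        with open(os.path.join(folder, name), "w") as handle:
            handle.write("")
    from_directory = [
        os.path.basename(path) for path in resolve_input_files(folder)
    ]
    from_list = [
        os.path.basename(path)
        for path in resolve_input_files(
            [os.path.join(folder, "dyad2.txt")], input_as_directory=False
        )
    ]
```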

Returns:

real_final_turn_df : Pandas DataFrame

A dataframe of lexical, syntactic, and conceptual alignment scores between turns at the specified delay. NaN values will be returned for turns in which the speaker produced only words that were removed from the corpus (e.g., words that were too rare or too common) or words that were present in the corpus but not in the semantic model.

real_final_convo_df : Pandas DataFrame

A dataframe of lexical, syntactic, and conceptual alignment scores between participants across the entire conversation.
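
A call might look like the sketch below, which builds a model from the corpus itself instead of using pretrained vectors. All paths and file names are hypothetical placeholders, and the sketch assumes the two dataframes come back as a pair in the order listed under Returns. The call is guarded so that the block does nothing when the package or the input directory is missing.

```python
import os

# Hypothetical paths; replace them with the outputs of prepare_transcripts().
settings = {
    "input_files": "prepped_transcripts/",
    "output_file_directory": "alignment_output/",
    "semantic_model_input_file": "semantic_model_input.txt",
    "pretrained_input_file": None,      # no pretrained model is supplied
    "use_pretrained_vectors": False,    # so build the model from the corpus
    "delay": 1,
    "maxngram": 2,
    "input_as_directory": True,
}

try:
    import align
except ImportError:
    align = None

# Run only if the package and the (hypothetical) input directory exist.
if (
    align is not None
    and hasattr(align, "calculate_alignment")
    and os.path.isdir(settings["input_files"])
):
    # Assumption: the return value unpacks into the turn-level and
    # conversation-level dataframes, in that order.
    turn_df, convo_df = align.calculate_alignment(**settings)
```

Setting pretrained_input_file to None while leaving use_pretrained_vectors at its default of True would conflict with the parameter descriptions above, so the sketch sets both explicitly.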