preprocess

Package Contents

Preprocess

class Preprocess(tokenizer=None, test_sample_size=None, test_p=0.2, stopwords='English', min_doc_count=0, max_doc_freq=1.0, keep_num=False, keep_alphanum=False, strip_html=False, no_lower=False, min_length=3, min_term=0, vocab_size=None, seed=42, verbose=True)

Parameters:

test_sample_size – Size of the test set.
test_p – Proportion of the data assigned to the test set. It is used to size the train sample relative to the size of the test set.
stopwords – List of stopwords to exclude.
min_doc_count – Exclude words that occur in fewer than this number of documents.
max_doc_freq – Exclude words that occur in more than this proportion of documents.
keep_num – Keep tokens made of only numbers.
keep_alphanum – Keep tokens made of a mixture of letters and numbers.
strip_html – Strip HTML tags.
no_lower – Do not lowercase text.
min_length – Minimum token length.
min_term – Minimum number of terms.
vocab_size – Size of the vocabulary (the most common words in the union of the train and test sets, after the exclusions above).
seed – Random integer seed (only relevant for choosing the test set).
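
The vocabulary-related parameters (min_doc_count, max_doc_freq, vocab_size) can be illustrated with a small standalone sketch. It does not use this package: build_vocab is a hypothetical helper, and both the tie-breaking rule and the order in which the filters are applied are assumptions made for the example.

```python
from collections import Counter


def build_vocab(tokenized_docs, min_doc_count=0, max_doc_freq=1.0, vocab_size=None):
    """Hypothetical helper: select a vocabulary by document frequency."""
    n_docs = len(tokenized_docs)

    # Count each word once per document (document frequency, not term frequency).
    doc_counts = Counter()
    for tokens in tokenized_docs:
        doc_counts.update(set(tokens))

    # Drop words that are too rare (min_doc_count) or too common (max_doc_freq).
    kept = {
        word: count
        for word, count in doc_counts.items()
        if count >= min_doc_count and count / n_docs <= max_doc_freq
    }

    # Rank by document count, most common first; ties are broken alphabetically
    # (an assumption of this sketch), then truncate to vocab_size if given.
    ranked = sorted(kept, key=lambda word: (-kept[word], word))
    if vocab_size is not None:
        ranked = ranked[:vocab_size]
    return sorted(ranked)


docs = [
    ["topic", "model", "data"],
    ["topic", "model", "text"],
    ["topic", "neural", "text"],
    ["topic", "data"],
]

# "topic" appears in every document (frequency 1.0 > 0.75) and "neural" in only
# one document (count 1 < 2), so both are excluded.
vocab = build_vocab(docs, min_doc_count=2, max_doc_freq=0.75)
print(vocab)  # ['data', 'model', 'text']
```

Note that min_doc_count is an absolute number of documents, while max_doc_freq is a proportion of documents, so the two thresholds are compared against different quantities.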

test_sample_size = None

min_doc_count = 0

max_doc_freq = 1.0

min_term = 0

test_p = 0.2

vocab_size = None

seed = 42
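
One possible reading of how test_sample_size, test_p and seed interact is sketched below. It is an assumption based only on the parameter descriptions, not the class's actual code: sample_split is a hypothetical function that derives the train sample size from the test sample size so that the test set makes up a proportion test_p of the sampled documents.

```python
import random


def sample_split(n_docs, test_sample_size, test_p=0.2, seed=42):
    """Hypothetical helper: pick test and train indices with a fixed seed."""
    # If the test set is a fraction test_p of the sample, the train set is the
    # remaining (1 - test_p) fraction of the same total.
    train_sample_size = round(test_sample_size * (1 - test_p) / test_p)
    if test_sample_size + train_sample_size > n_docs:
        raise ValueError("not enough documents for the requested sample sizes")

    # A dedicated generator keeps the split reproducible for a given seed.
    rng = random.Random(seed)
    indices = list(range(n_docs))
    rng.shuffle(indices)

    test_idx = sorted(indices[:test_sample_size])
    train_idx = sorted(indices[test_sample_size:test_sample_size + train_sample_size])
    return train_idx, test_idx


train_idx, test_idx = sample_split(n_docs=1000, test_sample_size=200, test_p=0.2)
print(len(train_idx), len(test_idx))  # 800 200
```

With test_sample_size=200 and test_p=0.2, the implied total is 1000 documents, of which 800 go to the train sample. Calling the function again with the same seed returns the same indices.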

parse(texts, vocab)
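
The behaviour of parse is not described on this page. Its signature suggests that it maps texts onto a fixed vocabulary, and save accepts bag-of-words arguments (train_bow, test_bow). Purely as an illustration of that idea, the hypothetical function texts_to_bow below counts vocabulary words per document; the whitespace tokenization and the list-of-lists output are assumptions of this sketch, not the method's actual behaviour.

```python
from collections import Counter


def texts_to_bow(texts, vocab):
    """Hypothetical stand-in: bag-of-words counts over a fixed vocabulary."""
    word_to_index = {word: i for i, word in enumerate(vocab)}
    bow = []
    for text in texts:
        # Whitespace tokenization (an assumption); out-of-vocabulary words are ignored.
        counts = Counter(tok for tok in text.split() if tok in word_to_index)
        row = [0] * len(vocab)
        for word, count in counts.items():
            row[word_to_index[word]] = count
        bow.append(row)
    return bow


vocab = ["data", "model", "topic"]
texts = ["topic model of topic data", "unseen words only"]
bow = texts_to_bow(texts, vocab)
print(bow)  # [[1, 1, 2], [0, 0, 0]]
```

Each row has one entry per vocabulary word, so a document containing no vocabulary words becomes an all-zero row.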

preprocess_jsonlist(dataset_dir, label_name=None, pretrained_WE=False)

convert_labels(train_labels, test_labels)

preprocess(raw_train_texts, train_labels=None, raw_test_texts=None, test_labels=None, pretrained_WE=False)

save(output_dir, vocab, train_texts, train_bow, word_embeddings=None, train_labels=None, test_texts=None, test_bow=None, test_labels=None)