preprocess

Package Contents

class Preprocess(tokenizer=None, test_sample_size=None, test_p=0.2, stopwords='English', min_doc_count=0, max_doc_freq=1.0, keep_num=False, keep_alphanum=False, strip_html=False, no_lower=False, min_length=3, min_term=0, vocab_size=None, seed=42, verbose=True)
Parameters:
  • test_sample_size – Size of the test set.

  • test_p – Proportion of the data used as the test set. The train set is sampled based on the resulting size of the test set.

  • stopwords – List of stopwords to exclude.

  • min_doc_count – Exclude words that occur in fewer than this number of documents.

  • max_doc_freq – Exclude words that occur in more than this proportion of documents.

  • keep_num – Keep tokens made of only numbers.

  • keep_alphanum – Keep tokens made of a mixture of letters and numbers.

  • strip_html – Strip HTML tags.

  • no_lower – Do not lowercase text.

  • min_length – Minimum token length.

  • min_term – Minimum number of terms.

  • vocab_size – Size of the vocabulary (the most common words in the union of the train and test sets, after the exclusions above).

  • seed – Random integer seed (only relevant for choosing the test set).
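
The interaction between min_doc_count, max_doc_freq and vocab_size can be sketched in plain Python. This is a minimal illustration, not the class's actual implementation: build_vocab is a hypothetical helper, the input is assumed to be already tokenized, and "most common" is assumed here to mean highest document count.

```python
from collections import Counter


def build_vocab(tokenized_docs, min_doc_count=0, max_doc_freq=1.0, vocab_size=None):
    """Hypothetical helper: choose a vocabulary using document-frequency cutoffs."""
    n_docs = len(tokenized_docs)
    if n_docs == 0:
        return []

    # For each word, count the number of documents that contain it.
    doc_counts = Counter()
    for tokens in tokenized_docs:
        doc_counts.update(set(tokens))

    # Drop words that are too rare (min_doc_count) or too widespread (max_doc_freq).
    kept = [
        (word, count)
        for word, count in doc_counts.items()
        if count >= min_doc_count and count / n_docs <= max_doc_freq
    ]

    # Assumption: rank by document count, breaking ties alphabetically.
    kept.sort(key=lambda item: (-item[1], item[0]))
    if vocab_size is not None:
        kept = kept[:vocab_size]
    return sorted(word for word, _ in kept)


docs = [
    ["topic", "model", "text"],
    ["topic", "model", "corpus"],
    ["topic", "neural", "model"],
    ["topic", "corpus", "rare"],
]
vocab = build_vocab(docs, min_doc_count=2, max_doc_freq=0.75)
print(vocab)
```

In this toy corpus, "topic" appears in every document and so exceeds max_doc_freq=0.75, while "text", "neural" and "rare" each appear in only one document and fall below min_doc_count=2. Passing vocab_size would then truncate the remaining words to the most common ones.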

test_sample_size = None
min_doc_count = 0
max_doc_freq = 1.0
min_term = 0
test_p = 0.2
vocab_size = None
seed = 42
parse(texts, vocab)
preprocess_jsonlist(dataset_dir, label_name=None, pretrained_WE=False)
convert_labels(train_labels, test_labels)
preprocess(raw_train_texts, train_labels=None, raw_test_texts=None, test_labels=None, pretrained_WE=False)
save(output_dir, vocab, train_texts, train_bow, word_embeddings=None, train_labels=None, test_texts=None, test_bow=None, test_labels=None)