preprocess
Package Contents
- class Preprocess(tokenizer=None, test_sample_size=None, test_p=0.2, stopwords='English', min_doc_count=0, max_doc_freq=1.0, keep_num=False, keep_alphanum=False, strip_html=False, no_lower=False, min_length=3, min_term=0, vocab_size=None, seed=42, verbose=True)
- Parameters:
test_sample_size – Size of the test set.
test_p – Proportion of documents held out as the test set. This helps sample the train set based on the size of the test set.
stopwords – List of stopwords to exclude.
min_doc_count – Exclude words that occur in fewer than this number of documents.
max_doc_freq – Exclude words that occur in more than this proportion of documents.
keep_num – Keep tokens made of only numbers.
keep_alphanum – Keep tokens made of a mixture of letters and numbers.
strip_html – Strip HTML tags.
no_lower – Do not lowercase text.
min_length – Minimum token length.
min_term – Minimum number of terms.
vocab_size – Size of the vocabulary (the most common words in the union of the train and test sets, after the exclusions above).
seed – Random integer seed (only relevant for choosing the test set).
- test_sample_size = None
- min_doc_count = 0
- max_doc_freq = 1.0
- min_term = 0
- test_p = 0.2
- vocab_size = None
- seed = 42
- parse(texts, vocab)
- preprocess_jsonlist(dataset_dir, label_name=None, pretrained_WE=False)
- convert_labels(train_labels, test_labels)
- preprocess(raw_train_texts, train_labels=None, raw_test_texts=None, test_labels=None, pretrained_WE=False)
- save(output_dir, vocab, train_texts, train_bow, word_embeddings=None, train_labels=None, test_texts=None, test_bow=None, test_labels=None)
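The vocabulary exclusions described by min_doc_count, max_doc_freq and vocab_size can be illustrated with a small standalone sketch. This is not the code of the Preprocess class: the helper name build_vocab is hypothetical, and ranking "most common" by document count (rather than total term count) is an assumption made for the example.

```python
from collections import Counter


def build_vocab(tokenized_docs, min_doc_count=0, max_doc_freq=1.0, vocab_size=None):
    """Hypothetical helper mirroring the documented exclusions (not the library's code)."""
    n_docs = len(tokenized_docs)

    # Document count: in how many documents each word appears at least once.
    doc_counts = Counter()
    for tokens in tokenized_docs:
        doc_counts.update(set(tokens))

    # Drop words that are too rare (min_doc_count) or too widespread (max_doc_freq).
    kept = [
        word
        for word, count in doc_counts.items()
        if count >= min_doc_count and count / n_docs <= max_doc_freq
    ]

    # Assumption: "most common" is ranked by document count; ties break alphabetically.
    kept.sort(key=lambda w: (-doc_counts[w], w))
    if vocab_size is not None:
        kept = kept[:vocab_size]
    return sorted(kept)


docs = [
    ["topic", "model", "text"],
    ["topic", "model", "corpus"],
    ["topic", "neural", "text"],
    ["topic", "model", "data"],
]

# "topic" is in every document (frequency 1.0 > 0.9) and is dropped;
# "corpus", "neural" and "data" each occur in one document (< 2) and are dropped.
vocab = build_vocab(docs, min_doc_count=2, max_doc_freq=0.9)
print(vocab)
```

With the default values (min_doc_count=0, max_doc_freq=1.0, vocab_size=None) no word is excluded, which matches the permissive defaults in the class signature.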