preprocess
==========

.. py:module:: topmost.preprocess


.. toctree::
   :titlesonly:
   :maxdepth: 1

   preprocess/index.rst


Package Contents
----------------

.. autoapisummary::

   topmost.preprocess.Preprocess


.. py:class:: Preprocess(tokenizer=None, test_sample_size=None, test_p=0.2, stopwords='English', min_doc_count=0, max_doc_freq=1.0, keep_num=False, keep_alphanum=False, strip_html=False, no_lower=False, min_length=3, min_term=0, vocab_size=None, seed=42, verbose=True)

   :param test_sample_size: Size of the test set.
   :param test_p: Proportion of the test set. This helps sample the train set based on the size of the test set.
   :param stopwords: List of stopwords to exclude.
   :param min_doc_count: Exclude words that occur in fewer than this number of documents.
   :param max_doc_freq: Exclude words that occur in more than this proportion of documents.
   :param keep_num: Keep tokens made of only numbers.
   :param keep_alphanum: Keep tokens made of a mixture of letters and numbers.
   :param strip_html: Strip HTML tags.
   :param no_lower: Do not lowercase text.
   :param min_length: Minimum token length.
   :param min_term: Minimum term number.
   :param vocab_size: Size of the vocabulary (by most common in the union of train and test sets, following the above exclusions).
   :param seed: Random integer seed (only relevant for choosing the test set).


   .. py:attribute:: test_sample_size
      :value: None


   .. py:attribute:: min_doc_count
      :value: 0


   .. py:attribute:: max_doc_freq
      :value: 1.0


   .. py:attribute:: min_term
      :value: 0


   .. py:attribute:: test_p
      :value: 0.2


   .. py:attribute:: vocab_size
      :value: None


   .. py:attribute:: seed
      :value: 42


   .. py:method:: parse(texts, vocab)


   .. py:method:: preprocess_jsonlist(dataset_dir, label_name=None, pretrained_WE=False)


   .. py:method:: convert_labels(train_labels, test_labels)


   .. py:method:: preprocess(raw_train_texts, train_labels=None, raw_test_texts=None, test_labels=None, pretrained_WE=False)


   .. py:method:: save(output_dir, vocab, train_texts, train_bow, word_embeddings=None, train_labels=None, test_texts=None, test_bow=None, test_labels=None)
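
The document-frequency parameters can be easier to follow with a small worked example. The sketch below is not the ``Preprocess`` implementation. It is a standalone illustration of how ``min_doc_count``, ``max_doc_freq`` and ``vocab_size`` are described above, written in plain Python. The helper name ``filter_vocab`` is hypothetical, and ranking the surviving words by document count (with alphabetical tie-breaking) is an assumption made for the example.

```python
from collections import Counter


def filter_vocab(tokenized_docs, min_doc_count=0, max_doc_freq=1.0, vocab_size=None):
    """Hypothetical helper: keep words that pass the document-frequency thresholds."""
    n_docs = len(tokenized_docs)

    # Count the number of documents each word appears in (not total occurrences).
    doc_counts = Counter()
    for tokens in tokenized_docs:
        doc_counts.update(set(tokens))

    # Drop words that are too rare (min_doc_count) or too common (max_doc_freq).
    kept = [
        word
        for word, count in doc_counts.items()
        if count >= min_doc_count and count / n_docs <= max_doc_freq
    ]

    # Assumption: rank by document count, break ties alphabetically, then truncate.
    kept.sort(key=lambda w: (-doc_counts[w], w))
    if vocab_size is not None:
        kept = kept[:vocab_size]
    return kept


docs = [
    ["topic", "model", "text"],
    ["topic", "model", "word"],
    ["topic", "corpus", "word"],
    ["topic", "text"],
]

# "corpus" appears in 1 document (< 2) and "topic" appears in all 4 (> 75%),
# so both are excluded.
vocab = filter_vocab(docs, min_doc_count=2, max_doc_freq=0.75)
print(vocab)  # ['model', 'text', 'word']
```

With the default thresholds (``min_doc_count=0``, ``max_doc_freq=1.0``) no word is excluded, so only ``vocab_size`` limits the result.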