Project Modules

This project has declared the following modules:

Name Description
TextProcMain The TextProc main and entity classes module.
TextProcStep The TextProc processing step API.
TextProcLogging The TextProc logging facilities.
TextProcPersistence The TextProc persistence access layer, using Java Persistence API.
AbstractTppTextProcStep This module provides a skeletal implementation of a TextProc processing step in the form of an abstract class, to keep code DRY and consistent between steps.
TppTokenizationTextProcStep Processing step for TextProc that tokenizes its input documents via Text Processing Python.
TppStopwordFilteringTextProcStep Processing step for TextProc that removes stopwords from the input documents, via Text Processing Python.
TppLemmatizationTextProcStep Processing step for TextProc that lemmatizes the tokens of its input documents, which are expected to be separated by spaces.
CoreNLPTokenizationTextProcStep Processing step for TextProc that tokenizes its input documents via Stanford CoreNLP.
CoreNLPLemmatizationTextProcStep Processing step for TextProc that lemmatizes each token of its input documents via Stanford CoreNLP.
CoreNLPEntityExtractionTextProcStep Processing step for TextProc that extracts new named entities from documents, starting from seed sets of entities, using bootstrapped pattern-based learning.
CoreNLPKnowledgeBasePopulationTextProcStep Processing step for TextProc that populates a knowledge base stored in Apache Jena's TDB2 format, using the NER, OpenIE and sentiment annotation facilities included with CoreNLP.
MentionFilteringTextProcStep Processing step for TextProc that removes Reddit mentions from the input documents.
EmptyFilteringTextProcStep Processing step for TextProc that drops input documents which are empty of meaning, so that they are not copied as processed documents.
LuceneIndexTextProcStep Processing step for TextProc that builds a Lucene index for the input documents.
Apache Lucene (uber JAR) TextProc is an automated text processing tool that efficiently and flexibly applies NLP to input documents in a relational database.
EJML (uber JAR) TextProc is an automated text processing tool that efficiently and flexibly applies NLP to input documents in a relational database.
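
The relationship between the step API, the skeletal abstract class and a concrete step such as MentionFilteringTextProcStep can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the names TextStep, AbstractTextStep and MentionFilteringStep, the single-string process method, and the regular expression used to recognize Reddit mentions (u/name or /u/name) are hypothetical and are not taken from the TextProc sources, whose actual API may differ.

```java
import java.util.regex.Pattern;

// Hypothetical step contract; the real TextProcStep API may look different.
interface TextStep {
    String process(String document);
}

// Skeletal implementation: shared clean-up lives here once,
// so concrete steps only supply their own transformation.
abstract class AbstractTextStep implements TextStep {
    @Override
    public final String process(String document) {
        if (document == null) {
            return "";
        }
        // Apply the step-specific logic, then normalize whitespace.
        return transform(document).trim().replaceAll("\\s+", " ");
    }

    protected abstract String transform(String document);
}

// Illustrative mention filter. The pattern is an assumption about what a
// Reddit mention looks like: an optional slash, "u/", then a user name.
final class MentionFilteringStep extends AbstractTextStep {
    private static final Pattern MENTION =
            Pattern.compile("(?<![A-Za-z0-9_/])/?u/[A-Za-z0-9_-]+");

    @Override
    protected String transform(String document) {
        return MENTION.matcher(document).replaceAll("");
    }
}

class Demo {
    public static void main(String[] args) {
        TextStep step = new MentionFilteringStep();
        // prints "Thanks for the tip"
        System.out.println(step.process("Thanks /u/some_user for the tip"));
    }
}
```

Keeping the whitespace normalization in the abstract class means a second step, for example one that filters stopwords, would inherit the same behaviour without repeating it, which is the kind of consistency between steps that the AbstractTppTextProcStep description refers to.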