
MLiPS Context

Author: Mike Corley


Motivating Example:

Modern NLP frameworks provide machine learning algorithms for different types of textual processing: tokenization, parsing, semantic tagging, named-entity recognition (NER), and so on. Each class of algorithm has its own characteristics and constraints: is the model learned, tested, and evaluated, and is it prediction- or classification-based? Each varies in input/output requirements, parametrization, data formats, thresholds, etc. The NER algorithm uses pretrained models to extract named entities (people, organizations, and locations) from textual sources. In this example, valid input requires well-formed, grammatically correct English sentences, yet the source document collection is a large (e.g., terabyte-scale) repository of binary-formatted files (PDF, images, PowerPoint, etc.). Other applications may have significantly different characteristics. Traditional methods often require substantial human interaction: managing scripts, providing bridges between stages, and evaluating process quality.

Traditional Approach:

Develop scripted components to capture the workflow, piping the output of one tool to the input of another (e.g., "cat somefile.txt | grep…"):

foreach input_file {
   Tika(input_file) | clean data stage 1 | transform data | clean data stage 2 | NER | …
}
Here, Tika is an Apache.org Java application that understands how to parse a variety of document types, and OpenNLP is another Apache.org library of tools that supports many of the activities needed for applications like our motivating example. This classic Unix-style (text-based) shell scripting with input/output redirection is rigid and inflexible, and it often results in a proliferation of scripts. Is there a way to represent the workflow differently, to flexibly manage variation across all of the different tools, frameworks, algorithms, and data-handling needs? Is there a way to maximize synergy and enable disparate tools to work together without the need to change a script? Define an abstract processing work-flow pipeline!
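To make the idea concrete, here is a minimal sketch of such an abstract pipeline. This is an illustration only, not part of MLiPS: each stage is simply a callable from document to document, and the stage functions (extract_text, clean, ner_tag) are hypothetical placeholders standing in for tools such as Tika or OpenNLP, not real bindings to those libraries.

```python
# Sketch: an abstract processing pipeline. A Pipeline composes stages,
# so stages can be reordered, swapped, or added without rewriting a
# shell script for every variation.

from typing import Callable, List

Stage = Callable[[str], str]  # each stage maps a document to a document

class Pipeline:
    """Run a sequence of stages over a document, in order."""

    def __init__(self, stages: List[Stage]):
        self.stages = stages

    def run(self, document: str) -> str:
        for stage in self.stages:
            document = stage(document)
        return document

# Hypothetical placeholder stages (real tools would be wrapped here).
def extract_text(doc: str) -> str:   # e.g., a Tika parse step
    return doc.strip()

def clean(doc: str) -> str:          # e.g., a data-cleaning step
    return " ".join(doc.split())

def ner_tag(doc: str) -> str:        # e.g., an OpenNLP NER step
    return f"<doc>{doc}</doc>"

pipeline = Pipeline([extract_text, clean, ner_tag])
result = pipeline.run("  Some   raw   text  ")
```

Because the workflow is data (a list of stages) rather than a fixed script, swapping a tokenizer or inserting a second cleaning pass is a one-line change to the stage list, which is exactly the flexibility the shell-script approach lacks.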