Canine: Pre-training an Efficient Tokenization-Free Encoder for Language Representation

Abstract

Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet nearly all commonly used models still require an explicit tokenization step. While recent tokenization approaches based on data-derived subword lexicons are less brittle than manually engineered tokenizers, these techniques are not equally suited to all languages, and …