Cassowary
d6e329ff39
- Add Unweighted mode to command line - Download nltk data if needed. |
||
---|---|---|
carkov | ||
typehints | ||
.gitignore | ||
LICENSE | ||
README.md | ||
TODO.md | ||
build.sh | ||
carkov-logo.png | ||
carkov-logo.svg | ||
clean.sh | ||
setup.cfg | ||
setup.py | ||
tox.ini |
README.md
carkov
This is a library for creating and walking simple markov chains. It is meant for things like text generators (such as ebooks bots and word generators) and thus is not 'mathetematically correct'. It has some tools for doing text analysis but more are planned in the future (stubs exist to illustrate some plans, see TODO.md).
Command line interface
This library includes a command line interface to analyzing text and then walk the chain and generate text from the analysis.
To analyze a corpus of text files, thus:
carkov analyze mychain.chain textfile1.txt textfile2.txt ... textfileN.txt
To walk a chain and generate text form it, thus:
carkov chain mychain.chain -c 10
There are two analysis modes currently supported, english
and
word
, which are passed to the analyze method with the -m
argument. english
mode analyzes the input in a word-wise method: the
input is segmented into (English-style) sentences, each of which are
analyzed as separate chains of words. word
segments the input into
tokens, each of which is analyzed as a series of characters
separately.
Analysis also allows a window size to be specified, so that each item
in the chain may be a fixed series of items of a specific length (for
example, the word foo
with a window of 2, would analyze to (_, ) ->
'f', (, f) -> o, (f, o) -> o, etc). The wider the window, the more
similar or identical to the input stream the output becomes since
there are fewer total options to follow any given token. This is
specified with the analysis command line with the -w
argument.
About Library
The library itself exposes objects and interfaces to do the same as the command line above. A todo item on this project is to generate documentation and examples, but looking at the contents of main.py should be instructive. The library is written in such a way as to be pretty agnostic about the items that are chained, and hypothetically any sequential set of things could work for this. Some framework would have to be written to support displaying these sorts of things but it should be possible if non-textual data were desired.
The library also provides a few mechanisms for serializing a ready to
use chain for reuse in other projects. The command line makes use of
the binary serialization mechanism (which uses msgpack
) to save
chains from the analysis step for re-use in the chain step. There is
also a mechanism which produces a python source file tthat can be
embedded in a target project so that a python project can use the
chain without having to include an extra data file. It should be noted
that this of course is extremely inefficient for large chains.