57 lines
2.6 KiB
Markdown
57 lines
2.6 KiB
Markdown
|
# carkov #
|
||
|
|
||
|
This is a library for creating and walking simple markov chains. It is
|
||
|
meant for things like text generators (such as ebooks bots and word
|
||
|
generators) and thus is not 'mathetematically correct'. It has some
|
||
|
tools for doing text analysis but more are planned in the future
|
||
|
(stubs exist to illustrate some plans, see TODO.md).
|
||
|
|
||
|
## Command line interface ##
|
||
|
|
||
|
This library includes a command line interface to analyzing text and
|
||
|
then walk the chain and generate text from the analysis.
|
||
|
|
||
|
To analyze a corpus of text files, thus:
|
||
|
|
||
|
`carkov analyze mychain.chain textfile1.txt textfile2.txt ... textfileN.txt`
|
||
|
|
||
|
To walk a chain and generate text form it, thus:
|
||
|
|
||
|
`carkov chain mychain.chain -c 10`
|
||
|
|
||
|
There are two analysis modes currently supported, `english` and
|
||
|
`word`, which are passed to the analyze method with the `-m`
|
||
|
argument. `english` mode analyzes the input in a word-wise method: the
|
||
|
input is segmented into (English-style) sentences, each of which are
|
||
|
analyzed as separate chains of words. `word` segments the input into
|
||
|
tokens, each of which is analyzed as a series of characters
|
||
|
separately.
|
||
|
|
||
|
Analysis also allows a window size to be specified, so that each item
|
||
|
in the chain may be a fixed series of items of a specific length (for
|
||
|
example, the word `foo` with a window of 2, would analyze to (_, _) ->
|
||
|
'f', (_, f) -> o, (f, o) -> o, etc). The wider the window, the more
|
||
|
similar or identical to the input stream the output becomes since
|
||
|
there are fewer total options to follow any given token. This is
|
||
|
specified with the analysis command line with the `-w` argument.
|
||
|
|
||
|
## About Library ##
|
||
|
|
||
|
The library itself exposes objects and interfaces to do the same as
|
||
|
the command line above. A todo item on this project is to generate
|
||
|
documentation and examples, but looking at the contents of __main__.py
|
||
|
should be instructive. The library is written in such a way as to be
|
||
|
pretty agnostic about the items that are chained, and hypothetically
|
||
|
any sequential set of things could work for this. Some framework would
|
||
|
have to be written to support displaying these sorts of things but it
|
||
|
should be possible if non-textual data were desired.
|
||
|
|
||
|
The library also provides a few mechanisms for serializing a ready to
|
||
|
use chain for reuse in other projects. The command line makes use of
|
||
|
the binary serialization mechanism (which uses `msgpack`) to save
|
||
|
chains from the analysis step for re-use in the chain step. There is
|
||
|
also a mechanism which produces a python source file tthat can be
|
||
|
embedded in a target project so that a python project can use the
|
||
|
chain without having to include an extra data file. It should be noted
|
||
|
that this of course is extremely inefficient for large chains.
|