An SRX segmenting engine for Ruby
https://github.com/amake/srx-ruby.git
SRX is a specification for segmenting text, i.e. splitting text into sentences. More specifically it is
This gem provides facilities for reading SRX files and an engine for performing segmentation.
Only a minimal rule set is supplied by default; for actual usage you are encouraged to supply your own SRX rules. One such set of rules is that from LanguageTool; this is conveniently packaged into a companion gem: srx-languagetool-ruby.
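To give a feel for how SRX-style rules drive segmentation, here is a minimal pure-Ruby sketch. It is not this gem's implementation or API: the `Rule` struct, the `RULES` list and the `segment` method are hypothetical names made up for illustration. It assumes the general SRX model, where each rule pairs a "before break" pattern with an "after break" pattern and is marked as either a break or a no-break (exception) rule, and the first rule that matches at a position decides.

```ruby
# Hypothetical, simplified model of SRX-style rules (not the srx gem's API).
# break:  true for a break rule, false for a no-break (exception) rule
# before: pattern that must match the text ending at the candidate position
# after:  pattern that must match the text starting at the candidate position
Rule = Struct.new(:break, :before, :after)

RULES = [
  # Exception rule: never split right after these abbreviations.
  Rule.new(false, /\b(?:Mr|Mrs|Dr)\./, /\s/),
  # Break rule: split after sentence-final punctuation followed by whitespace.
  Rule.new(true, /[.?!]+/, /\s/)
].freeze

def segment(text, rules = RULES)
  segments = []
  start = 0
  (1...text.length).each do |pos|
    # Rules are tried in order; the first one matching around pos wins.
    rule = rules.find do |r|
      text[0...pos].match?(/#{r.before}\z/) && text[pos..].match?(/\A#{r.after}/)
    end
    next unless rule&.break

    segments << text[start...pos]
    start = pos
  end
  segments << text[start..]
  segments
end

p segment('Dr. Smith is here. Call Mr. Jones.')
```

Listing exception rules ahead of break rules is what keeps abbreviations from ending a segment in this sketch. A real rule set such as the LanguageTool one is far larger and is selected per language.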
There are lots of good segmentation gems out there such as
What makes SRX different is:
Some other advantages that are not unique to SRX:
The SRX spec calls for ICU regular expressions, but this library uses standard Ruby regexp. Please note:
\x{hhhh} → \u{hhhh}
\0ooo → \u{hhhh}
Add this line to your application's Gemfile:
gem 'srx'
And then execute:
$ bundle install
Or install it yourself as:
$ gem install srx
Use the default rules like so. Specify the language according to the <maprules>
of your SRX (usually two-letter ISO 639-1
codes).
require 'srx'
data = Srx::Data.default
engine = Srx::Engine.new(data)
engine.segment('Hi. How are you?', language: 'en') #=> ["Hi.", " How are you?"]
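Note that in the result above the whitespace between sentences stays attached to the following segment, so joining the segments gives back the input. If trimmed sentences are wanted, a small post-processing step can be applied. The helper below is a hypothetical sketch, not part of the gem; it only assumes that the result is an array of strings as shown above.

```ruby
# Hypothetical helper (not provided by the srx gem): strip surrounding
# whitespace from each segment and drop segments that end up empty.
def tidy_segments(segments)
  segments.map(&:strip).reject(&:empty?)
end

# Segments copied from the example above; in real code they would come
# from engine.segment.
segments = ['Hi.', ' How are you?']
p tidy_segments(segments)
```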
Or bring your own rules:
data = Srx::Data.from_file(path: 'path/to/my/rules.srx')
engine = Srx::Engine.new(data)
Specify the format as :xml or :html to benefit from special handling of
tags:
# This should only be one segment, but handling as plain text incorrectly
# produces two segments.
input = 'foo <bar baz="a. b."> bazinga'
Srx::Engine.new(Srx::Data.default).segment(input, language: 'en')
#=> ["foo <bar baz=\"a.", " b.\"> bazinga"]
Srx::Engine.new(Srx::Data.default, format: :xml).segment(input, language: 'en')
#=> ["foo <bar baz=\"a. b.\"> bazinga"]
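One way to picture the tag handling is as a filter on candidate break positions: any break that would fall inside a tag is discarded. The following pure-Ruby sketch illustrates that idea only. It is not how this gem implements its :xml or :html formats, and `TAG_PATTERN`, `BREAK_PATTERN`, `tag_ranges` and `segment_markup` are hypothetical names. The tag pattern is deliberately naive (it does not handle comments, CDATA, or `>` inside attribute values).

```ruby
# Hypothetical sketch (not the srx gem's implementation).
TAG_PATTERN = /<[^>]*>/               # naive: anything between < and >
BREAK_PATTERN = /(?<=[.?!])(?=\s)/    # zero-width: after punctuation, before whitespace

# Character ranges occupied by tags in the text.
def tag_ranges(text)
  ranges = []
  text.scan(TAG_PATTERN) do
    match = Regexp.last_match
    ranges << (match.begin(0)...match.end(0))
  end
  ranges
end

def segment_markup(text)
  tags = tag_ranges(text)
  positions = []
  text.scan(BREAK_PATTERN) { positions << Regexp.last_match.begin(0) }
  # Drop candidate breaks that fall strictly inside a tag.
  positions.reject! { |pos| tags.any? { |r| pos > r.begin && pos < r.end } }
  bounds = [0, *positions, text.length]
  bounds.each_cons(2).map { |from, to| text[from...to] }
end

p segment_markup('foo <bar baz="a. b."> bazinga')
```

With no tags in the input, the sketch falls back to plain punctuation-based splitting.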
After checking out the repo, run bin/setup to install dependencies. Then, run
rake test to run the tests. You can also run bin/console for an interactive
prompt that will allow you to experiment.
To install this gem onto your local machine, run bundle exec rake install. To
release a new version, update the version number in version.rb, and then run
bundle exec rake release, which will create a git tag for the version, push
git commits and the created tag, and push the .gem file to
rubygems.org.
Bug reports and pull requests are welcome on GitHub at https://github.com/amake/srx.
The gem is available as open source under the terms of the MIT License.