A simple XML-like parser for LLM output that avoids much of the pain of traditional XML parsing
https://github.com/Significant-Gravitas/gravitasml.git
A lightweight Python library for parsing custom markup languages, built and used by AutoGPT
GravitasML is purpose-built for parsing simple markup structures, particularly LLM-generated outputs.
By design, it excludes XML features that can introduce security risks.
GravitasML is immune to common XML vulnerabilities because it simply doesn't implement the features that enable them:
| Attack Type | GravitasML |
|---|---|
| Billion Laughs | ✅ Safe (no entity support) |
| Quadratic Blowup | ✅ Safe (no entity expansion) |
| External Entity Expansion (XXE) | ✅ Safe (no external resources) |
| DTD Retrieval | ✅ Safe (no DTD support) |
| Decompression Bomb | ✅ Safe (no decompression) |
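To see why the absence of entity expansion matters, the arithmetic behind a Billion Laughs payload can be sketched in a few lines. The figures used here (a 3-character base string, 10 references per level, 9 nested levels) describe the commonly cited form of the attack and are illustrative assumptions, not something taken from GravitasML.

```python
def expanded_size(base_len: int, refs_per_level: int, levels: int) -> int:
    """Characters produced when the top-level entity is fully expanded.

    Each level references the previous level `refs_per_level` times,
    so the size grows exponentially with the nesting depth.
    """
    return base_len * refs_per_level ** levels


# Commonly cited payload: "lol" referenced 10 times per level, 9 levels deep.
size = expanded_size(base_len=3, refs_per_level=10, levels=9)
print(f"{size:,} characters after expansion")
```

A parser with no entity support never performs this expansion, so an entity reference in the input cannot trigger this kind of blowup.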
GravitasML transforms custom markup into Python data structures.
Install it with pip:
pip install gravitasml
Or with Poetry:
poetry add gravitasml
from gravitasml.token import tokenize
from gravitasml.parser import Parser
# Parse simple markup
markup = "<name>GravitasML</name>"
tokens = tokenize(markup)
parser = Parser(tokens)
result = parser.parse()
print(result) # {'name': 'GravitasML'}
from gravitasml.token import tokenize
from gravitasml.parser import Parser
markup = """
<person>
<name>John Doe</name>
<contact>
<email>john@example.com</email>
<phone>555-0123</phone>
</contact>
</person>
"""
tokens = tokenize(markup)
result = Parser(tokens).parse()
# Result: {
#     'person': {
#         'name': 'John Doe',
#         'contact': {
#             'email': 'john@example.com',
#             'phone': '555-0123'
#         }
#     }
# }
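LLM output often omits fields, so reading a nested result with chained indexing can raise `KeyError`. One possible way to handle this is a small lookup helper. `dig` is a hypothetical name used for illustration and is not part of GravitasML; the dictionary below simply copies the result shown above.

```python
def dig(data, *keys, default=None):
    """Follow `keys` through nested dicts; return `default` if any step is missing."""
    current = data
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Same shape as the parsed result shown above.
result = {
    "person": {
        "name": "John Doe",
        "contact": {"email": "john@example.com", "phone": "555-0123"},
    }
}

print(dig(result, "person", "contact", "email"))                    # john@example.com
print(dig(result, "person", "address", "city", default="unknown"))  # unknown
```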
Transform your markup directly into validated Pydantic models:
from pydantic import BaseModel
from gravitasml.token import tokenize
from gravitasml.parser import Parser
class Contact(BaseModel):
    email: str
    phone: str

class Person(BaseModel):
    name: str
    contact: Contact
markup = """
<person>
<name>Jane Smith</name>
<contact>
<email>jane@example.com</email>
<phone>555-9876</phone>
</contact>
</person>
"""
tokens = tokenize(markup)
parser = Parser(tokens)
person = parser.parse_to_pydantic(Person)
print(person.name) # Jane Smith
print(person.contact.email) # jane@example.com
GravitasML automatically converts repeated tags into lists:
from gravitasml.token import tokenize
from gravitasml.parser import Parser
markup = "<tag><a>value1</a><a>value2</a></tag>"
tokens = tokenize(markup)
result = Parser(tokens).parse()
# Result: {'tag': [{'a': 'value1'}, {'a': 'value2'}]}
# Multiple root tags with the same name also become a list
markup2 = "<tag>content1</tag><tag>content2</tag>"
tokens2 = tokenize(markup2)
result2 = Parser(tokens2).parse()
# Result: [{'tag': 'content1'}, {'tag': 'content2'}]
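Because a single root tag yields a dictionary while repeated root tags yield a list, calling code may want one shape to iterate over. A minimal sketch follows, assuming the result shapes shown above; `as_list` is a hypothetical helper, not a GravitasML function.

```python
def as_list(parsed):
    """Wrap a single parsed item in a list; return lists unchanged."""
    return parsed if isinstance(parsed, list) else [parsed]


single = {"tag": "content1"}                          # one root tag
multiple = [{"tag": "content1"}, {"tag": "content2"}]  # repeated root tags

for item in as_list(single) + as_list(multiple):
    print(item["tag"])
```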
Tag names are automatically normalized - spaces become underscores and names are lowercased:
from gravitasml.token import tokenize
from gravitasml.parser import Parser
# Spaces in tag names are converted to underscores
markup = "<User Profile><First Name>Alice</First Name></User Profile>"
tokens = tokenize(markup)
result = Parser(tokens).parse()
# Result: {'user_profile': {'first_name': 'Alice'}}
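The normalization rule described above can be approximated in plain Python, which is handy for predicting the keys a given tag will produce. This is a sketch of the stated rule only (lowercase, spaces to underscores); `normalize_tag_name` is a hypothetical name, and the library's own handling of other characters may differ.

```python
def normalize_tag_name(name: str) -> str:
    """Approximate the documented rule: lowercase, spaces become underscores."""
    return name.lower().replace(" ", "_")


print(normalize_tag_name("User Profile"))  # user_profile
print(normalize_tag_name("First Name"))    # first_name
```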
GravitasML uses a two-stage parsing approach:
1. Tokenization (gravitasml.token) - Converts raw markup into a stream of tokens
2. Parsing (gravitasml.parser) - Builds a tree structure and converts to Python objects
GravitasML comes with a test suite. To run the tests, execute the following command:
python -m unittest discover -v
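For readers curious about the two-stage approach described earlier, the idea of tokenizing first and then building a structure can be illustrated with a toy version that uses only the standard library. This is not GravitasML's implementation: `sketch_tokenize`, `sketch_parse`, and the token tuples are assumptions made for illustration, and the sketch deliberately skips features such as repeated tags and name normalization.

```python
import re

# A tag (<name> or </name>) or a run of text between tags.
TOKEN_RE = re.compile(r"<(/?)([^<>]+)>|([^<>]+)")


def sketch_tokenize(markup):
    """Stage 1: turn markup into ('open' | 'close' | 'text', value) tuples."""
    tokens = []
    for match in TOKEN_RE.finditer(markup):
        slash, name, text = match.groups()
        if name is not None:
            tokens.append(("close" if slash else "open", name.strip()))
        elif text.strip():  # ignore whitespace between tags
            tokens.append(("text", text.strip()))
    return tokens


def sketch_parse(tokens):
    """Stage 2: fold the token stream into nested dicts and strings."""
    pos = 0

    def parse_children(closing_name):
        nonlocal pos
        children, text = {}, None
        while pos < len(tokens):
            kind, value = tokens[pos]
            pos += 1
            if kind == "text":
                text = value
            elif kind == "open":
                # Toy behavior: a repeated tag name overwrites the earlier one.
                children[value] = parse_children(value)
            else:  # closing tag
                if value != closing_name:
                    raise ValueError(f"unexpected closing tag </{value}>")
                return children if children else text
        if closing_name is not None:
            raise ValueError(f"missing closing tag </{closing_name}>")
        return children

    return parse_children(None)


tokens = sketch_tokenize("<person><name>John Doe</name></person>")
print(tokens)
print(sketch_parse(tokens))  # {'person': {'name': 'John Doe'}}
```

Keeping the stages separate means each can be tested on its own: the tokenizer against raw strings, the parser against hand-written token lists.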
GravitasML has minimal dependencies:
We welcome contributions! GravitasML uses:
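1. Create a feature branch (git checkout -b feature/amazing-feature)
2. Commit your changes (git commit -m 'Add amazing feature')
3. Push the branch (git push origin feature/amazing-feature)
GravitasML is designed for simplicity. It currently does not support:
- Attributes (<tag attr="value">)
- Self-closing tags (<tag />)
For full XML processing, consider xml.etree.ElementTree or third-party libraries like lxml.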
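When attributes or self-closing tags are needed, the standard library's `xml.etree.ElementTree` handles both. The snippet below is a minimal illustration with a made-up document. Note that the Python documentation warns that the standard XML modules are not secure against maliciously constructed data, so this route is best reserved for trusted input.

```python
import xml.etree.ElementTree as ET

# A small document using both features GravitasML leaves out.
root = ET.fromstring('<tag attr="value"><child /></tag>')

print(root.tag)                 # tag
print(root.attrib)              # {'attr': 'value'}

child = root.find("child")      # self-closing element
print(child is not None)        # True
print(child.text)               # None (no text content)
```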
GravitasML is built on the principle that not every markup parsing task needs the complexity of full XML processing. Sometimes you just want to convert simple markup to Python dictionaries without the overhead of namespaces, DTDs, or complex validation rules.
GravitasML is licensed under the MIT License - see the LICENSE file for details.
Built by the AutoGPT Team and used in the AutoGPT project.
Simple markup parsing for modern Python applications.