LVTag Specification

Version 1.0
Created by Danslav Slavenskoj
Date: May 2025

Overview

The Language Variant Tag (LVTag) format is a systematic approach to language classification that extends the BCP 47 standard using private-use subtags. It enables precise identification of language varieties across multiple dimensions including formality, politeness, domain, and orthography.

Key Benefits

Classification Rigor: LVTag brings systematic organization to language tagging by providing clear, separate dimensions for different types of variation. Unlike existing subtags and systems that mix different categories at the same level, LVTag maintains strict separation between formality, politeness, domain, and other dimensions.

Standards Compatibility: LVTag is fully compliant with BCP 47 (RFC 5646) and works seamlessly with:

Technology Integration: LVTag tags can be used directly in:

Use Cases:

Rationale

While BCP 47 provides excellent support for identifying languages, scripts, and regions, it lacks standardized mechanisms for capturing sociolinguistic variation within a language. Current standards don’t address:

LVTag fills these gaps using BCP 47’s private-use extension mechanism (-x-), providing a systematic, machine-readable way to encode these critical dimensions of language variation while maintaining full backward compatibility.
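One practical consequence of this backward compatibility is that a consumer unaware of LVTag can fall back to the plain BCP 47 portion of a tag. The following Python sketch shows one minimal way to do that. The function name `base_tag` is hypothetical, and returning `und` (the BCP 47 code for an undetermined language) when nothing precedes `x` is an assumption made for illustration, not something this specification defines.

```python
def base_tag(tag: str) -> str:
    """Return the part of a tag that precedes the private-use singleton 'x'."""
    subtags = tag.split("-")
    lowered = [s.lower() for s in subtags]
    if "x" not in lowered:
        return tag  # no private-use section, nothing to strip
    prefix = subtags[: lowered.index("x")]
    # Assumption: a private-use-only tag such as "x-proto-ine" falls back to "und".
    return "-".join(prefix) if prefix else "und"


print(base_tag("ko-x-form-4-domain-business"))  # ko
```

Systems that do understand LVTag would keep the full tag; this fallback only illustrates that the part before `-x-` remains an ordinary language tag.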

Precise Language Classification

The advent of large language models and sophisticated NLP tools has made precise language variety classification not just useful but essential. Modern systems need to:

LVTag provides the granular metadata needed to understand not just what language is being used, but how it’s being used, enabling more nuanced and appropriate language processing pipelines.

Format Specification

Basic Structure

language-x-[classifier]-[value]-[classifier2]-[value2]...

Where:

Magic Tags

LVTag supports both long-form and short-form “magic” classifiers for flexibility:

| Long Form | Short Form | Description |
| --- | --- | --- |
| ortho | w | Orthographic variant |
| form | f | Formality level (1-5 scale) |
| polite | p | Politeness/respect level (1-5 scale) |
| domain | d | Specialized vocabulary or professional context |
| geo | g | Geographic or regional variety |
| proto | a | Proto-language or reconstructed language |
| hist | h | Historical period or stage of a language |
| genre | e | Text genre or literary style |
| medium | m | Communication medium (spoken, written, digital) |
| socio | s | Sociolect or social group variety |
| modality | o | Mode of language production |
| register | r | Linguistic register |
| pragma | u | Communicative function |
| temporal | t | Temporal marking |
| evidence | v | Information source |
| affect | k | Emotional tone |
| age | n | Age/generation variety |
| gender | i | Gender variety |
| expert | b | Expertise level |
| interact | 2 | Interactional structure |
| prosody | y | Prosodic features |
| lexical | l | Lexical density (0-100) |
| syntax | z | Syntactic complexity (0-100) |
| start | 0 | Start date (ISO 8601 without punctuation) |
| end | 1 | End date (ISO 8601 without punctuation) |
| taboo | j | Taboo/vulgar content level (0-5 scale) |
| conf | c | Confidence score (0-100) for previous tag |
| (none) | q, 3-9 | Reserved for future use |
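Since every classifier has two spellings, software that compares tags will usually want to normalize to one of them. The Python sketch below copies the short-to-long mapping from the table above; the helper name `to_long_form` and the decision to raise `ValueError` for reserved or unknown names are assumptions for illustration.

```python
# Short-form to long-form classifier names, copied from the Magic Tags table.
SHORT_TO_LONG = {
    "w": "ortho", "f": "form", "p": "polite", "d": "domain", "g": "geo",
    "a": "proto", "h": "hist", "e": "genre", "m": "medium", "s": "socio",
    "o": "modality", "r": "register", "u": "pragma", "t": "temporal",
    "v": "evidence", "k": "affect", "n": "age", "i": "gender", "b": "expert",
    "2": "interact", "y": "prosody", "l": "lexical", "z": "syntax",
    "0": "start", "1": "end", "j": "taboo", "c": "conf",
}
RESERVED = set("q3456789")  # q and 3-9 are reserved for future use


def to_long_form(classifier: str) -> str:
    """Return the long-form name for a classifier written in either form."""
    name = classifier.lower()
    if name in RESERVED:
        raise ValueError(f"{classifier!r} is reserved for future use")
    if name in SHORT_TO_LONG:
        return SHORT_TO_LONG[name]
    if name in SHORT_TO_LONG.values():
        return name  # already long form
    raise ValueError(f"unknown classifier: {classifier!r}")


print(to_long_form("f"))       # form
print(to_long_form("polite"))  # polite
```

Note that some short forms are digits (`0`, `1`, `2`), which are also legal values on the numeric scales, so a classifier can only be told apart from a value by its position in the tag.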

Classifiers

1. Orthography Classifier (ortho or w)

Identifies specific orthographic conventions or writing system variants beyond standard script tags.

Format:

Examples (combined with standard script tags):

2. Formality Classifier (form or f)

Identifies the formality level of language use.

Format:

Formality scale:

Examples:

3. Politeness Classifier (polite or p)

Identifies the politeness/respect level of language use.

Format:

Politeness scale:

Examples:

4. Domain Classifier (domain or d)

Identifies specialized vocabulary or professional context.

Format:

Examples:

5. Geographic Classifier (geo or g)

Identifies regional or geographic language varieties.

Format:

Examples:

6. Proto Classifier (proto or a)

Identifies proto-languages or reconstructed historical languages.

Format:

Rules:

Examples using ISO 639-5 codes:

Examples without ISO 639-5 codes (descriptive, longer than three characters):

Note:

7. Historic Classifier (hist or h)

Identifies historical periods or stages of a language.

Format:

Examples:

8. Genre Classifier (genre or e)

Identifies text genre or literary style.

Format:

Examples:

9. Medium Classifier (medium or m)

Identifies the communication medium.

Format:

Examples:

10. Socio Classifier (socio or s)

Identifies sociolect or social group varieties.

Format:

Examples:

11. Modality Classifier (modality or o)

Identifies the fundamental mode of language production.

Format:

Examples:

12. Register Classifier (register or r)

Identifies the linguistic register or functional variety of language use.

Format:

Examples:

13. Pragmatic Function Classifier (pragma or u)

Identifies the communicative function or speech act.

Format:

Examples:

14. Temporal Marking Classifier (temporal or t)

Identifies temporal aspects or tense usage patterns.

Format:

Examples:

15. Evidentiality Classifier (evidence or v)

Identifies information source marking.

Format:

Examples:

16. Affect/Emotion Classifier (affect or k)

Identifies emotional tone or affect.

Format:

Examples:

17. Age/Generation Classifier (age or n)

Identifies age-related or generational language varieties.

Format:

Examples:

18. Gender Classifier (gender or i)

Identifies gender-related language varieties.

Format:

19. Expertise Level Classifier (expert or b)

Identifies level of domain expertise on a 0-10 scale.

Format:

Expertise scale:

Examples:

20. Interactional Structure Classifier (interact or 2)

Identifies conversational or interactional patterns.

Format:

Examples:

21. Prosodic Features Classifier (prosody or y)

Identifies prosodic or suprasegmental features.

Format:

Examples:

22. Lexical Density Classifier (lexical or l)

Identifies lexical density as a numeric value (0-100).

Format:

Examples:

23. Syntactic Complexity Classifier (syntax or z)

Identifies syntactic complexity as a numeric value (0-100).

Format:

Examples:

24. Start Date Classifier (start or 0)

Identifies the start date of language use (ISO 8601 format without punctuation).

Format:

Date formats:

Examples:

25. End Date Classifier (end or 1)

Identifies the end date of language use (ISO 8601 format without punctuation).

Format:

Date formats:

Examples:
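As an illustration of how start and end values might be read, the Python sketch below accepts the three compact shapes named in the validation rules (YYYY, YYYYMM, YYYYMMDD). The function name `parse_lvtag_date` is hypothetical, and defaulting a missing month or day to 1 is an assumption; the specification does not say how partial dates should be expanded.

```python
from datetime import date


def parse_lvtag_date(value: str) -> date:
    """Parse a YYYY, YYYYMM, or YYYYMMDD value into a date."""
    if not (value.isascii() and value.isdigit() and len(value) in (4, 6, 8)):
        raise ValueError(f"not a YYYY, YYYYMM, or YYYYMMDD value: {value!r}")
    year = int(value[:4])
    # Assumption: missing month and day default to 1.
    month = int(value[4:6]) if len(value) >= 6 else 1
    day = int(value[6:8]) if len(value) == 8 else 1
    return date(year, month, day)  # raises ValueError for impossible dates


print(parse_lvtag_date("19991231"))  # 1999-12-31
```

The longest shape, YYYYMMDD, is exactly 8 characters, so it still fits within the subtag length limit.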

26. Taboo Classifier (taboo or j)

Identifies level of taboo, vulgar, or offensive content.

Format:

Examples:

27. Confidence Classifier (conf or c)

Indicates confidence score for the immediately preceding classifier.

Format:

Special behavior:

Examples:
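To make the "immediately preceding classifier" rule concrete, the Python sketch below walks a list of classifier/value pairs in tag order and attaches each confidence score to the entry before it. The function name `attach_confidence`, the dictionary layout, and the example pairs are hypothetical. Rejecting a conf with no preceding classifier, or two conf entries in a row, is an assumption and may differ from the special behavior this section defines.

```python
def attach_confidence(pairs):
    """Attach each conf/c score to the classifier that precedes it.

    `pairs` is a list of (classifier, value) tuples in tag order.
    """
    result = []
    for name, value in pairs:
        if name in ("conf", "c"):
            # Assumption: conf must follow a classifier that has no score yet.
            if not result or result[-1]["conf"] is not None:
                raise ValueError("conf must directly follow a classifier")
            score = int(value)
            if not 0 <= score <= 100:
                raise ValueError(f"confidence out of range: {score}")
            result[-1]["conf"] = score
        else:
            result.append({"classifier": name, "value": value, "conf": None})
    return result


# Hypothetical pairs, as they would appear in a tag like en-x-d-legal-c-85-f-2
for entry in attach_confidence([("d", "legal"), ("c", "85"), ("f", "2")]):
    print(entry)
```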

Multiple Classifications

LVTag supports multiple classifiers in a single tag to provide precise language identification. Both long and short forms can be mixed:

ko-x-form-4-domain-business
ko-x-f-4-d-business
ko-x-form-4-polite-2-domain-business
ko-x-f-4-p-2-d-business

The first two tags mark informal (4) Korean in a business context; the last two add very polite speech (2).
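Because classifiers and values alternate after `x`, a tag can be split positionally into pairs. The Python sketch below is one minimal way to do so, normalizing short forms to long forms so that mixed tags compare equal. The function name `parse_lvtag` and the returned shape are assumptions for illustration; the sketch does no value validation.

```python
LONG = ("ortho form polite domain geo proto hist genre medium socio modality "
        "register pragma temporal evidence affect age gender expert interact "
        "prosody lexical syntax start end taboo conf").split()
SHORT = "w f p d g a h e m s o r u t v k n i b 2 y l z 0 1 j c".split()
SHORT_TO_LONG = dict(zip(SHORT, LONG))  # same order as the Magic Tags table


def parse_lvtag(tag: str):
    """Split a tag into (prefix, [(long_classifier, value), ...])."""
    parts = tag.lower().split("-")
    if "x" not in parts:
        raise ValueError("no private-use section found")
    x_at = parts.index("x")
    prefix = "-".join(parts[:x_at])  # empty for tags such as x-proto-ine
    rest = parts[x_at + 1:]
    if not rest or len(rest) % 2:
        raise ValueError("classifiers and values must come in pairs")
    pairs = []
    for name, value in zip(rest[0::2], rest[1::2]):
        long_name = SHORT_TO_LONG.get(name, name)
        if long_name not in LONG:
            raise ValueError(f"unknown classifier: {name!r}")
        pairs.append((long_name, value))
    return prefix, pairs


print(parse_lvtag("ko-x-f-4-polite-2-d-business"))
# ('ko', [('form', '4'), ('polite', '2'), ('domain', 'business')])
```

With this normalization, `ko-x-form-4-polite-2-domain-business` and `ko-x-f-4-p-2-d-business` parse to the same result.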

Valid Values

Note: All values must be 8 characters or shorter to comply with BCP 47 subtag length restrictions. While specific values for many classifiers are to be established through expert usage and community consensus, the numeric scales, date formats, and basic values listed below are defined in this standard.

Formality Scale (Universal)

| Level | Description | Examples |
| --- | --- | --- |
| 1 | Most formal | Legal documents, official ceremonies, academic papers |
| 2 | Formal | Business letters, news articles, presentations |
| 3 | Neutral | Standard conversation, email, general writing |
| 4 | Informal | Casual conversation, personal blogs, text messages |
| 5 | Most casual | Slang, intimate conversation, social media |

Politeness Scale (Universal)

| Level | Description | Examples |
| --- | --- | --- |
| 1 | Most respectful | Royal address, religious leaders, elderly respect |
| 2 | Very polite | Customer service, formal meetings, teachers |
| 3 | Polite/neutral | Standard interactions, colleagues |
| 4 | Familiar | Friends, peers, casual acquaintances |
| 5 | Intimate/plain | Close family, intimate partners |
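Formality and politeness are independent scales, so any formality level can combine with any politeness level. The short Python sketch below looks up both descriptions from the two tables above; the function name `describe` is hypothetical.

```python
# Level descriptions copied from the Formality and Politeness scales.
FORMALITY = {1: "Most formal", 2: "Formal", 3: "Neutral",
             4: "Informal", 5: "Most casual"}
POLITENESS = {1: "Most respectful", 2: "Very polite", 3: "Polite/neutral",
              4: "Familiar", 5: "Intimate/plain"}


def describe(formality: int, politeness: int) -> str:
    """Describe a form/polite pair; raises KeyError outside the 1-5 range."""
    return f"{FORMALITY[formality]}, {POLITENESS[politeness]}"


print(describe(4, 2))  # Informal, Very polite
print(describe(1, 4))  # Most formal, Familiar
```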

Expertise Scale (Universal)

| Level | Description |
| --- | --- |
| 0 | No knowledge |
| 1-2 | Beginner |
| 3-4 | Intermediate |
| 5-6 | Advanced |
| 7-8 | Expert |
| 9-10 | Master/Authority |

Taboo Scale (Universal)

| Level | Description |
| --- | --- |
| 0 | No taboo content |
| 1 | Mild taboo |
| 2 | Light taboo |
| 3 | Moderate taboo |
| 4 | High taboo |
| 5 | Extreme taboo |

Lexical Density Scale (Universal)

| Level | Description |
| --- | --- |
| 0-20 | Very low density |
| 21-40 | Low density |
| 41-60 | Moderate density |
| 61-80 | High density |
| 81-100 | Very high density |

Syntactic Complexity Scale (Universal)

| Level | Description |
| --- | --- |
| 0-20 | Very simple |
| 21-40 | Simple |
| 41-60 | Moderate complexity |
| 61-80 | Complex |
| 81-100 | Very complex |
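The lexical density and syntactic complexity scales share the same band boundaries (0-20, 21-40, 41-60, 61-80, 81-100), so a single index calculation can serve both. The Python sketch below shows one way to map a score to its band; the function names are hypothetical.

```python
# Band labels copied from the two 0-100 scales, lowest band first.
LEXICAL_BANDS = ["Very low density", "Low density", "Moderate density",
                 "High density", "Very high density"]
SYNTAX_BANDS = ["Very simple", "Simple", "Moderate complexity",
                "Complex", "Very complex"]


def band_index(score: int) -> int:
    """Map 0-20 -> 0, 21-40 -> 1, 41-60 -> 2, 61-80 -> 3, 81-100 -> 4."""
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    return max(0, (score - 1) // 20)


def lexical_band(score: int) -> str:
    return LEXICAL_BANDS[band_index(score)]


def syntax_band(score: int) -> str:
    return SYNTAX_BANDS[band_index(score)]


print(lexical_band(21))  # Low density
print(syntax_band(100))  # Very complex
```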

Domain Values

| Value | Description |
| --- | --- |
| legal | Legal terminology |
| med | Medical terminology |
| tech | Technical/IT |
| business | Business/corporate |
| fin | Finance/banking |
| acad | Academic/scholarly |
| sci | Scientific/research |

Implementation Examples

Single Classifier (Long Form)

# Most formal Korean
ko-x-form-1

# Very polite Japanese
ja-x-polite-2

# Legal English
en-x-domain-legal

# Gyeongsang Korean
ko-x-geo-gyeong

# Proto-Indo-European
x-proto-ine

Single Classifier (Short Form)

# Most formal Korean
ko-x-f-1

# Very polite Japanese
ja-x-p-2

# Legal English
en-x-d-legal

# Gyeongsang Korean
ko-x-g-gyeong

# Proto-Indo-European
x-a-ine

Multiple Classifiers

# Informal but polite Korean business language
ko-x-form-4-polite-2-domain-business
ko-x-f-4-p-2-d-business

# Most formal and most respectful Japanese medical language
ja-x-form-1-polite-1-domain-med
ja-x-f-1-p-1-d-med

# Southern Vietnamese with neutral formality, polite speech, technical domain
vi-x-geo-southern-form-3-polite-2-domain-tech
vi-x-g-southern-f-3-p-2-d-tech

# Complex classification with multiple dimensions
en-x-h-middle-e-poetry-m-written-f-1
ja-x-f-2-p-1-d-med-h-kobun-m-written

# Language varieties showing formality/politeness distinction
ko-x-f-5-p-2  # Very casual but polite (to older friend)
ko-x-f-1-p-4  # Very formal but familiar (written to peer)
ja-x-f-4-p-1  # Casual formality but highest respect
en-x-f-5-j-4  # Very casual English with high taboo level
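Tags like the ones above can also be assembled programmatically. The Python sketch below builds a tag from a language prefix and an ordered mapping of long-form classifier names to values, emitting either long or short forms. The function name `build_lvtag` and its parameters are assumptions for illustration. Because it takes a dictionary, it cannot express a conf classifier that appears more than once.

```python
LONG = ("ortho form polite domain geo proto hist genre medium socio modality "
        "register pragma temporal evidence affect age gender expert interact "
        "prosody lexical syntax start end taboo conf").split()
SHORT = "w f p d g a h e m s o r u t v k n i b 2 y l z 0 1 j c".split()
LONG_TO_SHORT = dict(zip(LONG, SHORT))  # same order as the Magic Tags table


def build_lvtag(language: str, classifiers: dict, short: bool = False) -> str:
    """Build a tag; pass an empty language for private-use-only tags."""
    subtags = []
    for name, value in classifiers.items():
        if name not in LONG_TO_SHORT:
            raise ValueError(f"unknown classifier: {name!r}")
        text = str(value).lower()
        # Each subtag after x must be 1-8 ASCII letters or digits.
        if not (1 <= len(text) <= 8 and text.isascii() and text.isalnum()):
            raise ValueError(f"invalid value for {name}: {value!r}")
        subtags += [LONG_TO_SHORT[name] if short else name, text]
    if not subtags:
        raise ValueError("at least one classifier is required")
    head = [language.lower()] if language else []
    return "-".join(head + ["x"] + subtags)


spec = {"form": 4, "polite": 2, "domain": "business"}
print(build_lvtag("ko", spec))              # ko-x-form-4-polite-2-domain-business
print(build_lvtag("ko", spec, short=True))  # ko-x-f-4-p-2-d-business
```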

Use Cases

  1. Language Learning Applications
    • Teach appropriate register for different social contexts
    • Provide domain-specific vocabulary training
  2. Machine Translation
    • Maintain register consistency in translations
    • Apply domain-specific terminology
  3. Content Classification
    • Automatically categorize text by formality and domain
    • Route content to appropriate reviewers or systems
  4. Corpus Linguistics
    • Build tagged corpora for linguistic research
    • Study register and domain variation

Validation Rules

  1. Subtag Length: Each subtag after x- must be 8 characters or fewer
  2. Order: Classifiers can appear in any order after x-, except that conf must directly follow the classifier it qualifies (see rule 8)
  3. Uniqueness: Each classifier type should appear only once per tag (except conf which can appear multiple times)
  4. Case: Tags should be lowercase (case-insensitive per BCP 47)
  5. Magic Tags: Short form tags are single characters; q, 3-9 are reserved for future use
  6. Mixing: Long and short forms can be mixed within the same tag
  7. Proto Tags: Must start with x- and SHOULD use ISO 639-5 codes when available (e.g., x-proto-sla not x-proto-slavic)
  8. Confidence: The conf/c classifier applies to the immediately preceding classifier
  9. Numeric Values: Must be within defined ranges (1-5 for formality and politeness, 0-5 for taboo, 0-10 for expertise, 0-100 for lexical density, syntactic complexity, and confidence)
  10. Date Format: Dates use ISO 8601 without punctuation (YYYY, YYYYMM, or YYYYMMDD)
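Several of the rules above can be checked mechanically. The Python sketch below covers subtag length, uniqueness, reserved codes, the proto prefix, conf placement, numeric ranges, and the date shape; it returns a list of problems rather than raising. The function name `validate_lvtag` and the message wording are hypothetical. The sketch checks only the shape of dates, not calendar validity, and it does not validate the language prefix before `x`.

```python
import re

LONG = ("ortho form polite domain geo proto hist genre medium socio modality "
        "register pragma temporal evidence affect age gender expert interact "
        "prosody lexical syntax start end taboo conf").split()
SHORT = "w f p d g a h e m s o r u t v k n i b 2 y l z 0 1 j c".split()
SHORT_TO_LONG = dict(zip(SHORT, LONG))
RESERVED = set("q3456789")
# Inclusive numeric ranges taken from the scales in this specification.
RANGES = {"form": (1, 5), "polite": (1, 5), "taboo": (0, 5), "expert": (0, 10),
          "lexical": (0, 100), "syntax": (0, 100), "conf": (0, 100)}


def validate_lvtag(tag: str) -> list[str]:
    """Return rule violations found in a tag; an empty list means none."""
    errors = []
    parts = tag.lower().split("-")  # rule 4: tags are case-insensitive
    if "x" not in parts:
        return ["no private-use singleton 'x' found"]
    rest = parts[parts.index("x") + 1:]
    if not rest or len(rest) % 2:
        errors.append("classifiers and values must come in pairs")
    for sub in rest:  # rule 1
        if not re.fullmatch(r"[a-z0-9]{1,8}", sub):
            errors.append(f"subtag {sub!r} is not 1-8 alphanumeric characters")
    seen = []
    for raw, value in zip(rest[0::2], rest[1::2]):
        if raw in RESERVED:  # rule 5
            errors.append(f"classifier {raw!r} is reserved")
            continue
        name = SHORT_TO_LONG.get(raw, raw)
        if name not in LONG:
            errors.append(f"unknown classifier {raw!r}")
            continue
        if name == "conf":
            if not seen:  # rule 8
                errors.append("conf has no preceding classifier")
        elif name in seen:  # rule 3
            errors.append(f"classifier {name!r} appears more than once")
        seen.append(name)
        if name in RANGES:  # rule 9
            low, high = RANGES[name]
            if not (re.fullmatch(r"[0-9]+", value) and low <= int(value) <= high):
                errors.append(f"{name} value {value!r} is outside {low}-{high}")
        if name in ("start", "end"):  # rule 10, shape only
            if not re.fullmatch(r"[0-9]{4}([0-9]{2}){0,2}", value):
                errors.append(f"{name} value {value!r} is not YYYY[MM[DD]]")
    if "proto" in seen and parts[0] != "x":  # rule 7
        errors.append("proto tags must start with x-")
    return errors


print(validate_lvtag("ko-x-f-4-p-2-d-business"))  # []
print(validate_lvtag("ko-x-f-6"))
```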

Compatibility

LVTag format is fully compatible with:

Benefits

  1. Precision: Enables fine-grained language variety identification
  2. Extensibility: New registers and domains can be added
  3. Standards-based: Built on established BCP 47 private-use mechanism
  4. Machine-readable: Systematic format enables automated processing
  5. Human-readable: Clear, descriptive subtags
  6. Flexibility: Support for both verbose long-form and concise short-form tags
  7. Brevity: Short magic tags enable compact representation while maintaining clarity

Future Extensions

LVTag is designed to evolve with the needs of the language technology community. We welcome suggestions for new classifiers, improvements to existing ones, and real-world implementation feedback.

To propose extensions or contribute to the specification:

Reserved single-character codes (q, 3-9) are available for future standardized extensions.

License and Patent Grant

This specification is released under the CC0 1.0 Universal (Public Domain Dedication).

Why CC0: To ensure maximum adoption and implementation freedom, LVTag is placed in the public domain. This means:

Patent Grant: Any patents covering the LVTag specification are hereby licensed royalty-free for any implementation that complies with this specification.

No Endorsement: Use of LVTag does not imply endorsement by the specification authors.

To the extent possible under law, Danslav Slavenskoj has waived all copyright and related or neighboring rights to the Language Variant Tag (LVTag) Format Specification. This work is published from: United States of America.