Make Revisions Understandable - A Survey of Edit Intentions, Methods, and Applications

Text revision is a core process in document creation, capturing how authors iteratively refine, reorganize, and improve written content. With the increasing availability of large-scale revision histories from platforms such as Wikipedia and arXiv, NLP research has begun to move beyond modeling what changes are made to understanding why they are made, i.e., the underlying edit intentions. To our knowledge, this is the first survey that synthesizes text revision research through the lens of edit intentions, providing a unified view of datasets, taxonomies, identification methods, and applications. We review prior work across the full revision workflow, including revision corpus construction, edit intention taxonomy design, and edit intention identification. We further categorize representative datasets and methods, summarize downstream applications such as writing assistance and document edit summarization, and highlight key open research directions.

Important

Good news! 🎉 Our survey paper has been successfully accepted by Findings of ACL 2026. 🔥🔥🔥

A curated collection of papers and resources on edit intentions behind text revisions.

Please refer to our survey "Make Revisions Understandable - A Survey of Edit Intentions, Methods, and Applications" for the detailed contents.

Please let us know if you discover any mistakes or have suggestions by emailing us: fangping.lan@temple.edu

Taxonomy


Paper List

Languages

Multilingual

Granularities

Words/phrases-level

Sentence-level

(2014)Sentence-level rewriting detection
(2015)Annotation and Classification of Argumentative Writing Revisions
(2021)Text Editing by Command

Datasets

Sentence-level

(2014)A Corpus of Sentence-level Revisions in Academic Writing: A Step towards Understanding Statement Strength in Communication
- Link: https://chenhaot.com/pages/statement-strength.html
- Statement strength differences annotation
(2017)A Corpus of Annotated Revisions for Studying Argumentative Writing
- Manually annotated revision corpus
  - Drafts are manually aligned at the sentence level
  - the writer’s purpose for each revision is annotated
  - simulate instructor feedback
- Link: http://argrewrite.cs.pitt.edu/
(2018)WikiAtomicEdits: A Multilingual Corpus of Wikipedia Edits for Modeling Language and Discourse
- Multilingual
- Atomic insertion edits: instances in which an editor has inserted a single, contiguous span of text into an existing complete sentence
- Link: https://github.com/google-research-datasets/wiki-atomic-edits
(2020) wikiHowToImprove: A Resource and Analyses on Edits in Instructional Texts
- Instructional texts
- Link: https://github.com/irshadbhat/wikiHowToImprove
(2020)Towards Modeling Revision Requirements in wikiHow Instructions
- instruction texts
- Link: https://github.com/irshadbhat/wikiHow_MoRR

Document-level

Multi-level, multi-domain

(2022)Understanding Iterative Revision from Human-Written Text
- Link: https://github.com/vipulraheja/IteraTeR
- the first large-scale, multi-domain, edit-intention annotated corpus of iteratively revised text
- sentence-level and paragraph-level
- contains human annotations and automatic annotations
(2022)ArXivEdits: Understanding the Human Revision Process in Scientific Writing
- document-level, sentence-level and word-level
- annotated intention for sentence-level revisions
- Link: https://tiny.one/arxivedits
(2022)ArgRewrite V.2: an annotated argumentative revisions corpus
- Link: http://argrewrite.cs.pitt.edu/
- sentence-level, subsentential level
(2024) Re3: A Holistic Framework and Dataset for Modeling Collaborative Document Revision
- Link: https://github.com/UKPLab/re3
- section, paragraph, sentence, and subsentence

Multi-label for one revision

Edits with other features

Including layout

(2012) A Corpus-Based Study of Edit Categories in Featured and Non-Featured Wikipedia Articles
- Link: https://tudatalib.ulb.tu-darmstadt.de/handle/tudatalib/2354
- do not parse the revision text, as we want to include both edits affecting the content and edits affecting the layout
(2013) Automatically Classifying Edit Categories in Wikipedia Revisions

Description of edits

(2019) Modeling the Relationship between User Comments and Edits in Document Revision
- Link: https://github.com/microsoft/WikiCommentEdit
(2022)One Document, Many Revisions: A Dataset for Classification and Description of Edit Intents
- Edits with a free-form description of the edit
- Link: https://tinyurl.com/editsumm
- Distant Supervision for edit-comment generation
(2024) Re3: A Holistic Framework and Dataset for Modeling Collaborative Document Revision
- Link: https://github.com/UKPLab/re3
- section, paragraph, sentence, and subsentence

Response

(2024) Re3: A Holistic Framework and Dataset for Modeling Collaborative Document Revision
- Link: https://github.com/UKPLab/re3
- section, paragraph, sentence, and subsentence

Sentence Segmentation

Sentence segmentation tools:
- https://github.com/zaemyung/sentsplit
  - CRF model and regex rules
- https://github.com/irshadbhat/polyglot-tokenizer
  - Tokenizer

Revised Content Alignment

Document(Page) alignment

(2023) SWIPE: A Dataset for Document-Level Simplification of Wikipedia Pages
- the NLI-based SummaC model
- document-level

Paragraph alignment algorithm based on Jaccard similarity

(2022)ArXivEdits: Understanding the Human Revision Process in Scientific Writing
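
A minimal sketch of Jaccard-based paragraph alignment, for illustration only. It is not the ArXivEdits implementation: the word-set tokenization, the greedy best-match strategy, and the `threshold=0.5` cutoff are all assumptions, and the function names are hypothetical.

```python
import re


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the lowercase word sets of two texts."""
    set_a = set(re.findall(r"\w+", a.lower()))
    set_b = set(re.findall(r"\w+", b.lower()))
    if not set_a and not set_b:
        return 1.0  # assumption: two empty texts count as identical
    return len(set_a & set_b) / len(set_a | set_b)


def align_paragraphs(old, new, threshold=0.5):
    """Pair each old paragraph with its most similar new paragraph.

    Returns (old_index, new_index, score) triples; pairs scoring below
    the (assumed) threshold are left unaligned.
    """
    if not new:
        return []
    pairs = []
    for i, p_old in enumerate(old):
        scores = [jaccard(p_old, p_new) for p_new in new]
        j = max(range(len(new)), key=scores.__getitem__)
        if scores[j] >= threshold:
            pairs.append((i, j, scores[j]))
    return pairs


old_paragraphs = [
    "We propose a new model for text revision.",
    "Results are shown in Table 2.",
]
new_paragraphs = [
    "Results are shown in Table 3.",
    "We propose a new neural model for text revision.",
]
for i, j, score in align_paragraphs(old_paragraphs, new_paragraphs):
    print(f"old[{i}] -> new[{j}] (Jaccard = {score:.2f})")
```

Because each old paragraph picks its best match independently, two old paragraphs can map to the same new paragraph; a stricter one-to-one alignment would need an extra assignment step.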

Sentence alignment is not necessarily one-to-one. It can also be one-to-many (Consolidation) and many-to-one (Distribution).

Sentence Alignment:

with TF*IDF score
- (2014)Sentence-level rewriting detection
  - A logistic binary classifier
- (2014)A Corpus of Sentence-level Revisions in Academic Writing: A Step towards Understanding Statement Strength in Communication
  - a dynamic programming algorithm similar to that of Barzilay and Elhadad (2003)
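
As a rough illustration of the TF*IDF score used above, the sketch below computes TF-IDF cosine similarities between old and new sentences with the standard library only. It covers the scoring step alone; the classifier and the dynamic programming alignment from the listed papers are not reproduced. The smoothed IDF formula, the tokenizer, and all function names are assumptions.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase word tokens (a simplifying assumption)."""
    return re.findall(r"\w+", sentence.lower())


def tfidf_vectors(sentences):
    """Sparse TF-IDF vectors (term -> weight), one per sentence."""
    docs = [Counter(tokenize(s)) for s in sentences]
    n = len(docs)
    df = Counter(term for doc in docs for term in doc)
    # Smoothed IDF so that terms occurring everywhere keep a small weight.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in doc.items()} for doc in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def similarity_matrix(old, new):
    """sim[i][j] is the TF-IDF cosine between old[i] and new[j]."""
    vectors = tfidf_vectors(old + new)
    old_v, new_v = vectors[: len(old)], vectors[len(old):]
    return [[cosine(u, v) for v in new_v] for u in old_v]


old_sentences = ["The model performs well.", "We thank the reviewers."]
new_sentences = [
    "The model performs very well.",
    "We thank the anonymous reviewers.",
]
for row in similarity_matrix(old_sentences, new_sentences):
    print([round(score, 2) for score in row])
```

The resulting matrix can serve as a feature for a pairwise classifier or as the score table of a sequence alignment.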

A binary classifier with sentence-level BLEU-4 score
- (2018) WikiAtomicEdits: A Multilingual Corpus of Wikipedia Edits for Modeling Language and Discourse
  - Precision-oriented sequence alignment
- (2022)One Document, Many Revisions: A Dataset for Classification and Description of Edit Intents
- (2021)Text Editing by Command
With Jaccard similarity
- (2015) Problems in Current Text Simplification Research: New Data Can Help
A neural CRF sentence alignment model
- (2020)Neural CRF Model for Sentence Alignment in Text Simplification
- (2022)ArXivEdits: Understanding the Human Revision Process in Scientific Writing
  - fine-tuned semi-CRF with their dataset
With Levenshtein distance, fuzzy string matching, and semantic similarity measured by SBERT
- (2024) Re3: A Holistic Framework and Dataset for Modeling Collaborative Document Revision
Bipartite graph with
- An asymmetrical version of the maximum alignment metric
With BERTScore
- (2022)Verba Volant, Scripta Volant: Understanding Post-publication Title Changes in News Outlets
  - BERT-based word alignment score
  - Distinguish whether it is a minor update or a complete rewrite
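
Levenshtein distance is one of the similarity signals listed above. The sketch below shows the standard dynamic programming computation and a length-normalized similarity derived from it. It is a generic illustration, not code from any of the listed papers; the normalization by the longer string is an assumption.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of character insertions, deletions, and
    substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or keep when equal
            ))
        prev = curr
    return prev[-1]


def normalized_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


print(levenshtein("kitten", "sitting"))
print(round(normalized_similarity("kitten", "sitting"), 2))
```

Surface measures like this work for light edits; heavily paraphrased sentences are the case where semantic similarity (for example, SBERT embeddings) is needed.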

Word Alignment

(2020)A Supervised Word Alignment Method based on Cross-Language Span Prediction using Multilingual BERT
(2021)Neural semi-Markov CRF for Monolingual Word Alignment
- neural semi-CRF

Edit Extraction

String-matching based diff algorithm
- dynamic programming for finding the longest common subsequences between two strings regardless of semantic meaning
- Diff algorithm has many implementations with different heuristics for post-processing
- (2017) Identifying Semantic Edit Intentions from Revisions in Wikipedia
- (2022) Understanding Iterative Revision from Human-Written Text
  - latexdiff package
Treat it as span alignment using simple heuristics
- (2022)ArXivEdits: Understanding the Human Revision Process in Scientific Writing
  - fine-tuned semi-CRF with their dataset
  - fine-tuned QA-Aligner
Generative model
- (2024) Re3: A Holistic Framework and Dataset for Modeling Collaborative Document Revision
  - Llama2-70B with in-context learning (ICL) and chain-of-thought (CoT) prompting
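
The string-matching diff described at the top of this section can be sketched as a longest-common-subsequence (LCS) computation over tokens. This is a bare illustration: whitespace tokenization, the tie-breaking rule (delete before insert), and the function names are assumptions, and real diff tools add post-processing heuristics that are omitted here.

```python
def lcs_table(a, b):
    """table[i][j] = length of the LCS of a[i:] and b[j:]."""
    m, n = len(a), len(b)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    return table


def diff_tokens(old: str, new: str):
    """Return ("keep" | "delete" | "insert", token) operations."""
    a, b = old.split(), new.split()
    table = lcs_table(a, b)
    ops, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            ops.append(("keep", a[i]))
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            ops.append(("delete", a[i]))  # ties resolved as deletions
            i += 1
        else:
            ops.append(("insert", b[j]))
            j += 1
    # Whatever remains on either side is a pure deletion or insertion.
    ops.extend(("delete", token) for token in a[i:])
    ops.extend(("insert", token) for token in b[j:])
    return ops


for op, token in diff_tokens("the cat sat", "the black cat sat down"):
    print(f"{op:6} {token}")
```

The matching is purely lexical: a synonym substitution appears as a deletion plus an insertion, which is the limitation that alignment-based and generative extraction methods try to address.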

Revision Classification

Binary Classification

(2015)Annotation and Classification of Argumentative Writing Revisions
- Random Forest of the Weka toolkit as classifier
  - for factual or fluency revision
  - for each revision purpose category
- Revision Metrics
  - unweighted precision, recall and F-score
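
A small sketch of unweighted precision, recall, and F-score, under the assumption that "unweighted" means every class contributes equally to the average regardless of its frequency (macro averaging). The label names and the function name are hypothetical and not taken from the paper.

```python
def macro_prf(gold, pred):
    """Per-class precision, recall, and F1, averaged with equal weight
    per class."""
    labels = sorted(set(gold) | set(pred))
    precisions, recalls, f_scores = [], [], []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        precisions.append(precision)
        recalls.append(recall)
        f_scores.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f_scores) / n


# Hypothetical gold and predicted revision labels.
gold = ["content", "surface", "surface", "content"]
pred = ["content", "surface", "content", "content"]
precision, recall, f1 = macro_prf(gold, pred)
print(f"P = {precision:.3f}, R = {recall:.3f}, F = {f1:.3f}")
```

Equal per-class weighting keeps a frequent revision category from dominating the score, which matters when the categories are imbalanced.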

Multi-class classification

Random Forest classifier

(2016)ArgRewrite: A web-based revision assistant for argumentative writings
- based on their work, a multi-class Random Forest classifier was trained to automatically predict the revision purpose type for each extracted revision.

BERT

(2020)Towards Modeling Revision Requirements in wikiHow Instructions

RoBERTa-based classifier

(2022)Understanding Iterative Revision from Human-Written Text
- A RoBERTa-large model from Huggingface transformers which has 354 million parameters

PURE model

(2022)ArXivEdits: Understanding the Human Revision Process in Scientific Writing
- Entity model(BERT) as encoder
- Relation model to predict the relation between two spans

Sequence labeling

(2016)Using Context to Predict the Purpose of Argumentative Writing Revisions

Multi-label classification

(2013) Automatically Classifying Edit Categories in Wikipedia Revisions
1. transform multi-label classification to one or more single-label classification tasks
2. Hierarchy of multi-label classifiers HOMER
3. Random k-labelsets RAKEL
- Evaluation metrics:
  - example-based (weighting each edit equally) and label-based (weighting each edit category equally) measures
(2016) Who Did What: Editor Role Identification in Wikipedia
- RAkEL
- MLkNN classifier
(2017)Identifying Semantic Edit Intentions from Revisions in Wikipedia
- Using four sets of features
- extract input features using Revision Scoring package
- considered solving the task both with single-label classification algorithms and by transforming it into one or more single-label classification tasks
  - Mulan (Mulan: A Java library for multi-label learning)
  - Random k-labelsets RAKEL
  - MLkNN classifier
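
To make two ideas from this section concrete, the sketch below shows (a) the simplest transformation of a multi-label task into single-label tasks, one binary target per label, and (b) the difference between example-based and label-based F1. It is a generic illustration and does not reproduce Mulan, HOMER, RAkEL, or MLkNN; the label names and the convention for empty label sets are assumptions.

```python
def binary_relevance(label_sets, all_labels):
    """Transform multi-label targets into one binary target list per label."""
    return {
        label: [int(label in labels) for labels in label_sets]
        for label in all_labels
    }


def f1_from_counts(tp, fp, fn):
    denominator = 2 * tp + fp + fn
    # Assumption: nothing to predict and nothing predicted counts as 1.0.
    return 2 * tp / denominator if denominator else 1.0


def example_based_f1(gold, pred):
    """Average F1 over edits, so every edit has equal weight."""
    scores = [
        f1_from_counts(len(g & p), len(p - g), len(g - p))
        for g, p in zip(gold, pred)
    ]
    return sum(scores) / len(scores)


def label_based_f1(gold, pred, all_labels):
    """Average F1 over categories, so every category has equal weight."""
    scores = []
    for label in all_labels:
        tp = sum(label in g and label in p for g, p in zip(gold, pred))
        fp = sum(label not in g and label in p for g, p in zip(gold, pred))
        fn = sum(label in g and label not in p for g, p in zip(gold, pred))
        scores.append(f1_from_counts(tp, fp, fn))
    return sum(scores) / len(scores)


# Hypothetical edit categories for three edits.
categories = ["grammar", "vandalism", "wikify"]
gold_sets = [{"grammar", "wikify"}, {"vandalism"}, {"grammar"}]
pred_sets = [{"grammar"}, {"vandalism"}, {"grammar", "wikify"}]

print(binary_relevance(gold_sets, categories))
print(round(example_based_f1(gold_sets, pred_sets), 3))
print(round(label_based_f1(gold_sets, pred_sets, categories), 3))
```

In this toy data the rare `wikify` category is never predicted correctly, which lowers the label-based score more than the example-based one.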

Generative model

T5 model

(2022)ArXivEdits: Understanding the Human Revision Process in Scientific Writing
- Seq-to-seq model
(2023) SWIPE: A Dataset for Document-Level Simplification of Wikipedia Pages
- fine-tune RoBERTa-large and BART-Large models

Llama2-70B with ICL and CoT

(2024) Re3: A Holistic Framework and Dataset for Modeling Collaborative Document Revision
- Selecting samples using RoBERTa embeddings

Applications

Interactive text writing

(2021)Text Editing by Command

Iterative text rewriting

(2022)Understanding Iterative Revision from Human-Written Text
- Edit-based model: FELIX
- Generative model: BART and PEGASUS
- Evaluation metrics: SARI, BLEU, and ROUGE-L

Student argumentative writing assistance system

Text simplification

Document edit summarization

(2024) Re3: A Holistic Framework and Dataset for Modeling Collaborative Document Revision
- GPT-4 in a zero-shot setting

Fact-focused sentence modification

(2020)Automatic Fact-Guided Sentence Modification

