Compare Text Differences

Paste original text on the left and new text on the right. Click “Compare” to highlight differences.

How Does Text Comparison (Diffing) Work?

Text comparison, often called "diffing," is the process of identifying the differences between two versions of a text document or piece of code. It's a fundamental tool in programming, document management, and version control, allowing users to quickly see what has changed. Under the hood, it relies on a small set of well-studied sequence-comparison algorithms.

The Core Idea: Finding the Longest Common Subsequence

Most diff algorithms, including the one used in this tool, are based on finding the **Longest Common Subsequence (LCS)**. Imagine you have two sequences (in our case, lines or words of text). An LCS is a longest possible sequence of items that appear in the same order in both, though not necessarily contiguously; when several candidates tie for longest, any one of them will do.

For example, given "ABCDEFG" and "AXBCYDG", the LCS is "ABCDG".
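A minimal Python sketch of this idea, using the textbook dynamic-programming table rather than whatever this tool actually runs. The function name `longest_common_subsequence` is made up for illustration.

```python
def longest_common_subsequence(a, b):
    """Return one longest common subsequence of sequences a and b as a list."""
    rows, cols = len(a), len(b)
    # table[i][j] holds the LCS length of the suffixes a[i:] and b[j:].
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows - 1, -1, -1):
        for j in range(cols - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])

    # Walk the table from the top-left corner, collecting matched items.
    result = []
    i = j = 0
    while i < rows and j < cols:
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1  # skipping a[i] loses nothing
        else:
            j += 1  # skipping b[j] loses nothing
    return result


print("".join(longest_common_subsequence("ABCDEFG", "AXBCYDG")))  # ABCDG
```

The same function works on lists of words or lines, since it only relies on indexing and equality.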

Once the LCS is found, the differences become apparent:

  • Anything in the original text that is *not* part of the LCS is a **deletion**.
  • Anything in the new text that is *not* part of the LCS is an **addition**.
  • Parts that are common (the LCS) are considered unchanged.
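To make these three rules concrete, here is a small Python sketch that takes an already-computed LCS and sorts the remaining items into deletions and additions. The helper name `split_by_lcs` is hypothetical, and the inputs reuse the example strings given earlier.

```python
def split_by_lcs(original, new, lcs):
    """Return (deletions, additions) given a common subsequence of both inputs."""
    deletions, additions = [], []
    i = j = 0
    for common in lcs:
        # Items skipped in the original before the next common item are deletions.
        while original[i] != common:
            deletions.append(original[i])
            i += 1
        # Items skipped in the new text before the next common item are additions.
        while new[j] != common:
            additions.append(new[j])
            j += 1
        # Both sides now point at the common item; it is unchanged.
        i += 1
        j += 1
    # Whatever remains after the last common item is deleted or added.
    deletions.extend(original[i:])
    additions.extend(new[j:])
    return deletions, additions


print(split_by_lcs("ABCDEFG", "AXBCYDG", "ABCDG"))
# (['E', 'F'], ['X', 'Y'])
```

The sketch assumes `lcs` really is a subsequence of both inputs; it does not validate that.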

A Simplified Step-by-Step Process (for words or lines):

  1. Tokenization: The first step is to break down the input texts into smaller units, or "tokens." This tool operates on a "word" basis, meaning it compares individual words. Other diff tools might compare character by character, or line by line.
    • Example: "Hello world" and "Hi there world" would be tokenized into words: ["Hello", "world"] and ["Hi", "there", "world"].
  2. Comparison Algorithm: The textbook way to find the LCS is dynamic programming, which builds a table tracking the best match for every pair of positions in the two token sequences. Practical diff tools often use faster algorithms, such as Myers' diff algorithm or the Hunt-Szymanski algorithm, which reach an equivalent result without filling in the whole table.
    • The problem can equally be framed in terms of "costs": matches are free, while each insertion or deletion costs one. The algorithm looks for the path through the comparison table with the lowest total cost (i.e., the fewest changes), which is also the path that keeps the most tokens in common.
  3. Identifying Changes: Once the optimal path is determined, the algorithm can retrace its steps to identify whether a token was:
    • Removed: Present in the original text but not in the new text (marked in red/strike-through in this tool).
    • Added: Present in the new text but not in the original text (marked in green).
    • Unchanged: Present in both, and part of the common sequence.
  4. Output Formatting: Finally, the differences are presented in a user-friendly format, often side-by-side or inline, with visual cues (like colors) to highlight the changes.
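The four steps above can be sketched end to end in Python. This is an illustrative word-level diff under simple assumptions: whitespace tokenization, a full cost table, and made-up `[-...-]` and `{+...+}` text markers standing in for colors. The function names are hypothetical and this is not the tool's actual code.

```python
def tokenize(text):
    """Step 1: split text into word tokens on whitespace."""
    return text.split()


def diff_tokens(old, new):
    """Steps 2 and 3: build a cost table, then trace the cheapest path."""
    n, m = len(old), len(new)
    # cost[i][j] = fewest insertions + deletions turning old[i:] into new[j:].
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        cost[i][m] = n - i  # only deletions remain
    for j in range(m + 1):
        cost[n][j] = m - j  # only insertions remain
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if old[i] == new[j]:
                cost[i][j] = cost[i + 1][j + 1]  # a match is free
            else:
                cost[i][j] = 1 + min(cost[i + 1][j], cost[i][j + 1])

    ops = []
    i = j = 0
    while i < n and j < m:
        if old[i] == new[j]:
            ops.append(("equal", old[i]))
            i += 1
            j += 1
        elif cost[i + 1][j] <= cost[i][j + 1]:
            ops.append(("delete", old[i]))
            i += 1
        else:
            ops.append(("insert", new[j]))
            j += 1
    ops.extend(("delete", token) for token in old[i:])
    ops.extend(("insert", token) for token in new[j:])
    return ops


def render_inline(ops):
    """Step 4: format the operations as a single annotated line."""
    marks = {"equal": "{}", "delete": "[-{}-]", "insert": "{{+{}+}}"}
    return " ".join(marks[kind].format(token) for kind, token in ops)


ops = diff_tokens(tokenize("Hello world"), tokenize("Hi there world"))
print(render_inline(ops))  # [-Hello-] {+Hi+} {+there+} world
```

Splitting on whitespace discards the original spacing, which is fine for a demonstration but would need more care in a tool that has to reproduce the text exactly.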

Challenges and Considerations:

  • Granularity: Should the diff be word-by-word, character-by-character, or line-by-line? This sets the "resolution" of the differences. Word-level diffs generally read well for prose, character-level diffs pinpoint small edits but can be noisy, and line-level diffs are the usual default for source code.
  • Speed and Memory: For very large files, diffing can be computationally intensive. The basic table-based approach needs time and memory proportional to the product of the two inputs' token counts, so efficient implementations are crucial.
  • Moved Blocks: Some advanced diff tools can detect when entire blocks of text have been moved from one location to another, rather than just showing them as deletions and insertions.
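As a rough illustration of the granularity trade-off, the sketch below diffs the same one-word edit at three granularities. It leans on Python's standard `difflib.SequenceMatcher`, whose matching heuristic is not strictly LCS-based, and the `tokenize` and `changed_tokens` helpers are hypothetical names chosen for this example.

```python
from difflib import SequenceMatcher


def tokenize(text, granularity):
    """Split text into lines, whitespace-separated words, or characters."""
    if granularity == "line":
        return text.splitlines()
    if granularity == "word":
        return text.split()
    if granularity == "char":
        return list(text)
    raise ValueError(f"unknown granularity: {granularity}")


def changed_tokens(old_tokens, new_tokens):
    """Return (removed, added) tokens according to SequenceMatcher."""
    matcher = SequenceMatcher(None, old_tokens, new_tokens, autojunk=False)
    removed, added = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            removed.extend(old_tokens[i1:i2])
            added.extend(new_tokens[j1:j2])
    return removed, added


old_text = "total = price * qty\nprint(total)"
new_text = "total = price * quantity\nprint(total)"

for granularity in ("line", "word", "char"):
    removed, added = changed_tokens(
        tokenize(old_text, granularity), tokenize(new_text, granularity)
    )
    print(granularity, "removed:", removed, "added:", added)
```

At line level the whole first line is reported as changed, at word level only `qty` and `quantity` are, and at character level the report shrinks to a handful of individual letters.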

By understanding this process, you can appreciate the sophistication behind simply "spotting the difference" in documents or code.