FASTA loves the first token
Many tools treat the first word after > as the unique ID; everything after the first space is just description—and often gets chopped.
Paste FASTA, DNA, RNA, or protein sequence data to remove headers, line numbers, spaces, hidden formatting, and invalid characters. Validate strict DNA, IUPAC DNA, RNA, or protein residues, then copy or download clean output.
Tips: Ctrl/Cmd + K focuses the input. Validation updates as you type.
Use this FASTA cleaner to remove header lines that begin with > and extract the sequence text underneath.
Multi-FASTA input is supported, so each record can keep its own header for FASTA output or be combined into plain cleaned
sequence output. This is useful when a database export includes long descriptions, accession text, or wrapped sequence lines
that need to be cleaned before BLAST, alignment, primer design, or classroom analysis.
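The header handling described above can be sketched in Python. This is a minimal illustration, not the tool's actual code: the function names (`parse_fasta`, `split_header`, `to_plain`, `to_fasta`) and the 60-column wrap width are assumptions made for the example.

```python
from typing import Iterator, Tuple


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs; the header excludes the leading '>'."""
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None or chunks:
                yield (header or "", "".join(chunks))
            header = line[1:].strip()
            chunks = []
        else:
            chunks.append(line)  # wrapped sequence lines are joined later
    if header is not None or chunks:
        yield (header or "", "".join(chunks))


def split_header(header: str) -> Tuple[str, str]:
    """Split a header into (first token, remaining description)."""
    parts = header.split(None, 1)
    record_id = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return record_id, description


def to_plain(text: str) -> str:
    """Drop every header and combine all records into one sequence."""
    return "".join(seq for _, seq in parse_fasta(text))


def to_fasta(text: str, width: int = 60) -> str:
    """Re-emit each record with its header; width is an assumed wrap length."""
    out = []
    for header, seq in parse_fasta(text):
        out.append(">" + header)
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out)
```

For the input `>gene1 description` followed by `ATCG`, `to_plain` returns `ATCG`, and `split_header("gene1 description")` separates the ID `gene1` from the description.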
DNA sequence data often arrives with line numbers, spaces, tabs, copied PDF artifacts, or notes mixed into the bases. The
cleaner removes numeric prefixes at the start of lines and can remove all whitespace to produce one continuous sequence.
Choose strict DNA validation to allow only A, C, G, and T, or use IUPAC
mode when ambiguity codes such as N, R, and Y are expected.
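As a rough sketch of the steps above (strip numeric line prefixes, remove whitespace, then check the alphabet), the following Python may help. The helper names are hypothetical, and the IUPAC set is the standard 15-letter nucleotide code without gap characters; the tool itself may behave differently.

```python
import re

STRICT_DNA = set("ACGT")
# Standard IUPAC nucleotide codes (no gap symbols), assumed for this sketch.
IUPAC_DNA = set("ACGTRYSWKMBDHVN")


def strip_line_numbers(text: str) -> str:
    """Remove a run of digits (and surrounding blanks) at the start of each line."""
    return re.sub(r"^[ \t]*\d+[ \t]*", "", text, flags=re.MULTILINE)


def remove_whitespace(text: str) -> str:
    """Collapse the input into one continuous sequence."""
    return re.sub(r"\s+", "", text)


def is_valid(seq: str, alphabet: set) -> bool:
    """True if the sequence is non-empty and every character is allowed."""
    return bool(seq) and all(ch in alphabet for ch in seq.upper())


raw = "1 ATG CGT\n61 TTA GGC"
cleaned = remove_whitespace(strip_line_numbers(raw))
```

Here `cleaned` is `ATGCGTTTAGGC`. A sequence such as `ATGNNRY` fails the strict check but passes with `IUPAC_DNA`.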
Protein sequences use single-letter amino acid codes. In protein mode, the validator accepts the 20 standard residues by
default and can optionally allow X for unknown residues, * for stop, and the rare amino acids
U and O. The summary table counts residues so you can quickly spot unexpected symbols or unusual
composition before using the sequence downstream.
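The optional protein symbols and the residue summary could be modeled along these lines. This is an illustrative sketch only; the option names (`allow_x`, `allow_stop`, `allow_rare`) are assumptions and do not come from the tool.

```python
from collections import Counter

# The 20 standard amino acid one-letter codes.
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def protein_alphabet(allow_x=False, allow_stop=False, allow_rare=False):
    """Build the allowed residue set from the (hypothetical) option flags."""
    alphabet = set(STANDARD_AA)
    if allow_x:
        alphabet.add("X")       # unknown residue
    if allow_stop:
        alphabet.add("*")       # stop symbol
    if allow_rare:
        alphabet.update("UO")   # selenocysteine and pyrrolysine
    return alphabet


def residue_counts(seq):
    """Count each symbol, ignoring case."""
    return Counter(seq.upper())


def unexpected(seq, alphabet):
    """Return counts for symbols that are not in the allowed alphabet."""
    return {ch: n for ch, n in residue_counts(seq).items() if ch not in alphabet}
```

With the default alphabet, `unexpected("MVLSPADKTN*", protein_alphabet())` reports the `*`; with `allow_stop=True` it reports nothing.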
Strict DNA mode is best when a tool expects only the four canonical DNA bases. IUPAC mode is better for consensus sequences,
degenerate primers, mixed reads, and database records that intentionally include ambiguity codes. RNA validation is also
available for A, C, G, and U; IUPAC RNA mode accepts the same ambiguity
codes, with U in place of T. When validation fails, the tool reports the first invalid character,
its position, and nearby context.
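A failure report of this kind (first invalid character, position, nearby context) might look like the sketch below. The 1-based position, the five-character flank, and the names `Invalid` and `first_invalid` are assumptions for illustration; the tool's own report format is not documented here.

```python
from typing import NamedTuple, Optional

RNA = set("ACGU")
# IUPAC nucleotide codes with U in place of T.
IUPAC_RNA = set("ACGURYSWKMBDHVN")


class Invalid(NamedTuple):
    char: str       # the offending character as typed
    position: int   # 1-based position in the cleaned sequence (assumed)
    context: str    # a few characters on either side of the error


def first_invalid(seq: str, alphabet: set, flank: int = 5) -> Optional[Invalid]:
    """Return details of the first character outside the alphabet, or None."""
    for i, ch in enumerate(seq):
        if ch.upper() not in alphabet:
            start = max(0, i - flank)
            return Invalid(ch, i + 1, seq[start:i + flank + 1])
    return None
```

For example, `first_invalid("AUGCCGTUA", RNA)` points at the `T` in position 7, while `first_invalid("AUGNNRY", IUPAC_RNA)` returns `None`.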
> header lines and numeric prefixes.

| Problem | Dirty input | Cleaned output |
|---|---|---|
| FASTA header | >gene1 description<br>ATCG | ATCG |
| Line numbers | 1 ATG CGT 60 | ATGCGT |
| IUPAC DNA | ATGNNRY | Valid in IUPAC mode |
| RNA | AUG CCG UUA | AUGCCGUUA |
| Protein | MVLSPADKTN* | Valid if stop * enabled |
All cleaning and validation runs locally in your browser. Your sequence data is not uploaded to a server.
Copying from PDFs can add non-breaking spaces or soft hyphens that look empty but break downstream parsers—cleaning nukes those ghosts.
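One way to catch these invisible characters is to filter by Unicode category, as in this sketch. It is an assumption about how such cleaning could be done, not a description of the tool: format characters (category `Cf`, which includes the soft hyphen U+00AD and the zero-width space U+200B) are dropped, and any space separator (category `Zs`, which includes the non-breaking space U+00A0) becomes a plain space.

```python
import unicodedata


def strip_invisible(text: str) -> str:
    """Drop invisible format characters and normalize exotic spaces."""
    out = []
    for ch in text:
        category = unicodedata.category(ch)
        if category == "Cf":
            continue             # soft hyphen, zero-width space, BOM, ...
        if category == "Zs":
            out.append(" ")      # NBSP, thin space, ... become a plain space
        else:
            out.append(ch)       # letters, digits, newlines are kept as-is
    return "".join(out)
```

After this step, ordinary whitespace removal behaves as expected, because the remaining spaces are plain ASCII.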
Genome browsers often lowercase repeats or low-complexity regions; changing case can erase that “masked” hint, so convert consciously.
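Before uppercasing, it can be worth recording where the lowercase runs are. The sketch below uses a hypothetical helper, `masked_regions`, which returns 0-based, half-open spans; that coordinate convention is an assumption of the example.

```python
import re


def masked_regions(seq: str):
    """Return (start, end) spans of lowercase runs, 0-based and half-open."""
    return [(m.start(), m.end()) for m in re.finditer(r"[a-z]+", seq)]


seq = "ACGTacgtnnACGT"
regions = masked_regions(seq)             # spans of soft-masked bases
after_upper = masked_regions(seq.upper()) # empty: the masking hint is gone
```

Saving `regions` before converting case keeps the masking information available even if downstream tools need uppercase input.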
Selenocysteine (U) occurs in only a small number of proteins, and pyrrolysine (O) in only a few archaea and bacteria, so this validator keeps both optional instead of allowing them by default.
Ending primers with a G/C boosts 3′ stability; tidy, whitespace-free sequences make it easier to spot a solid clamp and avoid miscounts.
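A quick clamp check on a cleaned primer could look like this. It is a sketch under stated assumptions: the primer is written 5′ to 3′, whitespace has already been removed, and the five-base window is a commonly used convention rather than something this page specifies.

```python
def ends_with_gc(primer: str) -> bool:
    """True if the 3' terminal base is G or C."""
    return primer[-1:].upper() in ("G", "C")


def gc_clamp(primer: str, window: int = 5) -> int:
    """Count G/C bases within the last `window` bases at the 3' end."""
    tail = primer.upper()[-window:]
    return sum(tail.count(base) for base in "GC")


primer = "ATGCGTACGATCG"
clamp = gc_clamp(primer)        # G/C count in the final five bases, GATCG
terminal = ends_with_gc(primer)
```

Stray spaces or digits inside the primer would shift the window, which is why the count is only meaningful on a cleaned sequence.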