Find all the repeating substrings in a given String (or DNA chromosome sequence)

Technology CommunityCategory: StringsFind all the repeating substrings in a given String (or DNA chromosome sequence)
VietMX Staff asked 3 years ago

Suffix Tree is the way to go.

  • If you analyze the output for the string "AAAAAAAAAAAAAA", then there are O(n2)characters in it, so the algorithm is at least O(n2).
  • To achieve O(n2), just build the suffix tree for each suffix of s (indices 1..n, 2..n, 3..n, …, n..n). It doesn’t matter if one of the strings has no own end node, just count how often each node is used.
  • At the end, iterate over each node with count > 1 and print its path.

There are multiple way to construct the suffix tree:

  1. Using Ukkonen’s algorithm.
  2. Using Suffix array and LCP (longest common prefix array)