All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: “ACGAATTCCG”. When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.
Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.
For example,
Given s = “AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT”,
Return:
[“AAAAACCCCC”, “CCCCCAAAAA”].
最直接的想法是用一个hashset来存放每个见过的string,但是这样会memory limit exceeded.
Solution 1:用rolling hash对每个长度为10的字符串算出一个hash value。
public class Solution {
public List<String> findRepeatedDnaSequences(String s) {
Set<String> result = new HashSet<String>();
if (s == null || s.length() < 10) {
return new ArrayList<String>();
}
int targetHash = 0, resHash = 0, base = 7, multiplies = 1;
for (int i = 0; i < 10; i++) {
targetHash = targetHash * base + s.charAt(i);
multiplies *= base;
}
multiplies /= base;
Set<Integer> set = new HashSet<Integer>();
set.add(targetHash);
for (int i = 1; i + 10 <= s.length(); i++) {
targetHash = (targetHash - multiplies * s.charAt(i-1)) * base + s.charAt(i + 9);
if (set.contains(targetHash)) {
result.add(s.substring(i, i+10));
} else {
set.add(targetHash);
}
}
return new ArrayList<String>(result);
}
}
Solution 2:由于字符串里只有四种字符,因此每个字符用2 bits就可以表示,10个字符即20 bits,一个integer即可表示成功。
public class Solution {
public List<String> findRepeatedDnaSequences(String s) {
List<String> result = new ArrayList<String>();
Map<Character, Integer> map = new HashMap<Character, Integer>();
map.put('A', 0);
map.put('C', 1);
map.put('G', 2);
map.put('T', 3);
if (s == null || s.length() < 10) {
return new ArrayList<String>();
}
Set<Integer> unique = new HashSet<Integer>();
Set<Integer> set = new HashSet<Integer>();
for (int i = 0; i + 10 <= s.length(); i++) {
int value = 0;
for (int index = i; index < i+10; index++) {
value = (value << 2) | map.get(s.charAt(index));
}
if (set.contains(value) && !unique.contains(value)) {
result.add(s.substring(i, i+10));
unique.add(value);
} else {
set.add(value);
}
}
return result;
}
}