All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: “ACGAATTCCG”. When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.
Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.
For example,
Given s = “AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT”,
Return:
[“AAAAACCCCC”, “CCCCCAAAAA”].
最直接的想法是用一个hashset来存放每个见过的string,但是这样会memory limit exceeded.
Solution 1:用rolling hash对每个长度为10的字符串算出一个hash value。
public class Solution { public List<String> findRepeatedDnaSequences(String s) { Set<String> result = new HashSet<String>(); if (s == null || s.length() < 10) { return new ArrayList<String>(); } int targetHash = 0, resHash = 0, base = 7, multiplies = 1; for (int i = 0; i < 10; i++) { targetHash = targetHash * base + s.charAt(i); multiplies *= base; } multiplies /= base; Set<Integer> set = new HashSet<Integer>(); set.add(targetHash); for (int i = 1; i + 10 <= s.length(); i++) { targetHash = (targetHash - multiplies * s.charAt(i-1)) * base + s.charAt(i + 9); if (set.contains(targetHash)) { result.add(s.substring(i, i+10)); } else { set.add(targetHash); } } return new ArrayList<String>(result); } }
Solution 2:由于字符串里只有四种字符,因此每个字符用2 bits就可以表示,10个字符即20 bits,一个integer即可表示成功。
public class Solution { public List<String> findRepeatedDnaSequences(String s) { List<String> result = new ArrayList<String>(); Map<Character, Integer> map = new HashMap<Character, Integer>(); map.put('A', 0); map.put('C', 1); map.put('G', 2); map.put('T', 3); if (s == null || s.length() < 10) { return new ArrayList<String>(); } Set<Integer> unique = new HashSet<Integer>(); Set<Integer> set = new HashSet<Integer>(); for (int i = 0; i + 10 <= s.length(); i++) { int value = 0; for (int index = i; index < i+10; index++) { value = (value << 2) | map.get(s.charAt(index)); } if (set.contains(value) && !unique.contains(value)) { result.add(s.substring(i, i+10)); unique.add(value); } else { set.add(value); } } return result; } }