Repeated DNA Sequences

All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: “ACGAATTCCG”. When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.

Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.

For example,

Given s = “AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT”,

Return:
[“AAAAACCCCC”, “CCCCCAAAAA”].


最直接的想法是用一个hashset来存放每个见过的string,但是这样会memory limit exceeded.
Solution 1:用rolling hash对每个长度为10的字符串算出一个hash value。

public class Solution {
    public List<String> findRepeatedDnaSequences(String s) {
        Set<String> result = new HashSet<String>();
        if (s == null || s.length() < 10) {
            return new ArrayList<String>();
        }
        int targetHash = 0, resHash = 0, base = 7, multiplies = 1;
        for (int i = 0; i < 10; i++) {
            targetHash = targetHash * base + s.charAt(i);
            multiplies *= base;
        }
        multiplies /= base;
        Set<Integer> set = new HashSet<Integer>();
        set.add(targetHash);
        for (int i = 1; i + 10 <= s.length(); i++) {
            targetHash = (targetHash - multiplies * s.charAt(i-1)) * base + s.charAt(i + 9);
            if (set.contains(targetHash)) {
                result.add(s.substring(i, i+10));
            } else {
                set.add(targetHash);
            }
        }
        return new ArrayList<String>(result);
    }
}

Solution 2:由于字符串里只有四种字符,因此每个字符用2 bits就可以表示,10个字符即20 bits,一个integer即可表示成功。

public class Solution {
    public List<String> findRepeatedDnaSequences(String s) {
        List<String> result = new ArrayList<String>();
        Map<Character, Integer> map = new HashMap<Character, Integer>();
        map.put('A', 0);
        map.put('C', 1);
        map.put('G', 2);
        map.put('T', 3);
        if (s == null || s.length() < 10) {
            return new ArrayList<String>();
        }
        Set<Integer> unique = new HashSet<Integer>();
        Set<Integer> set = new HashSet<Integer>();
        for (int i = 0; i + 10 <= s.length(); i++) {
            int value = 0;
            for (int index = i; index < i+10; index++) {
                value = (value << 2) | map.get(s.charAt(index));
            }
            if (set.contains(value) && !unique.contains(value)) {
                result.add(s.substring(i, i+10));
                unique.add(value);
            } else {
                set.add(value);
            }
        }
        return result;
    }
}

Leave a comment