Leetcode: Repeated DNA Sequences

题目：

All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.

Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.

For example,

Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT",Return:["AAAAACCCCC", "CCCCCAAAAA"].

刚开始就想到用map进行子串的存储，结果Memory Limit Exceeded

我刚开始这样写的：

class Solution{public:vector<string> findRepeatedDnaSequences(string s){map<string, int> dnamap;vector<string> result;string::size_type length = s.length();string sequence;for (size_t i = 0; i + 10 < length; i++){sequence = s.substr(i, 10);dnamap[sequence] += 1;}for (std::map<string, int>::iterator it = dnamap.begin(); it != dnamap.end(); ++it){if (it->second > 1){result.push_back(it->first);}}return result;}};

后来看到https://leetcode.com/discuss/24478/i-did-it-in-10-lines-of-c的讨论，原来可以将子字符串转成数字，存储在map中，这样就不会超出限制的内存空间了。

原文是这样说的：The main idea is to store the substring as int in map to bypass the memory limits.

下面是我写的C++代码：

class Solution{public:int str2int(string s) {int num = 0;for (int i = 0; i < s.size(); ++i)//s[i]&7取出s[i]的后三个字节，因为用三个字节可以完全表示A，C，G，Tnum = (num << 3) + (s[i] & 7);return num;}vector<string> findRepeatedDnaSequences(string s) {vector<string> result;unordered_map<int, int> dictionary;string sequence;for (int i = 0; i + 10 <= s.size(); ++i){sequence = s.substr(i, 10);if (dictionary[str2int(sequence)]++ == 1)result.push_back(sequence);}return result;}};

看到很多朋友这道题都使用的是unordered_map，而不是map，然后我查阅了资料，下面是unordered_map和map的区别：

unordered_map原来是boost库中的容器，在C++11标准中被引入到STL中。unordered_map和map的区别就是，map是按照operator<比较判断元素是否相同，以及比较元素的大小，然后选择合适的位置插入到树中。所以，如果对map进行遍历（中序遍历）的话，输出的结果是有序的。顺序就是按照operator< 定义的大小排序。而unordered_map是计算元素的Hash值，根据Hash值判断元素是否相同。所以，对unordered_map进行遍历，结果是无序的。

用法的区别就是，，map的key需要定义operator<。而boost::unordered_map需要定义hash_value函数并且重载operator==。对于内置类型，如string，这些都不用操心。对于自定义的类型做key，就需要自己重载operator<或者hash_value()了。当不需要结果排好序时，最好unordered_map。

其实，C++中map对于与Java中的TreeMap，而unordered_map对应于Java中的HashMap。

C#代码：

貌似C#中Dictionary不支持对不存在的key的直接索引，所以要先判断key存不存在。C++中如果不存在会自动添加要索引的key。

public class Solution{private int StrToInt(string s){int num = 0;for (int i = 0; i < s.Length; ++i){num = (num << 3) + (s[i] & 7);}return num;}public IList<string> FindRepeatedDnaSequences(string s){IList<string> result = new List<string>();IDictionary<int, int> map = new Dictionary<int, int>();string sequence = string.Empty;int key = 0;for (int i = 0; i + 10 <= s.Length; ++i){sequence = s.Substring(i, 10);key = StrToInt(sequence);if (map.ContainsKey(key)){if (map[key]++ == 1){result.Add(sequence);}}else{map.Add(key, 1);}}return result;}}

Python代码：

注意内置函数ord将单个字符串即字符转成对于的ASCII（Return the integer ordinal of a one-character string）

class Solution:# @param s, a string# @return a list of stringsdef str2int(self, s):num = 0for i in range(len(s)):num = (num << 3) + (ord(s[i]) & 7)return numdef findRepeatedDnaSequences(self, s):length = len(s)result = []if length <= 10:return resultmap = {}i = 0while (i + 10) <= length:sequence = s[i : i + 10]num = self.str2int(sequence)if map.has_key(num):map[num] = map[num] + 1if map[num] == 1:result.append(sequence)else:map[num] = 0i = i + 1return result

自己打败自己的远远多于比别人打败的。

相关文章：

你感兴趣的文章：

标签云：