Leetcode: Repeated DNA Sequences

题目:

All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.

Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.

For example,

Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT",Return:["AAAAACCCCC", "CCCCCAAAAA"].

刚开始就想到用map进行子串的存储,结果Memory Limit Exceeded

我刚开始这样写的:

class Solution{public:vector<string> findRepeatedDnaSequences(string s){map<string, int> dnamap;vector<string> result;string::size_type length = s.length();string sequence;for (size_t i = 0; i + 10 < length; i++){sequence = s.substr(i, 10);dnamap[sequence] += 1;}for (std::map<string, int>::iterator it = dnamap.begin(); it != dnamap.end(); ++it){if (it->second > 1){result.push_back(it->first);}}return result;}};

后来看到https://leetcode.com/discuss/24478/i-did-it-in-10-lines-of-c的讨论,原来可以将子字符串转成数字,存储在map中,这样就不会超出限制的内存空间了。

原文是这样说的:The main idea is to store the substring as int in map to bypass the memory limits.

下面是我写的C++代码:

class Solution{public:int str2int(string s) {int num = 0;for (int i = 0; i < s.size(); ++i)//s[i]&7取出s[i]的后三个字节,因为用三个字节可以完全表示A,C,G,Tnum = (num << 3) + (s[i] & 7);return num;}vector<string> findRepeatedDnaSequences(string s) {vector<string> result;unordered_map<int, int> dictionary;string sequence;for (int i = 0; i + 10 <= s.size(); ++i){sequence = s.substr(i, 10);if (dictionary[str2int(sequence)]++ == 1)result.push_back(sequence);}return result;}};

看到很多朋友这道题都使用的是unordered_map,而不是map,然后我查阅了资料,下面是unordered_map和map的区别:

unordered_map原来是boost库中的容器,在C++11标准中被引入到STL中。unordered_map和map的区别就是,map是按照operator<比较判断元素是否相同,以及比较元素的大小,然后选择合适的位置插入到树中。所以,如果对map进行遍历(中序遍历)的话,输出的结果是有序的。顺序就是按照operator< 定义的大小排序。而unordered_map是计算元素的Hash值,根据Hash值判断元素是否相同。所以,对unordered_map进行遍历,结果是无序的。

用法的区别就是,,map的key需要定义operator<。 而boost::unordered_map需要定义hash_value函数并且重载operator==。对于内置类型,如string,这些都不用操心。对于自定义的类型做key,就需要自己重载operator<或者hash_value()了。当不需要结果排好序时,最好unordered_map。

其实,C++中map对于与Java中的TreeMap,而unordered_map对应于Java中的HashMap。

C#代码:

貌似C#中Dictionary不支持对不存在的key的直接索引,所以要先判断key存不存在。C++中如果不存在会自动添加要索引的key。

public class Solution{private int StrToInt(string s){int num = 0;for (int i = 0; i < s.Length; ++i){num = (num << 3) + (s[i] & 7);}return num;}public IList<string> FindRepeatedDnaSequences(string s){IList<string> result = new List<string>();IDictionary<int, int> map = new Dictionary<int, int>();string sequence = string.Empty;int key = 0;for (int i = 0; i + 10 <= s.Length; ++i){sequence = s.Substring(i, 10);key = StrToInt(sequence);if (map.ContainsKey(key)){if (map[key]++ == 1){result.Add(sequence);}}else{map.Add(key, 1);}}return result;}}

Python代码:

注意内置函数ord将单个字符串即字符转成对于的ASCII(Return the integer ordinal of a one-character string)

class Solution:# @param s, a string# @return a list of stringsdef str2int(self, s):num = 0for i in range(len(s)):num = (num << 3) + (ord(s[i]) & 7)return numdef findRepeatedDnaSequences(self, s):length = len(s)result = []if length <= 10:return resultmap = {}i = 0while (i + 10) <= length:sequence = s[i : i + 10]num = self.str2int(sequence)if map.has_key(num):map[num] = map[num] + 1if map[num] == 1:result.append(sequence)else:map[num] = 0i = i + 1return result

自己打败自己的远远多于比别人打败的。

Leetcode: Repeated DNA Sequences

相关文章:

你感兴趣的文章:

标签云: