Leetcode Repeated DNA Sequences

今天突然看到Leetcode上出了一个新题, 就立刻着手试试;

All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.

Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.

For example

Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT",Return:["AAAAACCCCC", "CCCCCAAAAA"].

题意: 所有DNA都是由一系列碱基构成, 分别为ACGT, 题目要求找出所有长度为10的子串, 这些子串在原串中出现次数必须大于1次(重复出现);

思路历程:

— 这是一个子串对自身匹配的问题;

— 第一反应有3种方法: 其一, 使用hash(c++11的unordered_set)来存储所有长度为10的子串; 具体步骤是构造unordered_set<string> repeated, 遍历输入的原串, 对s[i]到s[i+9]的序列构成的子串, 如未出现在repeated中, 则存入repeated; 如出现在repeated中, 则说明该子串曾出现过, 符合题意要求, 将其存入答案vector<string> answer;

当然, 这个代码最后是memory limit exceeded; 先看代码, 后面分析;

class Solution {public:vector<string> findRepeatedDnaSequences(string s) {vector<string> answer;unordered_set<string> repeated;for (int i = 0; i + 9 < s.size(); ++i) {string t(s, i, 10);unordered_set<string>::iterator it = repeated.find(t);if (it != repeated.end())answer.push_back(t);//在repeated中查找成功, 说明曾出现过该子串, 则存入answerelserepeated.insert(t);//repeated中查找失败, 说明未曾出现过, 则存入repeated}return answer;}};//memory limit exceeded

分析:

时间复杂度是O(n), 其中n为原串长度;

MLE的可能原因: unordered_set<string>对于超长的输入串, 会消耗大量的存储空间; 另外, 上面代码并未考虑answer中的重复答案, 因为每次只要出现在repeated中就放入answer, 这显然有问题;

改进方法:

— 由于碱基无非ACGT四种类型, 可以使用00 01 10 11四个状态代替ACGT四种碱基, 比如AAGCT就是00 00 10 01 11, 对任意一个长度为10的子串都可以转化使用20个位的int值hint; 这样就可将unordered_set<string> repeated转变为unordered_set<int> repeated, 一定程度上减少了所需的存储空间;

— 对于如何去重, 其一可以先收集所有答案, 再sort, unique去重, 当然这样很慢也很麻烦; 其二, 可以再构造一个unordered_set<int> check, 用于存储已经存入answer中的重复子串对应的hint值;

— 值得一提的是, 每次从s[i]->s[i+9]变为s[i+1]->s[i+10], 使用了这样一个方法:hint = ((hint & eraser) << 2) + ati[s[i]];其中eraser是一个宏定义, 值为0x3ffff, 二进制是00111111111111111111, 用于擦除hint中的最高2位s[i]碱基对应的值, 再左移2, 最后加上s[i+10]的碱基对应的值;

代码如下:

class Solution {public:#define eraser 0x3ffffvector<string> findRepeatedDnaSequences(string s) {vector<string> answer;int hint = 0;//存储长度为10的子串翻译后的int值if (s.size() < 10)return answer;unordered_set<unsigned int> repeated, check;//repeated存储已出现的子串, check用于防止重复答案unordered_map<char, unsigned int> ati{{'A', 0}, {'C', 1}, {'G', 2}, {'T', 3}};//此处ati是存储各碱基对应的值00 01 10 11(c++11新语法)for (int i = 0; i != 10; ++i) {hint = (hint << 2) + ati[s[i]];//用s的前10个碱基构造初始hint值}repeated.insert(hint);for (int i = 10; i != s.size(); ++i) {hint = ((hint & eraser) << 2) + ati[s[i]]; //子串变化对应hint值变化unordered_set<unsigned int>::iterator it = repeated.find(hint);if (it != repeated.end()) {it = check.find(hint);if (it == check.end()) {answer.push_back(string(s, i – 9, 10));check.insert(hint);}}elserepeated.insert(hint);}return answer;}};分析:

一开始由于忽略了移位与其他运算符的优先级关系, 一直出问题, 后来才发现hint = ((hint & eraser) << 2) + ati[s[i]];里面的外面那层括号没加上, 浪费了不少时间, 确实不应该

最后:

— 在代码中用到了unordered_set和unordered_map等散列容器, 不知道是否有O(n)算法可以不适用这些方法, 如果有请大家告诉我;

— 一开始其实还有一种想法, 就是枚举每个长度为10的子串, 使用KMP算法, 时间复杂度应该是O(n2), 但占用的存储空间为O(1), 不知道效果会怎么样, 如果有朋友试过, 也请与我讨论一下~

，生命中,每一种苦难的背后都有一片晴朗的天空

相关文章：

你感兴趣的文章：

标签云：