Tuesday, September 24, 2013

Sort on multiple columns

Data floods in during the NGS era. With a small data set, sorting can be done gracefully with just the sort and uniq commands. This becomes painfully slow for NGS data. For example, I now have 177,706,969 lines of SNP records found for dozens of IDs. I want to find the unique positions according to their scaffolds and base-pair positions. Previously, the following command could be used:

cat data | sort -k1,2 |uniq -c >some-file
However, this might take hours to finish, while the following C++ code can finish the job in just a few minutes.
#include <iostream>
#include <string>
#include <map>

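// One record: a scaffold name plus a base-pair position.
class rcd
{
public:
  std::string scf;
  int    bp;
  rcd(std::string s, int b):scf{s}, bp{b}{}
};

// Order records by scaffold first, then by position (compared as a number).
bool operator<(const rcd&a, const rcd&b)
{
  return (a.scf==b.scf)?a.bp<b.bp:a.scf<b.scf;
}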

using namespace std;

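int main(int argc, char *argv[])
{
  map<rcd,int> pos;   // record -> number of occurrences, kept sorted by key
  string ts;
  int t;
  // Read "scaffold position" pairs from standard input and count each one.
  while(cin>>ts>>t) ++pos[rcd(ts,t)];

  // Report the number of unique records on standard error.
  clog<<pos.size()<<endl;
  // Write the sorted records with their counts to standard output.
  for(auto it=pos.begin(); it!=pos.end(); ++it)
    cout<<it->first.scf<<' '<<it->first.bp<<' '<<it->second<<'\n';
  return 0;
}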
Then compile and run it:
g++ -std=c++11 sort.cpp
cat data | ./a.out > data.new
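This program sorts the data, writes each unique record with its count to data.new, and prints the total number of unique records to standard error (via clog). Note that positions are compared as integers here, whereas sort -k1,2 compares them as text, so the order of the output lines can differ.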

By changing the number of fields and the corresponding comparison operator, sorting on multiple columns can be implemented easily.
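As a rough sketch of that idea, here is how a third column might be added. The third field (called allele here), the names rcd3 and count_records, and the input layout "scaffold position allele" are all assumptions made up for illustration, not part of my data above. Instead of writing the nested comparison by hand, this version uses std::tie, which compares the fields lexicographically in the order listed.

```cpp
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <tuple>

// Hypothetical three-column record: scaffold, base-pair position, allele.
// The third column is an assumption added for illustration only.
struct rcd3
{
  std::string scf;
  int         bp;
  std::string allele;
};

// std::tie builds tuples of references; tuple comparison is lexicographic,
// so this orders by scf, then bp (as a number), then allele.
bool operator<(const rcd3& a, const rcd3& b)
{
  return std::tie(a.scf, a.bp, a.allele) < std::tie(b.scf, b.bp, b.allele);
}

// Read whitespace-separated records from `in` and count each distinct key.
// The returned map is already sorted by the operator< above.
std::map<rcd3, int> count_records(std::istream& in)
{
  std::map<rcd3, int> pos;
  rcd3 r;
  while (in >> r.scf >> r.bp >> r.allele) ++pos[r];
  return pos;
}
```

Calling count_records(std::cin) and printing the map as in the program above should give the same kind of output with one extra column. Adding further columns would only mean extending the struct, the read loop, and the two std::tie calls.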
