big-data-made-easy

2020-03-02

AI+BigData+Cloud Made Easy

A list of frameworks, libraries, resources, and shiny things, inspired by the awesome-... lists. The most frequently used or well-known items are not listed here; for those, see the awesome series: Awesome Big Data by Onur Akpolat and The Big-Data Ecosystem Table by Andrea Mostosi.

Projects

Storage Design and Data Structures

  • Db-readings - Readings in Databases .

  • Bitvector - A C++ container-like data structure for storing a vector of bits with fast appending on both sides and fast insertion in the middle, all in succinct space .

  • BitSliceIndex - Experiments on bit-slice indexing .

  • RoaringBitmap - Roaring Bitmap .

  • Pilosa - High performance OLAP based on roaring bitmap .

  • Cpp-btree - C++ in-memory containers based on a B-tree data structure.

  • Graphillion - Fast, lightweight graphset operation library .

  • Emphf - An efficient external-memory algorithm for the construction of minimal perfect hash functions .

  • Skipgraph - Implementation of skipgraph on messagepack-rpc .

  • Splay Map - STL map implemented with splay tree .

  • Cedar - C++ implementation of efficiently-updatable double-array trie .

  • WikiSort - Fast and stable sort algorithm that uses O(1) memory. Public domain .

  • Annoy - Approximate Nearest Neighbors in C++/Python optimized for memory usage and loading/saving to disk .

  • Expgram - An ngram toolkit with succinct storage .

  • Cuckoofilter - A Bloom filter replacement for approximated set-membership queries .

  • DCF - Dynamic Cuckoo Filter .

  • PackedArray - Random access array of tightly packed unsigned integers .

  • FrameOfReference - C++ library to pack and unpack vectors of integers having a small range of values using a technique called Frame of Reference .

  • FFBF - Feed-forward Bloom filters .

  • Concurrent Trees - C++ implementation of concurrent Binary Search Trees .

  • Concurrent B-Tree - A working project for High-concurrency B-tree source code in C .

  • Palmtree - An implementation of Intel's concurrent B+Tree (Palm Tree) .

  • BwTree - An open-source implementation of the Bw-Tree used in SQL Server Hekaton.

  • W-TinyLFU - C++11 header-only implementation for Window-TinyLFU Cache .

  • Block-graph - A succinct implementation of a block-graph data structure .

  • RePair-WaveletTree-Graph - Graph Implementation with repair bitmap compressed WaveletTree .

  • RLZ - Contains the RLZ compression and self-index source code .

  • Serangequerying - Space-Efficient Structures for Range Querying .

  • Succinct - Experimentation with various succinct data structures. Combines previous doc-counter and wavelet-tree repos .

  • Sdsl-lite - Succinct Data Structure Library 2.0 .

  • Relative-FMIndex - Relative FM-index, which is smaller but slower than a plain FM-index.

  • GCSA - Generalized Compressed Suffix Array.

  • Succinct - A collection of succinct data structures .

  • DYNAMIC - Dynamic succinct/compressed data structures .

  • DPT - Distributed Patricia Trie .

  • Rmq - Implementations of LCA and RMQ data structures from "The LCA Problem Revisited" .

  • YuNomi - Compressed Array Library .

  • DACs - Directly Addressable Codes (DACs): a variable-length encoding scheme for integers that enables direct access to any element of the encoded sequence while keeping the representation compact.

  • Cpi00 - The compressed permuterm index .

  • Smbt - Succinct Multibit Tree for similarity search .

  • Gwt - Graph-indexing wavelet tree for graph similarity search .

  • Webgraphs - Fast and Compact Web Graph Representations .

  • Erika-trie - Erika-trie: succinct trie library .

  • Path_decomposed_tries - Implementation of the data structures described in the paper "Fast Compressed Tries using Path Decomposition" .

  • Sumire-tries - A variety of succinct tries .

  • Trie4j - (Succinct) trie implementation in Java .

  • SuDS - Succinct Data Structures (SuDS) www.cs.helsinki.fi .

  • Marisa-trie - Marisa succinct trie .

  • LibCDS - Compact Data Structures Library .

  • HSDS - Succinct Data Structure Library Collection including bit-vector/wavelet-matrix/trie .

  • BWTIL - BWT Text Indexing Library: a set of tools to work with BWT-based text indexes .

  • Bwt-Merge - A tool for merging large BWTs .

  • PWT - Parallel Wavelet Tree and Wavelet Matrix Construction .

  • PSAC - Parallel Suffix Array, LCP Array, and Suffix Tree Construction .

  • R-Index - Optimal space run-length Burrows-Wheeler transform full-text index .

  • Fbcsa - Fixed Block based Compact Suffix Array .

  • Quantile-Index - Code for "The Quantile Index -- Succinct Self-Index for Top-k Document Retrieval" .

  • Gonzalo Navarro - Publications of Gonzalo Navarro .

  • Kvtx - Transactions over CAS; see https://docs.google.com/open?id=0B04zCRiCIQGGZDcyNTEwZGQtODk4Yy00NjEwLWI1MjQtYjc3NzJhN2RlNzk0 .

  • MemC3 - An in-memory key-value cache based on concurrent cuckoo hashing.

  • Libart - Adaptive Radix Trees implemented in C .

  • Masstree - Masstree, a fast, multi-core key-value store .

  • HyPer - A hybrid online transactional processing (OLTP) and online analytical processing (OLAP) high-performance main memory database system that is optimized for modern hardware .

  • HERD - A Highly Efficient key-value system for RDMA .

  • Nldb - Nanolat Database supporting 1M transactions per second .

  • Sophia - Modern embeddable key-value database designed for a high load environment .

  • FOEDUS - Transactional fast optimistic engine optimized for a large number of CPU cores and NVRAM storage (or fast SSD) .

  • FastBit_UDF - MySQL UDF for creating, manipulating and querying FastBit indexes .

  • Jump Consistent Hash - A Go implementation of the jump consistent hash .

  • Content Defined Chunking - High Performance Content Defined Chunking .

  • SSD optimizations - Optimizing SSD random IOPS: noop/tpps scheduler, rotational=0, add_random=0.

  • Article-SSD - Coding for SSDs - What every programmer should know about solid-state drives .

  • Article-Key-Value - Implementing a Key-Value Store .

  • Article-MVCC - Implementation of MVCC Transactions for Key-Value Stores .

  • Article-SSD - Solid-state revolution: in-depth on how SSDs really work .

  • DB Redbook - Readings in Database Systems .
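
To make the Jump Consistent Hash entry above concrete, here is a minimal Python sketch of the jump consistent hash algorithm published by Lamping and Veach. It is not taken from the linked Go project: the function name is hypothetical, and keys are assumed to be 64-bit unsigned integers (string keys would need to be hashed to an integer first).

```python
def jump_consistent_hash(key: int, num_buckets: int) -> int:
    """Map a 64-bit integer key to a bucket in [0, num_buckets).

    Illustrative sketch only; the name and signature are assumptions,
    not the API of any listed project.
    """
    if num_buckets <= 0:
        raise ValueError("num_buckets must be positive")

    mask = 0xFFFFFFFFFFFFFFFF  # emulate 64-bit unsigned arithmetic
    key &= mask
    bucket, jump = -1, 0
    while jump < num_buckets:
        bucket = jump
        # Linear congruential step: a pseudo-random sequence seeded by the key.
        key = (key * 2862933555777941757 + 1) & mask
        # Next candidate bucket; always strictly greater than the current one.
        jump = int((bucket + 1) * (float(1 << 31) / float((key >> 33) + 1)))
    return bucket


if __name__ == "__main__":
    # When a bucket is added, a key either stays put or moves to the new bucket.
    for key in (1, 42, 2**40 + 7):
        print(key, [jump_consistent_hash(key, n) for n in (1, 2, 4, 8, 16)])
```

The sketch needs no lookup table or ring structure: the bucket is recomputed from the key alone, which is why growing from n to n+1 buckets only relocates keys into the newly added bucket.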

Distributed Infrastructure for Cloud---Database and Storage

  • Cockroach - A Scalable, Geo-Replicated, Transactional Datastore .

  • TiDB - Distributed NewSQL database compatible with MySQL protocol .

  • ElastiCell - Cloud native key-value store with strong consistency and reliability .

  • Yugabyte - Cloud native database store with strong consistency and reliability .

  • FBase - Cloud native database store with strong consistency and reliability by JD.

  • Paxosstore - Cloud native key value store with strong consistency and reliability by WeChat.

  • Phxqueue - A high-availability, high-throughput and highly reliable distributed queue based on the Paxos algorithm.

  • Youzan-nsq - Youzan's modification of nsq to provide cloud native capability from reliability to auto rebalancing.

  • Baidu-Elasticsearch - Baidu's modification of Elasticsearch to provide strong data consistency and full SQL.

  • ClickHouse - Yandex's distributed column store OLAP.

  • Palo - Baidu's distributed OLAP based on Google's Mesa paper.

  • MapD - MapD OLAP based on GPU.

  • ContainerFS - Cloud native distributed filesystem for Kubernetes.

  • OpenEBS - Cloud native filesystem for Kubernetes (non-distributed).

  • Seaweed-FS - Distributed filesystem for small blob files.

  • Ambry - Distributed filesystem for small and large blob files.

  • DistributedLog - High performance replicated log service.

  • Jepsen - Correctness testing for distributed systems; Jepsen occupies a particular niche of the correctness testing landscape.

  • Namazu - Programmable fuzzy scheduler for testing distributed systems.

  • GPaxos - Golang Paxos implementation based on Phxpaxos .

  • Consensus-Yaraft - C++ Raft implementation based on Etcd's golang raft .

  • NOPaxos - Network-Ordered Paxos .

  • TAPIR - Building Consistent Transactions with Inconsistent Replication .

  • Phat - An implementation of the Chubby lock service protocol in Msgpack RPC .

  • Hydra - A distributed data processing and storage system originally developed at AddThis .

  • Summingbird - Streaming MapReduce with Scalding and Storm; see https://twitter.com/summingbird .

  • Hustle - A column oriented, embarrassingly distributed relational event database .

  • MDCC - Multi-DataCenter Consistency protocol .

  • URingPaxos - High throughput atomic multicast protocol .

  • Course-CS6452 - Datacenter Networks and Services .

Distributed Infrastructure for Cloud---Application

  • Pinpoint - Non-intrusive Dapper-like APM solution .

  • CAT - APM solution at Dianping Inc .

  • Brave - Java version of OpenZipkin .

  • Appdash - Golang version of Dapper .

  • Jaeger - Golang version of Dapper from Uber.

  • Cadence - Microservice workflow orchestrator .

  • Zeebe - Microservice workflow orchestrator .

  • F-Stack - Network framework with high performance based on DPDK .

  • DPVS - High performance Layer-4 load balancer based on DPDK .

Distributed Infrastructure for Cloud---A(AI)B(BigData)C(Cloud)

  • Galaxy - Naive scheduler for Baidu search cluster .

  • Cook - Fair job scheduler on Mesos for batch workloads and Spark .

  • Kube-arbitrator - Cluster colocation scheduler for Kubernetes .

  • BigFlow - Baidu dataflow operator .

  • Pulsar - Business level monitor and analysis .

  • Cubert - A fast and efficient batch computation engine for complex analysis and reporting of massive datasets on Hadoop .

  • Embulk - A plugin-based parallel bulk data loader that makes painful data integration work more relaxed.

  • Gobblin - Data ingestion as a service .

  • Magpie - Deploying and managing a Hadoop Yarn cluster with Docker Swarm .

  • Horovod - Uber's distributed training layer for TensorFlow, providing ring-allreduce based on MPI.

  • Angel - Tencent's parameter server infrastructure to support machine learning.

  • Ytk-Learn - Yuantiku's distributed machine learning platform.

  • Libble - LIBBLE from NJU to provide faster convergence than SGD.

  • Gloo - Facebook's communications library with various primitives for multi-machine training.

  • xLearn - High Performance, Easy-to-use, and Scalable Machine Learning Package (C++, Python, R).

  • LASER - A Scalable Response Prediction Platform For Online Advertising .

  • Hivemall - Scalable machine learning library for Hive/Hadoop .

  • Ml-ease - ADMM based large scale logistic regression .

  • Jubatus - Distributed Online Machine Learning Framework .

Concurrency

  • Concurrent Queue - A fast multi-producer, multi-consumer lock-free concurrent queue for C++11.

  • CAF - An Open Source Implementation of the Actor Model in C++ .

  • TAMER - C++ extensions for readable event-driven programming .

  • C++React - A reactive programming library for C++11 .

  • Libslock - Cross-platform atomic operations and lock algorithm library http://lpd.epfl.ch/site/ssync .

  • CDS - Header only C++ Concurrent Data Structures library .

  • Libcds - A C++ template library of lock-free and fine-grained algorithms .

  • Locksmith - A library for debugging locking in C, C++, or Objective C programs .

  • Concurrency-concepts - A guide to concurrency, multi-threading and parallel programming concepts. Explains the differences between every concept, their advantages and disadvantages in detail .

  • Concurrency Kit - Concurrency primitives, safe memory reclamation mechanisms and non-blocking data structures for the research, design and implementation of high performance concurrent systems .

  • Nanahan - A single-threaded implementation of Hopscotch hashing.

  • Scalex - Code snippets for the workshop on concurrent data structure implementation .

  • CBB - Provides a set of concurrent building blocks (Java & C/C++) that can be used to develop parallel/multi-threaded applications .

  • Thrust - A parallel algorithms library which resembles the C++ Standard Template Library (STL) .

  • Varon-t - A C implementation of Disruptor queues http://varon-t.readthedocs.org/ .

  • Lockfree Queue - Lock-free Condition Wait for Lock-free Multi-producer Multi-consumer Queue, see http://natsys-lab.blogspot.ru/2013/08/lock-free-condition-wait-for-lock-free.html .

  • Ssmem - A simple object-based memory allocator with epoch-based garbage collection, from the publication "Asynchronized Concurrency: The Secret to Scaling Concurrent Search Data Structures".

  • CLHT - A very fast and scalable (lock-based and lock-free) hash table that uses cache-line sized buckets .

  • Comsat - Comsat lets your application enjoy the scalability of asynchronous web-frameworks, serving many thousands of concurrent long-lived connections, or issuing hundreds of web-service calls for each request, all while maintaining the simple “thread per request” model .

  • Quasar-thrift - Quasar fiber based Thrift RPC .

  • Seastar - Concurrency library in user space .

  • Article-TM - Transactional Memory: History and Development .

System Performance And Profiling

Search Engine and Information Retrieval

  • Vespa - Production ready search engine to support web-scale data .

  • SF1R - A distributed massive data engine for enterprise/vertical search written in C++ .

  • BitFunnel - Signature file based search engine from Bing .

  • Trinity - Trinity IR toolkit .

  • IResearch - IR toolkit to be used for ArangoDB .

  • Partitioned_elias_fano - Code used for the experiments in the paper "Partitioned Elias-Fano Indexes" .

  • Clustered_Partitioned_elias_fano - Code used for the paper "Clustered Elias-Fano Indexes".

  • Data Structures for Inverted Indexes - Optimal Space-Time Tradeoffs for Inverted Indexes .

  • Surf - SUccinct Retrieval Framework .

  • FastPFor - Fast integer compression .

  • Indexing - Experimenting with indexing on GPUs .

  • Genie - Generic Inverted Index on GPU .

  • Simdcomp - A simple C library for compressing lists of integers .

  • SIMDCompressionAndIntersection - A C++ library to compress and intersect sorted lists of integers using SIMD instructions .

  • TurboPFor - Fastest Integer Compression .

  • Pos-cmp - Comparison framework for positional inverted indexes and self-index supporting phrase queries .

  • MaskedVByte - SIMD-accelerated VByte Compression, Publication "Vectorized VByte Decoding" .

  • Wavelet - Information Retrieval based on Wavelet Tree .

  • Shuffla - Search engine using kd-tree .

  • RoSA - Large-Scale Pattern Search Using Reduced-Space On-Disk Suffix Arrays .

  • Dualsorted - Dual sorted inverted index based on Wavelet Tree .

  • Treap - Faster and Smaller Inverted Indices with Treaps .

  • Gigablast - A distributed open source search engine and spider written in C/C++ for Linux .

  • SIMD-Based-Posting-lists - Implementation of Alexander A. Stepanov's inverted index compression algorithms.

  • Groonga - Open-source fulltext search engine and column store .

  • Atire - A search engine built using the most effective recent research techniques discovered by Information Retrieval researchers around the world .

  • Mg4j - Academic search engine with a succinct design (e.g., quasi-succinct indices).

  • Argos - A structural data search engine .

  • MFRetrieval - Tools for maximum inner product retrieval in recommender systems .

  • Faiss - A library for efficient similarity search and clustering of dense vectors .

  • Lopq - Training of Locally Optimized Product Quantization (LOPQ) models for approximate nearest neighbor search of high dimensional data in Python and Spark .
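
Several entries above (FastPFor, Simdcomp, MaskedVByte, TurboPFor) deal with compressing the sorted integer lists that make up an inverted index. As a rough illustration of the basic idea, here is a scalar Python sketch of delta (gap) encoding combined with VByte. It is not the API of any listed library: all function names are hypothetical, and the byte convention used (high bit set means "more bytes follow", low 7 bits first) is one common variant assumed for this example.

```python
from typing import Iterable, List


def vbyte_encode(numbers: Iterable[int]) -> bytes:
    """Encode non-negative integers, 7 payload bits per byte."""
    out = bytearray()
    for n in numbers:
        if n < 0:
            raise ValueError("VByte encodes non-negative integers only")
        while n >= 0x80:
            out.append((n & 0x7F) | 0x80)  # high bit set: more bytes follow
            n >>= 7
        out.append(n)  # final byte of this integer has the high bit clear
    return bytes(out)


def vbyte_decode(data: bytes) -> List[int]:
    """Inverse of vbyte_encode."""
    numbers, value, shift = [], 0, 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            numbers.append(value)
            value, shift = 0, 0
    return numbers


def compress_postings(doc_ids: List[int]) -> bytes:
    """Store a sorted posting list as VByte-encoded gaps between IDs."""
    gaps, prev = [], 0
    for doc_id in doc_ids:
        gaps.append(doc_id - prev)
        prev = doc_id
    return vbyte_encode(gaps)


def decompress_postings(data: bytes) -> List[int]:
    """Rebuild document IDs by prefix-summing the decoded gaps."""
    doc_ids, total = [], 0
    for gap in vbyte_decode(data):
        total += gap
        doc_ids.append(total)
    return doc_ids


if __name__ == "__main__":
    postings = [3, 7, 130, 1_000_000]
    packed = compress_postings(postings)
    print(len(packed), decompress_postings(packed) == postings)
```

Gaps between neighbouring document IDs are usually much smaller than the IDs themselves, so most of them fit in a single byte. The SIMD libraries listed above aim to speed up this kind of decoding rather than change the underlying principle.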

