
src-d / awesome-machine-learning-on-source-code

Cool links & research papers related to Machine Learning applied to source code (MLonCode)


Awesome Machine Learning On Source Code

Notice: This repository is no longer actively maintained. No further updates will be made, and issues and PRs will not be answered. An actively maintained alternative can be found at the ml4code.github.io repository.

A curated list of awesome research papers, datasets and software projects devoted to machine learning and source code. #MLonCode

Contents

Digests

Conferences

Competitions

  • CodRep - Competition on automatic program repair: given a source line, find the insertion point.

Papers

Program Synthesis and Induction

Source Code Analysis and Language modeling

Neural Network Architectures and Algorithms

Embeddings in Software Engineering

Program Translation

Code Suggestion and Completion

Program Repair and Bug Detection

APIs and Code Mining

Code Optimization

Topic Modeling

Sentiment Analysis

Code Summarization

Clone Detection

Differentiable Interpreters

<a name="related-research"></a>

<details> <summary>Related research</summary>

AST Differencing

Binary Data Modeling

Soft Clustering Using T-mixture Models

Natural Language Parsing and Comprehension

</details>

Posts

Talks

Software

Machine Learning

  • Differentiable Neural Computer (DNC) - TensorFlow implementation of the Differentiable Neural Computer.
  • sourced.ml - Abstracts feature extraction from source code syntax trees and working with ML models.
  • vecino - Finds similar Git repositories.
  • apollo - Source code deduplication at scale, research.
  • gemini - Source code deduplication at scale, production.
  • enry - Insanely fast file-based programming language detector.
  • hercules - Git repository mining framework with batteries on top of go-git.
  • DeepCS - Keras and PyTorch implementations of DeepCS (Deep Code Search).
  • Code Neuron - Recurrent neural network to detect code blocks in natural language text.
  • Naturalize - Language-agnostic framework for learning coding conventions from a codebase and then exploiting this information to suggest better identifier names and formatting changes in the code.
  • Extreme Source Code Summarization - Convolutional attention neural network that learns to summarize source code into a short method name-like summary by just looking at the source code tokens.
  • Summarizing Source Code using a Neural Attention Model - CODE-NN uses LSTM networks with attention to produce sentences that describe C# code snippets and SQL queries from StackOverflow. Torch implementation, applied to C# and SQL.
  • Probabilistic API Miner - Near parameter-free probabilistic algorithm for mining the most interesting API patterns from a list of API call sequences.
  • Interesting Sequence Miner - Novel algorithm that mines the most interesting sequences under a probabilistic model. It is able to efficiently infer interesting sequences directly from the database.
  • TASSAL - Tool for the automatic summarization of source code using autofolding. Autofolding automatically creates a summary of a source code file by folding non-essential code and comment blocks.
  • JNice2Predict - Efficient and scalable open-source framework for structured prediction, enabling one to build new statistical engines more quickly.
  • Clone Digger - Clone detection for Python and Java.
  • Sensibility - Uses LSTMs to detect and correct syntax errors in Java source code.
  • DeepBugs - Framework for learning bug detectors from an existing code corpus.
  • DeepSim - Deep learning-based approach to measuring the functional similarity of code.
  • rnn-autocomplete - Neural code autocompletion with RNN (bachelor's thesis).
  • MindsDB - Explainable AutoML framework for developers. It lets you build, train and use state-of-the-art ML models in as little as one line of code.

Utilities

  • go-git - Highly extensible Git implementation in pure Go which is friendly to data mining.
  • bblfsh - Self-hosted server for source code parsing.
  • engine - Scalable and distributed data retrieval pipeline for source code.
  • minhashcuda - Weighted MinHash implementation on CUDA to efficiently find duplicates.
  • kmcuda - k-means on CUDA to cluster and to search for nearest neighbors in dense space.
  • wmd-relax - Python package which finds nearest neighbors under Word Mover's Distance.
  • Tregex, Tsurgeon and Semgrex - Tregex is a utility for matching patterns in trees, based on tree relationships and regular expression matches on nodes (the name is short for "tree regular expressions").
  • source{d} models - Machine Learning models for MLonCode trained using the source{d} stack.
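
Several of the utilities above rely on MinHash to find near-duplicate code. As a rough illustration of the idea only, here is a minimal CPU sketch of plain (unweighted) MinHash in Python. It is not the minhashcuda API and does not implement the weighted variant that minhashcuda provides; the function names `tokenize`, `minhash_signature` and `estimate_jaccard` are hypothetical, and the whitespace tokenizer is a deliberate simplification.

```python
import hashlib
from typing import Iterable, List, Set


def tokenize(source: str) -> Set[str]:
    """Crude tokenizer (assumption): split on whitespace and keep unique tokens."""
    return set(source.split())


def _seeded_hash(token: str, seed: int) -> int:
    """Deterministic 64-bit hash of a token; the seed selects the hash function."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(tokens: Iterable[str], num_hashes: int = 128) -> List[int]:
    """For each of num_hashes hash functions, keep the minimum hash over all tokens."""
    items = list(tokens)
    if not items:
        raise ValueError("cannot build a signature for an empty token set")
    return [min(_seeded_hash(t, seed) for t in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """The fraction of matching signature positions estimates the Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


if __name__ == "__main__":
    file_a = "def add(a, b): return a + b"
    file_b = "def add(x, y): return x + y"
    sig_a = minhash_signature(tokenize(file_a))
    sig_b = minhash_signature(tokenize(file_b))
    print(f"estimated similarity: {estimate_jaccard(sig_a, sig_b):.2f}")
```

Two files whose signatures agree in most positions are likely near-duplicates. Comparing fixed-length signatures instead of full token sets is what makes this kind of search practical on large corpora.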

Datasets

Credits

Contributions

See CONTRIBUTING.md. TL;DR: create a pull request which is signed off.

License

License: CC BY-SA 4.0