Commit c4381d0

Added taxonomy docs, and cmb todo
Migrated the taxonomy docs over. Signed-off-by: JJ Asghar <awesome@ibm.com>
1 parent d7feb5b commit c4381d0

File tree

8 files changed

+864
-2
lines changed

docs/cmb/index.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1 @@
1+
TODO

docs/index.md

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
---
2-
title: Introduction
3-
description: Introduction docs.instructlab.ai
2+
title: Welcome to InstructLab!
3+
description: The overview of 🐶 InstructLab.
44
logo: images/ilab_dog.png
55
---
66
# Welcome to the 🐶 InstructLab Project

docs/taxonomy/index.md

Lines changed: 134 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,134 @@
1+
---
2+
title: Welcome to InstructLab's Taxonomy
3+
description: The overview of 🐶 InstructLab's Taxonomy.
4+
logo: images/ilab_dog.png
5+
---
6+
## Welcome to the InstructLab Taxonomy
7+
8+
InstructLab 🐶 uses a novel synthetic data-based alignment tuning method for
9+
Large Language Models (LLMs). The "**lab**" in Instruct**Lab** 🐶 stands for
10+
[**L**arge-Scale **A**lignment for Chat**B**ots](https://arxiv.org/abs/2403.01081) [1].
11+
12+
The LAB method is driven by taxonomies, which are largely created manually and
13+
with care.
14+
15+
This repository contains a taxonomy tree that allows you to create models
16+
tuned with your data (enhanced via synthetic data generation) using the LAB 🐶
17+
method.
18+
19+
[1] Shivchander Sudalairaj*, Abhishek Bhandwaldar*, Aldo Pareja*, Kai Xu, David D. Cox, Akash Srivastava*. "LAB: Large-Scale Alignment for ChatBots", arXiv preprint arXiv: 2403.01081, 2024. (* denotes equal contributions)
20+
21+
## Choosing domains for the taxonomy
22+
23+
In general, we use the Dewey Decimal Classification (DDC) System to determine our domains (and subdomains) in the taxonomy. This [DDC SUMMARIES document](https://www.oclc.org/content/dam/oclc/dewey/resources/summaries/deweysummaries.pdf) is a great resource for determining where a topic might be classified.
24+
25+
If you are unsure where to put your knowledge or compositional skill, create a folder in the `miscellaneous_unknown` folder under the `knowledge` or `compositional_skills` folders.
26+
27+
## Learning
28+
29+
Learn about the concepts of "skills" and "knowledge" in our [InstructLab Community Learning Guide](https://github.com/instructlab/community/blob/main/docs/README.md).
30+
31+
## Taxonomy tree Layout
32+
33+
The taxonomy tree is organized in a cascading directory structure. At the end of
34+
each branch, there is a YAML file (qna.yaml) that contains the examples for that
35+
domain. Maintainers can decide to change the names of the existing branches or to add new branches.
36+
37+
!!! important
38+
Folder names do not have spaces. Use underscores between words.
39+
40+
## Taxonomy diagram
41+
42+
!!! note
43+
This diagram shows a subset of the taxonomy. It is not a complete representation.
44+
45+
```mermaid
46+
flowchart TD;
47+
na[not accepting contributions\n at this time]:::na
48+
taxonomy --> foundational_skill & compositional_skills & knowledge
49+
50+
foundational_skill:::na --> reasoning:::na
51+
reasoning:::na --> common_sense_reasoning:::na
52+
reasoning:::na --> mathematical_reasoning:::na
53+
reasoning:::na --> theory_of_mind:::na
54+
55+
compositional_skills --> engineering
56+
compositional_skills --> grounded
57+
compositional_skills --> linguistics
58+
59+
grounded --> grounded/arts
60+
grounded --> grounded/geography
61+
grounded --> grounded/history
62+
grounded --> grounded/science
63+
64+
knowledge --> knowledge/arts
65+
66+
knowledge --> knowledge/miscellaneous_unknown
67+
knowledge --> knowledge/science
68+
knowledge --> knowledge/technology
69+
knowledge/science --> animals --> birds --> black_capped_chickadee --> black_capped_chickadee-a & black_capped_chickadee-q
70+
knowledge/science --> astronomy --> constellations --> phoenix --> phoenix-a & phoenix-q
71+
72+
black_capped_chickadee-a{attribution.txt}
73+
black_capped_chickadee-q{qna.yaml}
74+
phoenix-a{attribution.txt}
75+
phoenix-q{qna.yaml}
76+
classDef na fill:#EEE
77+
```
78+
79+
Below is an illustrative directory structure to show this layout:
80+
81+
```ascii
82+
.
83+
└── linguistics
84+
    ├── writing
85+
    │   ├── brainstorming
86+
    │   │   ├── idea_generation
87+
    │   │   └── qna.yaml
88+
    │   │       attribution.txt
89+
    │   │   ├── refute_claim
90+
    │   │   └── qna.yaml
91+
    │   │       attribution.txt
92+
    │   ├── prose
93+
    │   │   ├── articles
94+
    │   │       └── qna.yaml
95+
    │   │           attribution.txt
96+
    └── grammar
97+
        └── qna.yaml
98+
        │   attribution.txt
99+
        └── spelling
100+
            └── qna.yaml
101+
                attribution.txt
102+
```
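
The layout rules above can be checked mechanically. The following is a minimal sketch, not part of the InstructLab tooling: the function name `find_leaf_problems` is hypothetical, and it assumes that every folder holding a `qna.yaml` should also hold an `attribution.txt`, as the illustrative tree suggests.

```python
import os


def find_leaf_problems(root):
    """Return a sorted list of layout problems found under a taxonomy root.

    Assumption (not confirmed by the project): a leaf is any folder that
    contains a qna.yaml file, and each leaf should also contain an
    attribution.txt file.
    """
    problems = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Folder names should use underscores instead of spaces.
        for name in dirnames:
            if " " in name:
                problems.append(
                    f"folder name contains a space: {os.path.join(dirpath, name)}"
                )
        # Flag leaves that have examples but no attribution file.
        if "qna.yaml" in filenames and "attribution.txt" not in filenames:
            problems.append(
                f"missing attribution.txt next to: {os.path.join(dirpath, 'qna.yaml')}"
            )
    return sorted(problems)
```

Calling `find_leaf_problems("path/to/taxonomy")` on a local checkout returns an empty list when both conventions hold.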
103+
## Contribute knowledge and skills to the taxonomy
104+
105+
Contributing to a Large Language Model (LLM) has been difficult, in no small part because access to the necessary compute infrastructure is hard to get.
106+
107+
This taxonomy repository will be used as the seed to synthesize the training data for InstructLab-trained models. We intend to retrain the model(s) using the main branch following InstructLab's progressive training on a regular basis. This enables fast iteration of the model(s), for the benefit of the open source community.
108+
109+
By contributing your skills and knowledge to this repository, you will see your changes built into an LLM within days of your contribution rather than months or years! If you are working with a model and notice its knowledge or ability lacking, you can correct it by contributing knowledge or skills and check if it's improved after your changes are built.
110+
111+
While public contributions are welcome to help drive community progress, you can also fork this repository under [the Apache License, Version 2.0](LICENSE), add your own internal skills, and train your own models internally. However, you might need your own access to significant compute infrastructure to perform sufficient retraining.
112+
113+
## Ways to Contribute
114+
115+
You can contribute to the taxonomy in the following two ways:
116+
117+
1. Adding new examples to **existing leaf nodes**.
118+
2. Adding **new branches/skills** corresponding to the existing domain.
119+
120+
For more information, see the [Ways of contributing to the taxonomy repository](https://github.com/instructlab/taxonomy/blob/main/CONTRIBUTING.md#ways-of-contributing-to-the-taxonomy-repository) documentation.
121+
122+
## How to contribute skills and knowledge
123+
124+
To contribute to this repo, you'll use the *Fork and Pull* model common in many open source repositories. You can add your skills and knowledge to the taxonomy in multiple ways; for additional information on how to make a contribution, see the [Documentation on contributing](CONTRIBUTING.md). You can also use the following guides to help with contributing:
125+
126+
- Contributing using the [GitHub webpage UI](docs/contributing_via_GH_UI.md).
127+
- Contributing knowledge to the taxonomy in the [Knowledge contribution guidelines](docs/knowledge-contribution-guide.md).
128+
129+
### Why should I contribute?
130+
131+
This taxonomy repository will be used as the seed to synthesize the training
132+
data for InstructLab-trained models. We intend to retrain the model(s) using the main
133+
branch as often as possible (at least weekly).
134+
Fast iteration of the model(s) benefits the open source community and enables model developers who do not have access to the necessary compute infrastructure.
Lines changed: 75 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,75 @@
1+
---
2+
title: Knowledge Contribution Guidelines
3+
description: The overview of 🐶 InstructLab's Knowledge contribution guidelines
4+
logo: images/ilab_dog.png
5+
---
6+
7+
You can create a Git repository to host your knowledge contributions anywhere (GitLab, Gerrit, etc.) but it may be favorable to create one on GitHub. The following instructions show you how to create a knowledge repository in GitHub and contribute to the taxonomy.
8+
9+
## Prerequisites
10+
11+
- You have a GitHub account
12+
- You have a forked copy of the [taxonomy](https://github.com/instructlab/taxonomy/tree/main) repository
13+
- Verify that the model does not already know the knowledge you want to submit
14+
15+
## Creating your own knowledge repository
16+
17+
To create a new GitHub repository, follow the GitHub documentation in [Creating a new repository](https://docs.github.com/en/repositories/creating-and-managing-repositories/creating-a-new-repository).
18+
19+
The specific steps are listed as follows:
20+
21+
1. In your GitHub profile page, navigate to the repositories tab. You will see a search bar where you can search your repositories, or create a new one.
22+
2. This takes you to a page titled “Create a new repository”. Create a custom name for your repository and add a README.md file. For example, “knowledge_contributions” could be a good name for your repository.
23+
3. Click “Create” when you are all set.
24+
25+
## Convert your knowledge documentation to markdown
26+
27+
There are many online tools that can help you convert your documents to markdown. If you are using a wiki page for your contributions, you can use [pandoc](https://pandoc.org/try/) to convert the documents. For Wikipedia sources, set `from: mediawiki` and `to: markdown_strict` in pandoc to get the proper markdown format.
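
If you prefer to run the conversion locally instead of in the online tool, a sketch along these lines may help. It assumes pandoc is installed on your machine; the helper names and the file names `article.wiki` and `article.md` are hypothetical placeholders.

```python
import shutil
import subprocess


def build_pandoc_command(source_path, output_path):
    """Build a pandoc command that converts MediaWiki markup to strict Markdown."""
    return [
        "pandoc",
        "--from", "mediawiki",
        "--to", "markdown_strict",
        "--output", output_path,
        source_path,
    ]


def convert(source_path, output_path):
    """Run the conversion; raises if pandoc is not available on PATH."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc was not found on PATH")
    subprocess.run(build_pandoc_command(source_path, output_path), check=True)


if __name__ == "__main__":
    # Only prints the command; call convert(...) to actually run it.
    print(" ".join(build_pandoc_command("article.wiki", "article.md")))
```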
28+
29+
## Add the markdown file to your repository
30+
31+
To add a file to your GitHub repository, follow the GitHub documentation in [Adding a file to a repository](https://docs.github.com/en/repositories/working-with-files/managing-files/adding-a-file-to-a-repository).
32+
33+
The specific steps are listed as follows:
34+
35+
1. Navigate to “Add files”. Click “Create new file” if you want to manually add your markdown content. Click “Upload files” if you have a file locally to add.
36+
2. Add a description and commit your changes.
37+
38+
Since this is your own repository, you can commit directly to the `main` branch.
39+
40+
3. You can then see your new content in your repository.
41+
42+
!!! important
43+
Make a note of your commit SHA; you need it for your `qna.yaml`.
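
If you work from a local clone, one way to capture the commit SHA is sketched below. This is only an illustration: it assumes `git` is installed, that the repository uses 40-character SHA-1 commit IDs, and the helper names are hypothetical.

```python
import re
import subprocess

# Assumption: full commit IDs are 40 lowercase hexadecimal characters (SHA-1).
FULL_SHA = re.compile(r"[0-9a-f]{40}")


def is_full_commit_sha(value):
    """Return True if the value looks like a full (not abbreviated) commit SHA."""
    return FULL_SHA.fullmatch(value) is not None


def current_commit_sha(repo_dir="."):
    """Ask git for the SHA of the commit currently checked out in repo_dir."""
    result = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        cwd=repo_dir,
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()
```

Checking with `is_full_commit_sha` before pasting the value into `qna.yaml` helps catch abbreviated SHAs such as `c4381d0`.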
44+
45+
## Create a pull request in the taxonomy repository
46+
47+
Navigate to your forked taxonomy repository and ensure it is up-to-date.
48+
49+
There are a few ways you can create a pull request:
50+
51+
- For details on the local process, check out [The GitHub Workflow Guide](https://github.com/kubernetes/community/blob/master/contributors/guide/github-workflow.md) in the Kubernetes documentation and the [GitHub flow](https://docs.github.com/en/get-started/using-github/github-flow) in the GitHub documentation.
52+
- For details on contributing using the GitHub webpage UI, see [Contributing using the GH UI](https://github.com/instructlab/taxonomy/docs/contributing_via_GH_UI.md) or [Creating a pull request](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/creating-a-pull-request?tool=webui) in the GitHub documentation.
53+
54+
## Verification
55+
56+
Here are a few things to check before seeking reviews for your contribution:
57+
58+
- Your `qna.yaml` follows the proper formatting. See examples in [Knowledge: YAML examples](https://github.com/instructlab/taxonomy/blob/main/README.md#knowledge-yaml-examples).
59+
- Ensure all parameters are set, especially the `document`, `repo`, `commit` and `pattern` keys; these parameters are specific to knowledge contributions and require more analysis.
60+
- Include an `attribution.txt` file for citing your sources. See [For your attribution.txt file](https://github.com/instructlab/taxonomy/blob/main/CONTRIBUTING.md#for-your-attributiontxt-file) for more information.
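
The key check in the list above could be automated roughly as follows. This sketch works on an already-parsed `qna.yaml` (a plain dictionary) and assumes that `repo`, `commit` and `pattern` are nested under `document`; that nesting and the helper name are assumptions, so compare against the linked YAML examples before relying on it.

```python
# Assumption: these keys live inside the top-level "document" mapping.
REQUIRED_DOCUMENT_KEYS = ("repo", "commit", "pattern")


def missing_document_keys(qna):
    """Return the required knowledge keys that are absent or empty in qna."""
    document = qna.get("document")
    if not isinstance(document, dict):
        # Without a document mapping, none of the nested keys can be set.
        return ["document"]
    return [key for key in REQUIRED_DOCUMENT_KEYS if not document.get(key)]


if __name__ == "__main__":
    example = {
        "document": {
            "repo": "https://example.com/me/knowledge_contributions.git",  # placeholder URL
            "commit": "a" * 40,  # placeholder SHA
        }
    }
    print(missing_document_keys(example))
```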
61+
62+
## PR Upstream Workflow
63+
64+
The following table outlines the expected timing for the PR(s) you have submitted. Each PR goes through several steps and checks, and
65+
the `label` on your PR tells you which step it is currently in.
66+
67+
| Label | Actor | Action | Duration |
68+
| --- | --- | --- | --- |
69+
| | Contributor | Submit PR | - |
70+
| | Contributor | Fix failed PR checks | - |
71+
| https://github.com/instructlab/taxonomy/labels/triage-needed | Triager | Review PR, ask for changes | Days |
72+
| https://github.com/instructlab/taxonomy/labels/triage-requested-changes | Contributor | Make requested changes | Days |
73+
| https://github.com/instructlab/taxonomy/labels/precheck-generate-ready | Triager | Run prechecks and generate | Days |
74+
| https://github.com/instructlab/taxonomy/labels/community-build-ready | Backend | Model gets retrained | Weeks |
75+
| | Triager | Check the numbers and PR merged or closed | - |
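
As a quick reference, the labelled rows of the table can be expressed as a lookup. The label names are taken from the final path segment of the URLs in the table; the helper name `describe_stage` is hypothetical and this is only a convenience sketch.

```python
# Label -> (actor, action, expected duration), copied from the table above.
PR_STAGES = {
    "triage-needed": ("Triager", "Review PR, ask for changes", "Days"),
    "triage-requested-changes": ("Contributor", "Make requested changes", "Days"),
    "precheck-generate-ready": ("Triager", "Run prechecks and generate", "Days"),
    "community-build-ready": ("Backend", "Model gets retrained", "Weeks"),
}


def describe_stage(label):
    """Return a one-line description of where a labelled PR is in the workflow."""
    stage = PR_STAGES.get(label)
    if stage is None:
        return f"{label}: not a workflow label listed in the table"
    actor, action, duration = stage
    return f"{label}: {actor}, {action} (expected: {duration})"


if __name__ == "__main__":
    print(describe_stage("community-build-ready"))
```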

docs/taxonomy/knowledge/guide.md

Lines changed: 157 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,157 @@
1+
---
2+
title: Knowledge Guide
3+
description: The overview of 🐶 InstructLab's knowledge
4+
logo: images/ilab_dog.png
5+
---
6+
# What is "Knowledge"?
7+
8+
Knowledge consists of data and facts and is backed by documents. When you create knowledge for a model, you're giving it additional data to more accurately answer questions.
9+
10+
Knowledge contributions in this project contain a few things.
11+
12+
- A file in a git repository that holds your information. For example, these repositories can include markdown versions of information on: Oscar 2024 winners, Law books, Shakespeare, Sports, Chemistry, etc.
13+
- A `qna.yaml` file that asks and answers questions about the information in the git repository.
14+
- An `attribution.txt` file that includes the sources for the information used in the `qna.yaml`.
15+
16+
You can learn more about the knowledge structure in [Getting Started with Knowledge contributions](https://github.com/instructlab/taxonomy/blob/main/README.md#getting-started-with-knowledge-contributions).
17+
18+
## Accepted Knowledge
19+
20+
!!! important
21+
We are currently only accepting knowledge contributions as a limited private beta and sources will be limited to articles from Wikipedia.
22+
23+
There are a few domains of knowledge that we are currently accepting. For a full list of knowledge fields, see [Knowledge domains](https://github.com/instructlab/taxonomy/blob/main/knowledge/knowledge_domains.md) in the taxonomy documentation.
24+
25+
A few examples are as follows:
26+
27+
### STEM fields
28+
29+
- Physics
30+
- Astronomy and Astrophysics
31+
- Quantum Mechanics
32+
- Special Relativity and General Relativity
33+
34+
- Chemistry & Chemical Engineering
35+
- Organic Chemistry
36+
- Inorganic Chemistry
37+
- Chemical engineering
38+
- Biotechnology
39+
40+
- Earth & Environmental Science
41+
- Geology
42+
- Geography
43+
44+
- Biology & Life Sciences
45+
- Plants (Botany)
46+
- Medicine & health
47+
48+
- Electrical Engineering
49+
- Bioengineering
50+
- Civil Engineering
51+
- Industrial Engineering
52+
53+
### Legal and regulatory
54+
55+
- Intellectual Property
56+
- Criminal Law
57+
- Civil Rights
58+
- Healthcare compliance
59+
60+
### Economy and Business
61+
62+
- Economy and Businesses
63+
- Accounting and Finance
64+
- Marketing
65+
- Human Resource
66+
- Management
67+
68+
### Philosophy
69+
70+
- Philosophy
71+
- Metaphysics
72+
- Epistemology
73+
- Ethics
74+
- Parapsychology & occultism
75+
- Philosophical schools of thought
76+
77+
### Literature
78+
79+
- Literature, rhetoric & criticism
80+
- American literature in English
81+
- Other literatures
82+
83+
## Avoid These Topics
84+
85+
While the tuning process may eventually help the models work with complex social topics, this is currently an area of active research that we do not want to take lightly. Therefore, please keep your submissions clear of the following topics:
86+
87+
- PII (personally identifiable information) or any content invasive of individual privacy rights
88+
- Violence including self-harm
89+
- Cyber Bullying
90+
- Internal documentation or other material that is confidential to your employer or organization, e.g. trade secrets
91+
- Discrimination
92+
- Religion
93+
- Facts such as, "[Christianity is, according to the 2011 census, the fifth most practiced religion in Nepal, with 375,699 adherents, or 1.4% of the population](https://en.wikipedia.org/wiki/Christianity_in_Nepal)", are fine as a knowledge contribution. Advocating in favor of or against any religious faith is not acceptable.
94+
- Medical or health information
95+
- Facts such as, "[In mammals, pulmonary ventilation occurs via inhalation (breathing)](https://opentextbc.ca/biology/chapter/11-3-circulatory-and-respiratory-systems/)," are fine as a knowledge contribution. Tailored medical/health advice is not acceptable.
96+
- Financial information
97+
- Facts such as "[laissez-faire economics ... argues that market forces alone should drive the economy and that governments should refrain from direct intervention in or moderation of the economic system](https://openstax.org/books/world-history-volume-2/pages/6-3-capitalism-and-the-first-industrial-revolution)," are fine as a knowledge contribution. Tailored financial advice is not acceptable.
98+
- Legal settlements/mitigations
99+
- Gender Bias
100+
- Hostile Language, threats, slurs, derogatory or insensitive jokes or comments
101+
- Profanity
102+
- Pornography and sexually explicit or suggestive content
103+
- Any contributions that would allow for automated decision making that affects an individual's rights or well-being, e.g. social scoring
104+
- Any contributions that engage in political campaigning or lobbying
105+
106+
We are also not accepting submissions of the following content:
107+
108+
- Code
109+
- Anything that can be traced back to code for a computer. This includes, but is not limited to, `sed` or `bash` commands, `yaml` files for OpenShift or Kubernetes, `python` snippets, and `Java` suggestions. There are specific models focused on this space, and this model is not one of them for the time being.
110+
- Jokes
111+
- Poems
112+
113+
We received many joke and poem submissions at the beginning of the project. Because jokes are "in the eye of the beholder" and puns require nuance that favors native English speakers, we realized we were possibly unconsciously biasing our model. Both topics have their own challenges, and we were unable to reach consensus on what a generalized approach would look like. For now, we're not accepting additional submissions of jokes and poems.
114+
115+
## Building Your LLM Intuition
116+
117+
LLMs have inherent limitations that make certain tasks extremely difficult, like doing math problems. They're great at other tasks, like creative writing. And they could be better at things like logical reasoning.
118+
119+
Giving an LLM knowledge creates a basis of information that it can learn from. You can then teach it to use this knowledge via the `qna.yaml` files.
120+
121+
For example, you can give an LLM the entire periodic table, then in a `qna.yaml` add something like:
122+
123+
```yaml
124+
question: What is the symbol and atomic number for Chlorine?
125+
answer: |
126+
The symbol for chlorine is Cl and the atomic number is 17.
127+
```
128+
129+
With a few of these question-and-answer pairs, the model will learn the periodic table because it has the knowledge data.
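
To draft several such pairs consistently, you could generate them from a small table of facts, as in the sketch below. The element list and the helper name `make_qna` are illustrative; the output is a list of plain dictionaries that you would still need to write into your `qna.yaml` in the project's format.

```python
# A tiny, hand-picked subset of the periodic table: (name, symbol, atomic number).
ELEMENTS = [
    ("Hydrogen", "H", 1),
    ("Helium", "He", 2),
    ("Chlorine", "Cl", 17),
]


def make_qna(name, symbol, atomic_number):
    """Build one question/answer pair in the same shape as the YAML example."""
    return {
        "question": f"What is the symbol and atomic number for {name}?",
        "answer": (
            f"The symbol for {name.lower()} is {symbol} "
            f"and the atomic number is {atomic_number}."
        ),
    }


examples = [make_qna(*element) for element in ELEMENTS]

if __name__ == "__main__":
    for example in examples:
        print(example["question"])
        print(example["answer"])
```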
130+
131+
### LLMs are great at
132+
133+
For these, however, it's common for LLMs to already have excellent performance. Try 3-5 examples in `lab chat` to confirm a deficit in the model before you build your submission, and share the examples in your Pull Request (PR).
134+
135+
- Brainstorming
136+
- Creativity
137+
- Connecting information
138+
- Cross-lingual behavior
139+
140+
### LLMs need help with
141+
142+
These sorts of tasks are very difficult for the model to get right. Try several examples to understand the nuances of the model's ability to do them, and consider using corrections to the results you get in your tuning process.
143+
144+
- Chains of reasoning
145+
- Analysis
146+
- Story plots
147+
- Reassembling information
148+
- Effective and succinct summaries
149+
150+
### LLMs are not so great at
151+
152+
LLMs may struggle with solving math and computation. That said, improving some of these foundational skills may be something this work tackles in the future, but not at this time.
153+
154+
- Math
155+
- Computation
156+
- "Turing-complete" type tasks
157+
- Generating only true real-world information (they're prone to hallucinations)
