Note: This is Part 2 of the Language Learning: Japanese series; please refer to the link for other related tutorials.
Making Kanji Lists
(Note: Please read the Introduction, and Part 1 first.)
This entry will teach you how to make (automated) kanji lists.
Why this method may appeal to you: You will not have to hunt down the meanings of individual kanji any further.
By making a list of kanji beforehand, the contents of which may number in the hundreds (even thousands, if need be), you save time and energy. Better still, the kanji list contains characters in the order that they appear in the actual text, and, if you want, it can consist only of uniques (i.e. there won’t be any duplicates).
This method is especially helpful when you are dealing with a text that contains many (as yet) unknown kanji characters.
What You Will Have In Your Hands
Have a look:
The above is a snippet of the kanji list from the example parallel text I made for Part 1 of the series.
Tools, Places, And Instructions
1. You need Editpad Pro, because we will be playing with regular expressions (regex) after this. Also, you should have OpenOffice Calc. (I put these apps as requirements in the Introduction.)
2. Professor Jim Breen’s WWWJDIC. (Yes, that site will be your best friend from now on.)
3. Ten Nights of Dreams, by Natsume Souseki. (Hosted on Aozora.)
I would advocate that you trace the steps I’ve outlined below closely; the reason being that the regular expressions I made can be quite difficult to decipher if you don’t know what they are or how they function.
(A quick note to the regex wizards, if there be any reading this entry: please don’t laugh at my regex functions! I realize that mine are rather primitive in form. One could probably render them more efficiently. And yet, they suffice for the task at hand. That is why I have chosen to leave them as is. Still, should you have any suggestions in mind, I’d be most pleased to have a go at them.)
The Steps
You should be able to complete the following steps in less than 15 minutes. If you’re confused, click on the links; the screen captures are there to help you.
Here we go:
1. Select and copy (shortcut key: CTRL-C) the Japanese text. (At this point, make sure that Notepad2 is using the UTF-8 file encoding! Click File -> Encoding -> UTF-8.)
2. Fire up Editpad Pro. Paste into it what you had copied.
3. Open the Search bar. (Shortcut key: CTRL-F.) The ‘Regular expression’ box should be ticked.
4. In the Search field, insert this code:
[ぁ-ヿ0-90-9]|[。-゚\n.、” \-\s()。※(「」+―]
The ‘Replace’ field should be left empty.
And now, for the cool part: click ‘Highlight’. This has the effect of highlighting those sections the regex function covers.
5. Now press ‘Replace All’. All non-ideographic elements will be removed, leaving you with just the Chinese characters. (Many of them will be duplicates.)
Now copy everything. (Shortcut key: CTRL-A, then CTRL-C.)
(An explanation as to why this step is necessary: If we feed the search process with the original text as is, we increase the possibility of a timeout error. So, by stripping the text of all but kanji, we are able to feed more kanji (thus, obtaining more output as well), yet at the same time we also minimize the possibility of a timeout error.)
6. Head on to WWWJDIC’s Kanji Database.
7. Paste what you had copied into the text field, and hit Search. The kanji characters, their search-codes and definitions will be enumerated, en masse.
Unfortunately, most of what has been generated is superfluous for our purposes. (After all, we only need the kanji characters and their respective definitions.) So we need to remove the extras.
For now, just select and copy all that’s on the page. (Shortcut key: click anywhere, then press CTRL-A, then CTRL-C.)
8. Go back to Editpad Pro, and paste what you had copied. Now, do the whole regex-replace thing again, but use this code instead:
(\s\d[\d\w].*([あ-ん]|[ア-ン])|\sSODA|\sSOD)|(^[(a-z)(A-Z)\n-].*)
9. And we have a kanji list! What’s great is that the kanji and their definitions (inclusive of duplicates) appear in the exact order they do in the original text. If you’re satisfied with this text list, you’re already done. If not, continue.
10. Still in Editpad: We need to make sure the data is formatted nicely, to help OpenOffice Calc import it more easily. So, let’s do another search-replace, but use the following regex functions:
Search: (?<=\p{lo})((\s|-\s)(?!=([a-z]|)))
Replace: \t
What the above function does: it selects only the whitespace between each kanji and its definition, and converts it into a tab once you hit ‘Replace All’; do just that.
Now, select and copy everything.
(Careful readers may notice that I followed the positive lookbehind function with a negative lookahead function. This feels redundant, I know, but, in my humble opinion, it keeps everything safe and dandy, by imposing a check on the characters to both the left and the right of the whitespace.)
11. Fire up OpenOffice Calc, and do a Paste. A ‘Text Import’ window appears. Make sure the ‘Tabs’ box is ticked. (And the ‘Space’ box, unticked.)
12. You’re done! (Take note that the list includes duplicates, so if you want to have them removed, read the following section. Otherwise, skip to the last section, the Side-Notes.)
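For readers who prefer a script, the kanji-stripping idea in steps 4 and 5 can be sketched in Python. This is only a rough equivalent, not the Editpad Pro regex itself: instead of deleting kana, digits and punctuation, it keeps whatever falls inside the CJK Unified Ideographs block (U+4E00 to U+9FFF). That range, and the function name extract_kanji, are my own assumptions; characters outside the block (rare ideographs, or the repetition mark 々) would be dropped.

```python
import re

# Assumption: a "kanji" is any character in the CJK Unified Ideographs
# block (U+4E00 to U+9FFF). The tutorial's regex works the other way
# round, deleting kana, digits and punctuation instead.
KANJI = re.compile(r"[\u4e00-\u9fff]")


def extract_kanji(text: str) -> str:
    """Return only the kanji in `text`, in original order, duplicates kept."""
    return "".join(KANJI.findall(text))


# Opening sentence of Ten Nights of Dreams.
sample = "こんな夢を見た。"
print(extract_kanji(sample))
```

The resulting string can then be pasted into WWWJDIC’s Kanji Database exactly as in step 7.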
Kanji List - Uniques Only (No Duplicates)
(optional — continued from the previous section)
1. Completely highlight the two columns with data. Then, from the menu, select: Data -> Filter -> Standard Filter.
2. Change the field in the Value drop-down box to ‘not empty’.
3. Click ‘More’, then tick the ‘No duplication’ box. Also, tick the ‘Copy results to…’ box, and insert this value into the field below it:
(-undefined-) -> C1:D1
5. A new list, without duplicates, has been created in the 3rd column and 4th column. Scroll up and down the spreadsheet to check.
Now, just delete the data from the 1st and 2nd columns, then move the contents of the 3rd and 4th columns there.
6. And you’re completely done! All that’s left to do is to apply the tips from Part 1 to make your spreadsheet look good.
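The same duplicate removal can also be sketched outside the spreadsheet. The Python below assumes the list has been saved as tab-separated text, one kanji and its definition per line (the format produced in step 10). The function names and the sample definitions are hypothetical placeholders, not actual WWWJDIC output. It drops rows that are exact repeats and keeps the first occurrence, so the order of the original text is preserved.

```python
def parse_rows(tsv_text):
    """Split tab-separated text into (kanji, definition) tuples, skipping blank lines."""
    rows = []
    for line in tsv_text.splitlines():
        if line.strip():
            kanji, _, definition = line.partition("\t")
            rows.append((kanji, definition))
    return rows


def unique_rows(rows):
    """Drop repeated rows, keeping the first occurrence and the original order."""
    # dict keys keep insertion order, so this removes duplicates without sorting.
    return list(dict.fromkeys(rows))


# Placeholder data: the definitions are made up for illustration.
sample = "夢\tdream\n見\tsee\n夢\tdream\n"
for kanji, definition in unique_rows(parse_rows(sample)):
    print(f"{kanji}\t{definition}")
```

Because whole rows are compared, two entries for the same kanji with different definitions would both survive; comparing on the kanji column alone would be a small change if that is what you prefer.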
Side-Notes
I understand that some people may question the necessity of being able to create kanji lists my way, when there are already gloss generators (such as this one) out there.
Why I decided to make my own kanji lists: Firstly, the generators out there are great, but I really dislike having duplicates in my printed kanji lists. I want each kanji to appear only once. Secondly, I’d rather that the kanji list be presented in a way that appeals to me, which is something that can’t be done with list generators. Thirdly, I like having control over each step of the output process.
If I’ve helped you out in any way, do let me know.

April 16th, 2008 / 12:41 pm
I use JWPce for listing Kanji in a text. (free)
Maybe you can include it in your next tutorial.
Open file in JWPce (.txt, .euc, .sjs, .new, .old, .nec, .utf 8, .utf 7, UNICODE)
Press on button Count Kanji
Finished
Gives Count Summary of all text.
Gives a list of all Kanji found in text without repetitions.
You can use filters to include info from KanjiDic, Kunyomi, Onyomi, Meaning, Frequency.
Very easy to use.
Keep up the great work.
Tom Hodgers