Problems with groups in Java regex
I'm quite sure this has a simple solution, but I've been searching for 3 hours and haven't managed to find anything that helps me.
I'm writing a parser in Java using regex, and I'm supposed to be able to match certain decided words, numbers 1-10000, and hex color codes. It's going great with matching the words, but the reader isn't reading numbers and color codes as a whole. For example, it reads the input:
down. color #000000.
as:
reading: down returning: down
reading: . returning: dot
reading: returning: whitespace
reading: color returning: color
reading: returning: whitespace
reading: # returning: nothing
reading: 0 returning: number
reading: a returning: nothing
reading: f returning: nothing
reading: 2 returning: number
reading: 3 returning: number
reading: 4 returning: number
reading: . returning: dot
So it's able to read the words color and down as a whole, as I want, but it doesn't read the color code #000000 as a whole. Ideally I want those 7 lines to be:
reading: #0af234 returning: colorcode
I have:

    String stringTokens = "down|color|(\\s|\\t)+|\\n|\b[1-9][0-9]{0,3}\b|10000|^(#)([a-fA-F0-9]{6})$";
    Pattern stringPattern = Pattern.compile(stringTokens, Pattern.CASE_INSENSITIVE);
    Matcher m = stringPattern.matcher(input);

Then:

    while (m.find()) {
        if (m.start() != inputPos) {
            tokens.add(new Token(lineNo, TokenType.INVALID));
        }
        if (m.group().matches("^(#)([a-fA-F0-9]{6})$"))
            tokens.add(new Token(lineNo, TokenType.COLORCODE));
        else if (m.group().equals("."))
            tokens.add(new Token(lineNo, TokenType.DOT));
        else if (m.group().matches("down"))
            tokens.add(new Token(lineNo, TokenType.DOWN));
        else if (m.group().matches("color"))
            tokens.add(new Token(lineNo, TokenType.COLOR));
        else if (Character.isDigit(m.group().charAt(0)))
            tokens.add(new Token(lineNo, TokenType.NUMBER, Integer.parseInt(m.group())));
        else if (m.group().matches("\\n")) {
            tokens.add(new Token(lineNo, TokenType.WHITESPACE));
            lineNo++;
        } else if (m.group().matches("(\\s|\\t)+"))
            tokens.add(new Token(lineNo, TokenType.WHITESPACE));
        inputPos = m.end();
    }

So my question is basically:
How do I manage to read the groups for color codes and numbers together? When I print out m.group() for each reading now, it returns single digits. Yet in code I have looked at where digits are read in the same format, with the regex above or just [0-9]+, it seems simple to me: each group should be read as a whole number.
I have tried to use something along the lines of m.group(1) and m.group(2), used word boundaries (which I don't understand completely) and the ^$ format, but nothing seems to work to read the token as a whole.
I hope I managed to keep the code I copied simple without missing anything important, and that someone can help me figure this simple (it must be?!) thing out. Thank you! :)
So you have this regexp:

    down|color|(\\s|\\t)+|\\n|\b[1-9][0-9]{0,3}\b|10000|^(#)([a-fA-F0-9]{6})$

that can be decomposed as:
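- down
- color
- (\\s|\\t)+: one or more \s (OK, the whitespace class) or \t (not needed, since \t is already included in \s)
- \\n (note that this is also included in \s, so the whitespace alternative before it can swallow newlines)
- \b[1-9][0-9]{0,3}\b: OK, here you try to use a word boundary, but without taking into account that backslashes need to be escaped in a Java string. It should be \\b; a plain \b in a Java string literal is the backspace character. Not sure why you want to use that?
- 10000: the previous pattern only covers 1 to 9999, so this alternative is needed, but it only gets a chance once the word boundaries actually work; without them, 1000 is matched first and the trailing 0 is left over.
- ^(#)([a-fA-F0-9]{6})$: the group (#) seems unnecessary, just use #. With ^...$ you're forcing the whole input to be a single color code such as #0af234, so I'd remove the anchors.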
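To make the two main points above concrete, here is a minimal sketch (the class and field names are made up for illustration) that compares "\b" with "\\b" in a Java string literal, and an anchored color-code pattern with an unanchored one:

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // "\b" in a Java string literal is the backspace character (U+0008),
    // so the regex engine looks for literal backspaces around the digits.
    static final Pattern BACKSPACE = Pattern.compile("\b[1-9][0-9]{0,3}\b");

    // "\\b" reaches the regex engine as \b, the word-boundary assertion.
    static final Pattern BOUNDARY = Pattern.compile("\\b[1-9][0-9]{0,3}\\b");

    // Anchored: can only match if the color code is the entire input.
    static final Pattern ANCHORED = Pattern.compile("^#[a-fA-F0-9]{6}$");

    // Unanchored: can be found anywhere in the input.
    static final Pattern UNANCHORED = Pattern.compile("#[a-fA-F0-9]{6}");

    // True if the pattern matches somewhere in the input.
    static boolean found(Pattern p, String input) {
        return p.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(found(BACKSPACE, "go 42 now"));       // false
        System.out.println(found(BOUNDARY, "go 42 now"));        // true
        System.out.println(found(ANCHORED, "color #0af234."));   // false
        System.out.println(found(UNANCHORED, "color #0af234.")); // true
    }
}
```

This would explain why the color-code alternative never fires in the middle of a line: the anchors tie it to the start and end of the whole input.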
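How do you match the dot? None of the alternatives in your pattern covers it.
Since you need to match again anyway to distinguish the different types of tokens, why don't you use multiple regexps (one for each token, or no regexp at all for the literals) and check each one against the head of the string you are parsing?
If one matches, you have a new token and can consume the matched part of the string.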
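For reference, here is one possible way the combined pattern could look with those fixes applied. It is only a sketch: the class name is made up, the \\. alternative for the dot is an addition, and the whitespace alternative is narrowed to [ \\t]+ (an assumption on my side) so that \\n stays a separate token for line counting.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CombinedPatternDemo {
    // Keywords, dot, newline, blanks, color code, then numbers 1-10000.
    // Backslashes are doubled so the regex engine receives \. \n \t and \b.
    static final Pattern TOKENS = Pattern.compile(
        "down|color|\\.|\\n|[ \\t]+|#[a-fA-F0-9]{6}|\\b(?:10000|[1-9][0-9]{0,3})\\b",
        Pattern.CASE_INSENSITIVE);

    // Collects the text of every match, in order of appearance.
    static List<String> tokenize(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKENS.matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // The color code now comes back as one token instead of seven.
        System.out.println(tokenize("down. color #0af234."));
    }
}
```

Note that find() silently skips text that matches no alternative, so the m.start() != inputPos check from the question is still needed to report invalid input.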
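A rough sketch of that idea, assuming one pattern per token type tried in a fixed order against the current position. The token names, the HeadTokenizer class, and the "TYPE:text" string output are placeholders for illustration, not the Token and TokenType classes from the question.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HeadTokenizer {
    // Hypothetical rule table; insertion order is the order rules are tried in.
    static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("DOWN", Pattern.compile("down\\b", Pattern.CASE_INSENSITIVE));
        RULES.put("COLOR", Pattern.compile("color\\b", Pattern.CASE_INSENSITIVE));
        RULES.put("DOT", Pattern.compile("\\."));
        RULES.put("WHITESPACE", Pattern.compile("\\s+"));
        RULES.put("COLORCODE", Pattern.compile("#[a-fA-F0-9]{6}"));
        RULES.put("NUMBER", Pattern.compile("(?:10000|[1-9][0-9]{0,3})\\b"));
    }

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            boolean matched = false;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                // Restrict the matcher to the unconsumed tail of the input;
                // lookingAt() only succeeds if the match starts exactly at pos.
                Matcher m = rule.getValue().matcher(input).region(pos, input.length());
                if (m.lookingAt() && m.end() > pos) {
                    tokens.add(rule.getKey() + ":" + m.group());
                    pos = m.end(); // consume the matched part
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                // No rule applies here: report one invalid character and move on.
                tokens.add("INVALID:" + input.charAt(pos));
                pos++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("down. color #0af234."));
    }
}
```

Because the rule that matched already tells you the token type, there is no need to run matches() a second time on m.group(), and invalid input shows up explicitly instead of being skipped by find().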