[Unicode]   Common Locale Data Repository : Bug Tracking Home | Site Map | Search
 
Modify

CLDR Ticket #10935(reviewing survey)

Opened 3 months ago

Last modified 23 hours ago

Show all old forum posts

Reported by: srl Owned by: tbishop
Component: unknown Data Locale:
Phase: dsub Review: srl
Weeks: Data Xpath:
Xref:

ticket:9845

Description

Per https://unicode.org/cldr/trac/ticket/9845#comment:18 the current ST only shows the current (n) and past (n-1) forums. It doesn't let you view all forums.

  • view all forums, in history

(an idea was DB design change to support all forums in a single DB structure , but this is not necessary)

Attachments

cldr-redact.sql (891 bytes) - added by tbishop 4 days ago.
sql commands for redacting db
cldr-make-forum.sql (1.4 KB) - added by tbishop 4 days ago.
sql commands for creating table
cldr-redact.2.sh (2.6 KB) - added by tbishop 3 days ago.
script for importing and redacting db, revised yet again
cldr-redact.sh (2.6 KB) - added by tbishop 3 days ago.
script for importing and redacting db, revised yet again, same name
Screen Shot 2018-05-19 Smoketest French Forum.png (145.7 KB) - added by tbishop 3 days ago.
example of forum on smoketest after revision

Change History

comment:1 Changed 3 months ago by kristi

  • Keywords Forum added

comment:2 Changed 3 months ago by mark

  • Owner changed from anybody to backend
  • Priority changed from assess to major
  • Status changed from new to accepted
  • Milestone changed from UNSCH to 34

comment:3 Changed 2 weeks ago by tbishop

  • Owner changed from backend to tbishop

comment:4 Changed 2 weeks ago by tbishop

The goal as I now understand it is to combine the old and new forums into a single forum (per locale) and stop moving to a new forum for each version.

This change may effectively solve some other tickets including:

https://unicode.org/cldr/trac/ticket/10264 Reply from old forum post has a dangling parent ID
https://unicode.org/cldr/trac/ticket/10261 Old forum post date/time stamp takes you to the current Forum page
https://unicode.org/cldr/trac/ticket/10323 ST: Filipino forum does not load

This ticket and https://unicode.org/cldr/trac/ticket/10973 (Emoji keywords disputes due to additional keywords) are currently my top priorities.

Last edited 2 weeks ago by tbishop (previous) (diff)

comment:5 Changed 11 days ago by tbishop

We can change FORUM_POSTS.isVersioned to false and change FORUM_POSTS.hasBeta to false by adding "(false, false)" to the constructor as follows:

   public enum Table {
        VOTE_VALUE, VOTE_VALUE_ALT, VOTE_FLAGGED, FORUM_POSTS(false, false), REVIEW_HIDE, REVIEW_POST;

That gives the unversioned table name CLDR_FORUM_POSTS. That should accomplish the main goal of this ticket. Remaining tasks:

  • a one-time operation to populate the renamed table with a copy of CLDR_FORUM_POSTS_34, plus whatever can be imported into it from CLDR_FORUM_POSTS_33, CLDR_FORUM_POSTS_32, ...
  • removal of code for the obsolete menu item Old Forum Posts
  • simplification of forum-related code that needlessly employs cldrVersion, surveyLastVoteVersion, etc.
Last edited 8 days ago by tbishop (previous) (diff)

comment:6 follow-up: ↓ 7 Changed 8 days ago by srl

should we add a column to track which version a post was from? It seems that whether a discussion is current or not may still be relevant.

comment:7 in reply to: ↑ 6 ; follow-up: ↓ 9 Changed 8 days ago by tbishop

Replying to srl:

should we add a column to track which version a post was from? It seems that whether a discussion is current or not may still be relevant.

The posts do have dates already. I don't have enough experience with the forums to judge whether adding the version will be helpful. It doesn't sound hard to implement.

comment:8 follow-up: ↓ 10 Changed 8 days ago by tbishop

Based on cldr-DUMP-20180514.sql, cldr_forum_posts_26 has UTF-8 encoding problems, so I propose to skip it.

cldr_forum_posts_27, cldr_forum_posts_29, cldr_forum_posts_31, and cldr_forum_posts_34 are all empty.

That leaves:

  • cldr_forum_posts_28: 3,862 rows, id = 1, 2, 3, ..., 3862
  • cldr_forum_posts_30: 5,845 rows, id = 1, 2, 3, ..., 5845
  • cldr_forum_posts_32: 12,412 rows, id = 1, 2, 3, ..., 12412
  • cldr_forum_posts_33: 1,520 rows, id = 1, 2, 3, ..., 1520

Need to convert 'id' so it will be unique when the tables are merged. Maybe keep 1..3862 in 28, then add 3862 to id in 30, add 3862+5845 to id in 32, add 3862+5845+12412 to id in 33. Apply same conversion to 'parent' when not -1.

'first_time' is always '1999-12-31 23:59:59' -- does it serve a purpose?

comment:9 in reply to: ↑ 7 Changed 7 days ago by srl

Replying to tbishop:

Replying to srl:

should we add a column to track which version a post was from? It seems that whether a discussion is current or not may still be relevant.

The posts do have dates already. I don't have enough experience with the forums to judge whether adding the version will be helpful. It doesn't sound hard to implement.

The version is an entirely different targetted version of CLDR for the survey tool, so I think it would be good to keep track of which version the discussion pertained to. Anyway, if the tables are being merged, it would preserve the loss of information. Yes, as above, the id is not unique globally .

comment:10 in reply to: ↑ 8 ; follow-up: ↓ 11 Changed 7 days ago by srl

Replying to tbishop:

Based on cldr-DUMP-20180514.sql, cldr_forum_posts_26 has UTF-8 encoding problems, so I propose to skip it.

What kind of UTF-8 problems? Perhaps I didn't do the dump properly.
So, mysql's utf-8 wasn't really utf-8, so I didn't use it originally (this was before utf8mb4 was supported). Text is in a binary blob containing utf-8, see DBUtils.getStringUTF8()

cldr_forum_posts_27, cldr_forum_posts_29, cldr_forum_posts_31, and cldr_forum_posts_34 are all empty.

That leaves:

  • cldr_forum_posts_28: 3,862 rows, id = 1, 2, 3, ..., 3862
  • cldr_forum_posts_30: 5,845 rows, id = 1, 2, 3, ..., 5845
  • cldr_forum_posts_32: 12,412 rows, id = 1, 2, 3, ..., 12412
  • cldr_forum_posts_33: 1,520 rows, id = 1, 2, 3, ..., 1520

Need to convert 'id' so it will be unique when the tables are merged. Maybe keep 1..3862 in 28, then add 3862 to id in 30, add 3862+5845 to id in 32, add 3862+5845+12412 to id in 33. Apply same conversion to 'parent' when not -1.

sounds reasonable.

'first_time' is always '1999-12-31 23:59:59' -- does it serve a purpose?

I think first_time was supposed to have the earliest and latest time of a thread, but is not used in the code. It could be removed.

comment:11 in reply to: ↑ 10 Changed 7 days ago by tbishop

Replying to srl:

Replying to tbishop:

Based on cldr-DUMP-20180514.sql, cldr_forum_posts_26 has UTF-8 encoding problems, so I propose to skip it.

What kind of UTF-8 problems? Perhaps I didn't do the dump properly.
So, mysql's utf-8 wasn't really utf-8, so I didn't use it originally (this was before utf8mb4 was supported). Text is in a binary blob containing utf-8, see DBUtils.getStringUTF8()

For parts of cldr_forum_posts_26 only, cldr-DUMP-20180514.sql appears to have Latin1. Examples: "Costaricaanse colón"; "górnomaryjski"; "1st quotation level: “...”", "make a change here and add letter &quot;\ã&quot; to the auxiliary letters for ET.<p><p>[\á \à \â \å ? \ã \æ \ç \é \è \ê \ë ? \í \ì \î \ï ? ñ ó ò ? ô ø ? œ ú ù û ?]"; "We can use °F instead".

Some other parts of cldr_forum_posts_26 don't make sense as Latin1. Examples: "Modern Standard Arabic. The correct value is: modern\á štandardn\á arab?ina"; "As per the source correct translation is �Ǩ͟ ���ǰ � ���Dz�ǰ �Ͱ��� ���". Those parts aren't valid UTF-8 either. Restoration might be possible, but would it be worth the investment? These posts are from 2014. The goal has been described as preservation of "a few years" of forum history.

Except for cldr_forum_posts_26, all of cldr-DUMP-20180514.sql appears to have correct UTF-8; at least there are no invalid byte sequences, and scrolling through it I see correct-looking non-ASCII text like "zamračený muž", though I haven't viewed all 409 megabytes.

Removal of the first_time column, and addition of a cldr_version column, should both be straightforward to implement if desired. Questions:

  • Where would cldr_version be displayed in the interface?
  • For new posts, would it be set to SurveyMain.getNewVersion()?
Last edited 7 days ago by tbishop (previous) (diff)

comment:12 Changed 7 days ago by tbishop

[Deleted old version -- see attached cldr-make-forum.sql for current version]

Last edited 4 days ago by tbishop (previous) (diff)

comment:13 Changed 6 days ago by tbishop

Code changes for this ticket are in branches/tbishop/t10935_forum.

Almost finished.

Still working on storing the CLDR version number for new posts, and displaying the CLDR version for all posts.

I've just realized that SurveyMain.getNewVersion() is a string, not an integer, though it generally (always?) represents an integer like "33". The table column should probably also be a string; I'll change it from int(11) to varchar(122), OK?

mysql gives warnings like these on cldr-DUMP-20180514.sql:

Warning (Code 1300): Invalid utf8 character string: 'F36E27'

I need to confirm these are limited to cldr_forum_posts_26, or fix them if they affect the new table.

cldr-DUMP-20180514.sql has some utf8 that ideally might be utf8mb4:
/*!40101 SET NAMES utf8 */;
...
/*!40101 SET character_set_client = utf8 */;
...
Let's check the mysql configuration on the server. Generally mysql max-3-byte so-called utf8 should be abolished in favor of utf8mb4. Using blobs can help to reduce related problems. Still any occurrence of "utf8" in mysql is worrisome.

Last edited 3 days ago by tbishop (previous) (diff)

comment:14 Changed 6 days ago by tbishop

cldr-DUMP-20180514.sql contains only valid UTF-8, with the exception of the table cldr_forum_posts_26.

The mysql "Invalid utf8" warnings do occur for other parts of the sql file, for valid 4-byte UTF-8 codes which would be valid utf8mb4. In spite of those warnings the import appears to works correctly since the UTF-8 is in blobs.

On the smoketest server, tomcat/conf/context.xml includes url="jdbc:mysql://localhost:3306/cldrdb?character_set_server=utf8mb4&character_set_connection=utf8mb4&characterSetResults=utf8" and a comment referencing http://stackoverflow.com/questions/24389862/mysql-connectorj-character-set-results-does-not-support-utf8mb4 ... While the server configuration merits further study, it looks OK for this ticket, though I wonder where character_set_system comes from for "show variables":

| character_set_client | utf8mb4 |
| character_set_connection | utf8mb4 |
| character_set_database | utf8mb4 |
| character_set_filesystem | binary |
| character_set_results | utf8mb4 |
| character_set_server | utf8mb4 |
| character_set_system | utf8

Last edited 5 days ago by tbishop (previous) (diff)

Changed 4 days ago by tbishop

sql commands for redacting db

Changed 4 days ago by tbishop

sql commands for creating table

comment:15 Changed 4 days ago by pedberg

  • Priority changed from major to critical
  • Type changed from unknown to survey

comment:16 Changed 4 days ago by tbishop

Plan to run cldr-redact.sh and cldr-make-forum.sql​ on smoketest

Plan to run only cldr-make-forum.sql​ on st.unicode.org

Changed 3 days ago by tbishop

script for importing and redacting db, revised yet again

Changed 3 days ago by tbishop

script for importing and redacting db, revised yet again, same name

comment:17 Changed 3 days ago by tbishop

I ran the scripts on smoketest and the results look OK. Next I'll commit the changes from branches/tbishop/t10935_forum to trunk and test the new forum table on smoketest.

comment:18 Changed 3 days ago by tbishop

Committed to trunk as revision 14125 (<http://unicode.org/cldr/trac/changeset/14125>)

http://unicode.org/cldr/trac/build successful.

Tested on smoketest.

Go to French. No more "Old Forum Posts". Click "Forum". Looks good! Versions are shown with posts: [v28], [v30], [v32], and [v33] (not in French but in German) all occur as expected. Created new forum post for Afar. Filipino forum doesn't crash, and it includes posts from multiple versions.

Last edited 3 days ago by tbishop (previous) (diff)

comment:19 Changed 3 days ago by tbishop

  • Status changed from accepted to reviewing
  • Review set to srl

Changed 3 days ago by tbishop

example of forum on smoketest after revision

comment:20 Changed 2 days ago by tbishop

For testing on smoketest, I suggest especially testing links from posts to data points and vice versa, and forum-related emails. I'm not very familiar with those features or how to test them. The guide has: "To create a Forum post linked to a special data point, click on the item ...", and "Another way to check for posts that may need your attention is to review email notifications ..."

http://cldr.unicode.org/index/survey-tool/guide#TOC-Forum

BTW there's a lot of forum-related code that's probably not used anymore, maybe obsolete implementations related to old table formats, like sf_posts. The unused code and outdated tables should be cleaned up soon, for the sake of clarity and efficiency. I'm not doing it immediately, for the sake of caution.

Last edited 2 days ago by tbishop (previous) (diff)

comment:21 Changed 23 hours ago by tbishop

Ran on st.unicode.org:

mysql surveytool --user=surveytool -p --show-warnings --default-character-set=utf8mb4 < cldr-make-forum.sql

New table cldr_forum_posts has 23639 rows as expected.

Here's the changeset again: http://unicode.org/cldr/trac/changeset/14125

Ready for production!

View

Add a comment

Modify Ticket

Action
as reviewing
Author


E-mail address and user name can be saved in the Preferences.

 
Note: See TracTickets for help on using tickets.